Title

Imitation Attacks and Defenses for Black-box Machine Translation Systems

Authors

Eric Wallace, Mitchell Stern, Dawn Song

Abstract

Adversaries may look to steal or attack black-box NLP systems, either for financial gain or to exploit model errors. One setting of particular interest is machine translation (MT), where models have high commercial value and errors can be costly. We investigate possible exploits of black-box MT systems and explore a preliminary defense against such threats. We first show that MT systems can be stolen by querying them with monolingual sentences and training models to imitate their outputs. Using simulated experiments, we demonstrate that MT model stealing is possible even when imitation models have different input data or architectures than their target models. Applying these ideas, we train imitation models that reach within 0.6 BLEU of three production MT systems on both high-resource and low-resource language pairs. We then leverage the similarity of our imitation models to transfer adversarial examples to the production systems. We use gradient-based attacks that expose inputs which lead to semantically-incorrect translations, dropped content, and vulgar model outputs. To mitigate these vulnerabilities, we propose a defense that modifies translation outputs in order to misdirect the optimization of imitation models. This defense degrades the adversary's BLEU score and attack success rate at some cost in the defender's BLEU and inference speed.
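The stealing pipeline the abstract describes (query a black-box system with monolingual text, then train on its outputs) is essentially sequence-level distillation. Below is a minimal sketch of that loop in the paper's simulated setting, where a local model stands in for the production system and is observed only through its decoded strings. It assumes the HuggingFace transformers library and PyTorch; the Marian checkpoint, data, and hyperparameters are illustrative, not the authors' setup (the paper shows stealing works even when the imitation model's data and architecture differ from the victim's).

```python
# Minimal sketch of MT model stealing via imitation, assuming `transformers`
# and `torch`. The victim is a local model treated as a black box: only its
# output strings are observed, as with a real translation API.
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"                    # illustrative checkpoint
tokenizer = MarianTokenizer.from_pretrained(name)
victim = MarianMTModel.from_pretrained(name).eval()    # simulated black box
imitation = MarianMTModel.from_pretrained(name)        # the model we train

def query_blackbox_mt(sentence: str) -> str:
    """Query the victim; only the decoded translation is observed."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = victim.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# 1. Label monolingual source sentences with the black box's translations.
monolingual = ["The weather is nice today.", "She bought a new car."]
pairs = [(src, query_blackbox_mt(src)) for src in monolingual]

# 2. Train the imitation model to reproduce those outputs.
optimizer = torch.optim.AdamW(imitation.parameters(), lr=5e-5)
imitation.train()
for src, tgt in pairs:
    batch = tokenizer(src, text_target=tgt, return_tensors="pt")
    loss = imitation(**batch).loss   # cross-entropy against the stolen labels
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```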
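Adversarial examples are then crafted with gradients on the white-box imitation model and transferred to the production system. The sketch below shows one simple instance of a gradient-based attack: a single greedy HotFlip-style token swap that increases the imitation model's loss via a first-order approximation. The paper's actual attack objectives (targeted mistranslations, dropped content, vulgar outputs) are more specific than this. It reuses `imitation`, `tokenizer`, and `query_blackbox_mt` from the sketch above, and the `embed_scale` handling assumes the transformers Marian implementation.

```python
# Rough sketch of a first-order (HotFlip-style) token swap on the imitation
# model; the perturbed source is what one would send to the production system.
import torch

def hotflip_one_swap(model, tokenizer, src: str, tgt: str) -> str:
    emb = model.get_input_embeddings().weight          # (vocab, dim)
    scale = model.model.encoder.embed_scale            # Marian scales embeddings internally
    batch = tokenizer(src, text_target=tgt, return_tensors="pt")
    x = (emb[batch["input_ids"]] * scale).detach().requires_grad_(True)
    loss = model(inputs_embeds=x, attention_mask=batch["attention_mask"],
                 labels=batch["labels"]).loss
    loss.backward()
    g = x.grad[0]                                      # (seq, dim)
    # First-order change in loss from swapping position i to vocab token v:
    # (e_v - e_i) . g_i, computed for all (i, v) at once.
    gain = g @ (emb * scale).T - (g * x[0].detach()).sum(-1, keepdim=True)
    pos, tok = divmod(int(gain.argmax()), gain.size(1))
    ids = batch["input_ids"][0].clone()
    ids[pos] = tok
    return tokenizer.decode(ids, skip_special_tokens=True)

# e.g.:
# src = "The weather is nice today."
# adv_src = hotflip_one_swap(imitation, tokenizer, src, query_blackbox_mt(src))
```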
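Finally, the defense changes what the API returns so that training on it misleads an imitator, at some cost in output quality and speed. Below is a rough sketch of that gradient-misdirection idea: among beam candidates of similar quality, return the one whose training gradient (computed on the defender's own copy of the model) is least aligned with the gradient the normal output would induce. The beam-based candidate generation and the cosine criterion here are assumptions for illustration, not the paper's exact procedure.

```python
# Rough sketch of gradient misdirection: serve an alternate translation whose
# training gradient is misaligned with that of the normal output.
import torch

def loss_grad(model, tokenizer, src: str, out: str) -> torch.Tensor:
    """Flattened gradient of the training loss for the pair (src, out)."""
    model.zero_grad()
    batch = tokenizer(src, text_target=out, return_tensors="pt")
    model(**batch).loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()
                      if p.grad is not None])

def misdirected_output(model, tokenizer, src: str) -> str:
    # Generate the normal translation plus alternate beam candidates.
    inputs = tokenizer(src, return_tensors="pt")
    beams = model.generate(**inputs, num_beams=5, num_return_sequences=5,
                           max_new_tokens=64)
    candidates = [tokenizer.decode(b, skip_special_tokens=True) for b in beams]
    original, alternates = candidates[0], candidates[1:]

    g_orig = loss_grad(model, tokenizer, src, original)
    # Serve the alternate whose gradient is least aligned with the original's;
    # an imitator trained on it is pushed away from the true optimum.
    sims = [torch.nn.functional.cosine_similarity(
                g_orig, loss_grad(model, tokenizer, src, alt), dim=0)
            for alt in alternates]
    return alternates[int(torch.stack(sims).argmin())]
```

Scoring each candidate requires extra forward and backward passes per query, which is where the inference-speed cost noted in the abstract comes from.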
