Beyond the C: Retargetable Decompilation using Neural Machine Translation

Paper title

Beyond the C: Retargetable Decompilation using Neural Machine Translation

Authors

Iman Hosseini, Brendan Dolan-Gavitt

Abstract

The problem of reversing the compilation process, decompilation, is an important tool in reverse engineering of computer software. Recently, researchers have proposed using techniques from neural machine translation to automate the process in decompilation. Although such techniques hold the promise of targeting a wider range of source and assembly languages, to date they have primarily targeted C code. In this paper we argue that existing neural decompilers have achieved higher accuracy at the cost of requiring language-specific domain knowledge such as tokenizers and parsers to build an abstract syntax tree (AST) for the source language, which increases the overhead of supporting new languages. We explore a different tradeoff that, to the extent possible, treats the assembly and source languages as plain text, and show that this allows us to build a decompiler that is easily retargetable to new languages. We evaluate our prototype decompiler, Beyond The C (BTC), on Go, Fortran, OCaml, and C, and examine the impact of parameters such as tokenization and training data selection on the quality of decompilation, finding that it achieves comparable decompilation results to prior work in neural decompilation with significantly less domain knowledge. We will release our training data, trained decompilation models, and code to help encourage future research into language-agnostic decompilation.
