Paper Title
GraphCodeBERT: Pre-training Code Representations with Data Flow
Paper Authors
Paper Abstract
Pre-trained models for programming languages have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, and code summarization. However, existing pre-trained models regard a code snippet as a sequence of tokens, ignoring the inherent structure of code, which provides crucial code semantics and would enhance the code understanding process. We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code. Instead of taking a syntactic-level structure of code such as the abstract syntax tree (AST), we use data flow in the pre-training stage, a semantic-level structure of code that encodes the "where-the-value-comes-from" relation between variables. Such a semantic-level structure is neat and does not bring the unnecessarily deep hierarchy of the AST, a property that makes the model more efficient. We develop GraphCodeBERT based on the Transformer. In addition to the masked language modeling task, we introduce two structure-aware pre-training tasks: one predicts code structure edges, and the other aligns representations between source code and code structure. We implement the model efficiently with a graph-guided masked attention function that incorporates the code structure. We evaluate our model on four tasks: code search, clone detection, code translation, and code refinement. Results show that the code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks. We further show that the model prefers structure-level attention over token-level attention in the code search task.
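To make the abstract's graph-guided masked attention more concrete, below is a minimal sketch, not the authors' released implementation. It assumes hypothetical names (`build_attention_mask`, `dataflow_edges`, `node_to_code`) and illustrates the idea that variable nodes from the data flow graph are appended after the code tokens, while a boolean mask allows node-token attention only for aligned pairs and node-node attention only along data flow edges (made symmetric here for simplicity).

```python
# Sketch of a graph-guided attention mask: True = attention allowed.
# All names and index conventions here are illustrative assumptions.
import numpy as np

def build_attention_mask(num_code_tokens, num_nodes, dataflow_edges, node_to_code):
    """Build an (L, L) boolean attention mask.

    num_code_tokens : number of source-code (and special) token positions
    num_nodes       : number of variable nodes from the data flow graph
    dataflow_edges  : (src_node, dst_node) "where-the-value-comes-from" edges
    node_to_code    : (node, code_token) pairs linking a variable node to the
                      code token it was identified from
    """
    total = num_code_tokens + num_nodes
    mask = np.zeros((total, total), dtype=bool)

    # Code/special tokens attend to each other freely, as in a plain Transformer.
    mask[:num_code_tokens, :num_code_tokens] = True

    # A variable node and its originating code token may attend to each other.
    for node, tok in node_to_code:
        mask[num_code_tokens + node, tok] = True
        mask[tok, num_code_tokens + node] = True

    # Two variable nodes may attend to each other only along a data flow edge
    # (symmetric here; the paper defines the direction more precisely).
    for src, dst in dataflow_edges:
        mask[num_code_tokens + src, num_code_tokens + dst] = True
        mask[num_code_tokens + dst, num_code_tokens + src] = True

    # Every position attends to itself.
    np.fill_diagonal(mask, True)
    return mask

# Toy snippet "x = a ; y = x + 1" with 9 code tokens.
# Variable nodes: x@token0, a@token2, y@token4, x@token6.
# Value-flow edges: a -> x, first x -> second x, second x -> y.
mask = build_attention_mask(
    num_code_tokens=9,
    num_nodes=4,
    dataflow_edges=[(1, 0), (0, 3), (3, 2)],
    node_to_code=[(0, 0), (1, 2), (2, 4), (3, 6)],
)
print(mask.shape)  # (13, 13)
```

In a Transformer, such a mask is typically applied by setting the attention scores of disallowed positions to a large negative value before the softmax, so the data flow graph constrains information flow without changing the rest of the architecture.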