Paper Title
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Paper Authors
Paper Abstract
We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with a Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.
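The abstract describes CodeBERT as taking bimodal NL-PL input and producing general-purpose joint representations. The sketch below shows one minimal way to obtain such a representation with the Hugging Face `transformers` library; the checkpoint name `microsoft/codebert-base` and the choice of the first-token vector as the pair representation are assumptions made for illustration, not details stated in this abstract.

```python
# Minimal sketch: embedding an NL-PL pair with CodeBERT, e.g. as a starting
# point for natural language code search. The checkpoint name is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

nl = "return the maximum value in a list"          # natural language query
code = "def max_value(xs):\n    return max(xs)"    # candidate code snippet

# Feed the bimodal input as a sentence pair: [CLS] NL [SEP] code [SEP]
inputs = tokenizer(nl, code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Take the first-token hidden state as a joint NL-PL representation; a
# fine-tuned task head (e.g. for code search) would be placed on top of it.
pair_embedding = outputs.last_hidden_state[:, 0, :]
print(pair_embedding.shape)  # torch.Size([1, 768])
```

In a fine-tuning setup such as the code search task mentioned above, this pair representation would typically feed a small classification or ranking head trained end-to-end with the encoder.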