Paper Title
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Paper Authors
Paper Abstract
We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with a Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both bimodal data of NL-PL pairs and unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation tasks. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NL-PL probing.
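The abstract describes CodeBERT as taking bimodal NL-PL input and producing general-purpose joint representations. The sketch below shows one minimal way to obtain such a representation with the Hugging Face `transformers` library; the checkpoint name `microsoft/codebert-base` and the choice of the first-token vector as the pair representation are assumptions made for illustration, not details stated in this abstract.

```python
# Minimal sketch: embedding an NL-PL pair with CodeBERT, e.g. as a starting
# point for natural language code search. The checkpoint name is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

nl = "return the maximum value in a list"          # natural language query
code = "def max_value(xs):\n    return max(xs)"    # candidate code snippet

# Feed the bimodal input as a sentence pair: [CLS] NL [SEP] code [SEP]
inputs = tokenizer(nl, code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Take the first-token hidden state as a joint NL-PL representation; a
# fine-tuned task head (e.g. for code search) would be placed on top of it.
pair_embedding = outputs.last_hidden_state[:, 0, :]
print(pair_embedding.shape)  # torch.Size([1, 768])
```

In a fine-tuning setup such as the code search task mentioned above, this pair representation would typically feed a small classification or ranking head trained end-to-end with the encoder.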