Paper Title
ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
Paper Authors
Paper Abstract
Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.
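The abstract names two pre-training objectives. The first, span-corruption language modeling, is the T5-style denoising setup: contiguous spans of a monolingual NL or PL sequence are replaced by sentinel tokens, and the model learns to regenerate the dropped spans. Below is a minimal illustrative sketch of that recipe; the function name `span_corrupt`, the sentinel format, and hyperparameters such as `corruption_rate` and `mean_span_len` are assumptions for illustration, not the paper's exact implementation.

```python
import random

SENTINELS = [f"<extra_id_{i}>" for i in range(100)]  # T5-style sentinel tokens (illustrative)

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Mask random contiguous spans; return (encoder input, decoder target).

    Sketch of T5-style span corruption: each dropped span becomes a single
    sentinel in the input, and the target lists each sentinel followed by
    the tokens it replaced. Hyperparameters here are illustrative.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * corruption_rate))
    inputs, targets = [], []
    i, sid, masked = 0, 0, 0
    while i < len(tokens):
        if masked < n_mask and rng.random() < corruption_rate:
            span = max(1, int(rng.expovariate(1 / mean_span_len)))
            span = min(span, len(tokens) - i, n_mask - masked)
            inputs.append(SENTINELS[sid])          # one sentinel stands in for the span
            targets.append(SENTINELS[sid])
            targets.extend(tokens[i:i + span])      # target reproduces the dropped tokens
            i += span
            masked += span
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(SENTINELS[sid])                  # closing sentinel ends the target
    return inputs, targets

# Monolingual PL example; the same objective applies to NL text.
code = "def add ( a , b ) : return a + b".split()
enc_in, dec_out = span_corrupt(code)
print(enc_in)   # e.g. ['def', '<extra_id_0>', '(', 'a', ...]
print(dec_out)  # e.g. ['<extra_id_0>', 'add', '<extra_id_1>', ...]
```

The second objective, pivot-based translation language modeling, is described in the abstract as relying on parallel NL and PL data; conceptually, English serves as a pivot so that parallel NL-to-English and English-to-PL pairs connect the many NLs to the PLs during training. The sketch above only covers the span-corruption side.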