使用详细术语的COQ引理名称的深刻一代

论文标题

使用详细术语的COQ引理名称的深刻一代

Deep Generation of Coq Lemma Names Using Elaborated Terms

论文作者

Nie, Pengyu, Palmskog, Karl, Li, Junyi Jessy, Gligoric, Milos

论文摘要

对于开发人员来说，必须有效地理解，审查和修改大型软件项目中的源代码，需要用于命名，间隔和其他本质上具有风格属性的编码约定。基于证明助理（例如COQ）的验证项目的一致惯例，随着项目规模和范围的增长，重要性会增加。虽然可以以高昂的代价记录和执行惯例，但新兴的方法通过在大型代码语料库上应用统计语言模型来自动学习并以类似Java的语言提出惯用名称。但是，由于其强大的语言扩展设施以及类型检查和计算的融合，COQ是自动学习技术的具有挑战性的目标。我们提出了学习和建议COQ项目的引理名称的新颖生成模型。我们的模型基于多输入神经网络，是第一个利用Coq's Lexer（引理语句中的令牌），Parser（语法树）和内核（精心说明）的句法和语义信息的模型；关键见解是，从详细术语中学习可以大大提高模型性能。我们在工具链中实现了模型，称为公鸡，并将其应用于从数学组成部分的大型代码中，以其严格的编码约定而闻名。我们的结果表明，公鸡在建议引理名称方面大大优于基准，强调了使用多输入模型和详细术语的重要性。

Coding conventions for naming, spacing, and other essentially stylistic properties are necessary for developers to effectively understand, review, and modify source code in large software projects. Consistent conventions in verification projects based on proof assistants, such as Coq, increase in importance as projects grow in size and scope. While conventions can be documented and enforced manually at high cost, emerging approaches automatically learn and suggest idiomatic names in Java-like languages by applying statistical language models on large code corpora. However, due to its powerful language extension facilities and fusion of type checking and computation, Coq is a challenging target for automated learning techniques. We present novel generation models for learning and suggesting lemma names for Coq projects. Our models, based on multi-input neural networks, are the first to leverage syntactic and semantic information from Coq's lexer (tokens in lemma statements), parser (syntax trees), and kernel (elaborated terms) for naming; the key insight is that learning from elaborated terms can substantially boost model performance. We implemented our models in a toolchain, dubbed Roosterize, and applied it on a large corpus of code derived from the Mathematical Components family of projects, known for its stringent coding conventions. Our results show that Roosterize substantially outperforms baselines for suggesting lemma names, highlighting the importance of using multi-input models and elaborated terms.

下载PDF全文

下载文献需遵守相关版权规定

论文标题