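Corrigendum and Supplement to "Improve Language Modelling for Code Completion through Learning General Token Repetition of Source Code (with Optimized Memory)"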

Paper title

Corrigendum and Supplement to "Improve Language Modelling for Code Completion through Learning General Token Repetition of Source Code (with Optimized Memory)"

Author

Yang, Yixiao

Abstract

This paper was written because I received several emails reporting that it is hard to achieve good results when applying token repetition learning techniques. If REP (proposed by me) or Pointer-Mixture (proposed by Jian Li) is applied directly to source code to decide the repetition of every token, model performance decreases sharply. Because we generate the token sequence by a pre-order traversal of the Abstract Syntax Tree (AST), tokens that correspond to AST grammar are ignored when learning token repetition. Non-grammar tokens come in several kinds: strings, chars, numbers, and identifiers. We tried to learn the repetition pattern of each kind and found that only identifiers exhibit token repetition. Identifiers themselves come in several kinds, such as variables, package names, method names, simple types, qualified types, and qualified names. Some of these kinds, namely package names, method names, qualified names, and qualified types, are unlikely to be repeated, so we also ignore them when learning token repetition. This step is crucial, but this implementation trick was not clearly presented in the original paper because we considered it trivial and thought that too many details might distract readers. We provided the GitHub address of our model in the conference paper, and readers can check the description and implementation in that repository. In this paper, we therefore supplement the already published papers with these important implementation optimization details.
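The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation from the author's repository: the `Token` class, the kind labels, and the `repetition_targets` function are hypothetical names introduced here. The set of kinds kept for repetition learning (variables and simple types) is only our reading of the abstract, obtained by removing the kinds it names as unlikely to repeat from the identifier kinds it lists.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumption: kind labels are hypothetical. The abstract excludes grammar tokens,
# strings, chars, numbers, package names, method names, qualified names and
# qualified types; what remains of its identifier list is assumed to be kept.
REPEATABLE_KINDS = {"variable", "simple_type"}


@dataclass
class Token:
    text: str
    kind: str  # e.g. "grammar", "string", "number", "variable", "method_name"


def repetition_targets(tokens: List[Token]) -> List[Optional[int]]:
    """For each token, return the index of its most recent earlier occurrence,
    or None if the token is excluded from repetition learning or not yet seen."""
    last_seen: Dict[str, int] = {}
    targets: List[Optional[int]] = []
    for index, token in enumerate(tokens):
        if token.kind not in REPEATABLE_KINDS:
            # Excluded kinds never receive a repetition label.
            targets.append(None)
            continue
        targets.append(last_seen.get(token.text))
        last_seen[token.text] = index
    return targets


example = [
    Token("MethodInvocation", "grammar"),
    Token("count", "variable"),
    Token("println", "method_name"),
    Token("count", "variable"),
    Token("println", "method_name"),
    Token('"count"', "string"),
]
print(repetition_targets(example))  # [None, None, None, 1, None, None]
```

In this sketch only the second `count` receives a repetition target. The repeated `println` is a method name and the string literal is not an identifier, so both are skipped, which mirrors the exclusion the abstract calls crucial.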
