Paper Title

Towards Demystifying Dimensions of Source Code Embeddings

Authors

Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour

Abstract

Source code representations are key to applying machine learning techniques for processing and analyzing programs. A popular approach to representing source code is neural source code embeddings, which represent programs as high-dimensional vectors computed by training deep neural networks on a large volume of programs. Although these embeddings have been successful, little is known about the contents of the vectors and their characteristics. In this paper, we present our preliminary results towards better understanding the contents of code2vec neural source code embeddings. In particular, in a small case study, we use the code2vec embeddings to create binary SVM classifiers and compare their performance with that of classifiers built on handcrafted features. Our results suggest that the handcrafted features can perform very close to the high-dimensional code2vec embeddings, and that information gain is more evenly distributed across dimensions in the code2vec embeddings than in the handcrafted features. We also find that the code2vec embeddings are more resilient to the removal of dimensions with low information gain than the handcrafted features. We hope our results serve as a stepping stone toward principled analysis and evaluation of these code representations.
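To make the notion of per-dimension information gain more concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes each continuous dimension is binarized at its median before the gain is computed; the abstract does not state how the paper discretizes dimensions, so that choice, the helper names (`information_gain`, `rank_dimensions`, `keep_top_dimensions`), and the toy data are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(values, labels):
    """Information gain of one feature dimension with respect to the labels.

    Assumption: the continuous dimension is split into two bins at its
    (upper) median. This is an illustrative choice only.
    """
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    low = [y for v, y in zip(values, labels) if v < median]
    high = [y for v, y in zip(values, labels) if v >= median]
    n = len(labels)
    conditional = (len(low) / n) * entropy(low) + (len(high) / n) * entropy(high)
    return entropy(labels) - conditional


def rank_dimensions(vectors, labels):
    """Return (dimension index, gain) pairs sorted by descending gain."""
    dims = len(vectors[0])
    gains = [
        (d, information_gain([v[d] for v in vectors], labels))
        for d in range(dims)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


def keep_top_dimensions(vectors, labels, k):
    """Drop all but the k dimensions with the highest information gain."""
    top = sorted(d for d, _ in rank_dimensions(vectors, labels)[:k])
    return [[v[d] for d in top] for v in vectors]


# Toy data (hypothetical): four 3-dimensional vectors with binary labels.
vectors = [
    [0.1, 5.0, 0.3],
    [0.2, 1.0, 0.9],
    [0.9, 4.0, 0.2],
    [0.8, 2.0, 0.8],
]
labels = [0, 0, 1, 1]

ranking = rank_dimensions(vectors, labels)
reduced = keep_top_dimensions(vectors, labels, k=1)
```

In this toy data only the first dimension separates the two labels, so it is ranked first and is the one retained. In a study like the one described, the reduced vectors would then be passed to a binary SVM classifier to measure how much accuracy is lost as low-gain dimensions are removed.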
