Paper title
Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks
Paper authors
Paper abstract
Code clones are duplicate code fragments that share similar or nearly similar syntax or semantics. Code clone detection plays an important role in software maintenance, code refactoring, and reuse. A substantial body of research has addressed clone detection. Most of these approaches rely on lexical and syntactic information, and only a few target semantic clones. Recently, motivated by the success of deep learning models in other fields, including natural language processing and computer vision, researchers have begun adopting deep learning techniques to detect code clones. These approaches use lexical information (tokens) and/or syntactic structures such as abstract syntax trees (ASTs). However, they do not make sufficient use of the available structural and semantic information, which limits their capabilities. This paper addresses the problem of semantic code clone detection using program dependency graphs and geometric neural networks, leveraging structured syntactic and semantic information. We have developed a prototype tool, HOLMES, based on our novel approach and empirically evaluated it on popular code clone benchmarks. Our results show that HOLMES performs considerably better than TBCCD, another state-of-the-art tool. We also evaluated HOLMES on unseen projects and performed cross-dataset experiments to assess its generalizability. Our results affirm that HOLMES outperforms TBCCD: most of the clone pairs that HOLMES detected were either undetected or suboptimally reported by TBCCD.
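To make the idea of a graph-based Siamese network concrete, the following is a minimal sketch and not the HOLMES architecture, which the abstract does not specify. It assumes each code fragment has already been converted into a small dependency graph, given as an adjacency matrix plus per-node feature vectors. The function names (`propagate`, `encode_graph`, `siamese_similarity`), the mean-aggregation update, and the fixed weights are all hypothetical choices made for illustration. The defining Siamese property is that both graphs pass through the same encoder with the same weights before their embeddings are compared.

```python
import math


def propagate(adj, h, w):
    """One message-passing round over a graph (illustrative update rule).

    adj[i][j] == 1 means node i receives a message from node j.
    h is a list of node feature vectors; w is a weight matrix (list of rows).
    """
    n = len(h)
    dim_in = len(h[0])
    dim_out = len(w[0])
    out = []
    for i in range(n):
        # Each node averages its own state with the states of its in-neighbors.
        nbrs = [i] + [j for j in range(n) if adj[i][j] and j != i]
        mean = [sum(h[j][k] for j in nbrs) / len(nbrs) for k in range(dim_in)]
        # Linear map followed by a tanh non-linearity.
        out.append([
            math.tanh(sum(mean[k] * w[k][c] for k in range(dim_in)))
            for c in range(dim_out)
        ])
    return out


def encode_graph(adj, feats, weights):
    """Embed a whole graph: several propagation rounds, then mean pooling."""
    h = feats
    for w in weights:
        h = propagate(adj, h, w)
    n = len(h)
    return [sum(row[c] for row in h) / n for c in range(len(h[0]))]


def siamese_similarity(g1, g2, weights):
    """Cosine similarity of two graph embeddings produced with SHARED weights."""
    e1 = encode_graph(g1[0], g1[1], weights)
    e2 = encode_graph(g2[0], g2[1], weights)
    dot = sum(a * b for a, b in zip(e1, e2))
    norm = math.sqrt(sum(a * a for a in e1)) * math.sqrt(sum(b * b for b in e2))
    return dot / norm if norm else 0.0


# Toy inputs (assumed, not from the paper): a 3-node and a 2-node graph.
weights = [[[0.5, -0.2], [0.1, 0.3]]]  # one layer, untrained placeholder values
graph_a = ([[0, 0, 0], [1, 0, 0], [1, 1, 0]], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
graph_b = ([[0, 1], [0, 0]], [[0.0, 1.0], [1.0, 0.0]])

score = siamese_similarity(graph_a, graph_b, weights)
```

In a trained system the weights would be learned from labeled clone and non-clone pairs, and a threshold on the similarity score would decide whether a pair is reported as a clone. Here the weights are arbitrary, so the score carries no meaning beyond showing the data flow.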