Paper Title
CodeMatcher: Searching Code Based on Sequential Semantics of Important Query Words
Paper Authors
Paper Abstract
To accelerate software development, developers frequently search and reuse existing code snippets from large-scale codebases such as GitHub. Over the years, researchers have proposed many information retrieval (IR) based models for code search, but they fail to bridge the semantic gap between query and code. DeepCS, an early and successful deep learning (DL) based model, addressed this issue by learning the relationship between pairs of code methods and their corresponding natural language descriptions. Two major advantages of DeepCS are its ability to understand irrelevant/noisy keywords and to capture sequential relationships between words in query and code. In this paper, we propose an IR-based model, CodeMatcher, that inherits the advantages of DeepCS while leveraging the indexing techniques of IR-based models to substantially accelerate search response time. CodeMatcher first collects metadata for query words to identify irrelevant/noisy ones, then iteratively performs fuzzy search with the important query words on a codebase indexed by the Elasticsearch tool, and finally reranks the returned candidate code snippets according to how their tokens sequentially match the important words in the query. We verified its effectiveness on a large-scale codebase of ~41k repositories. Experimental results show that CodeMatcher achieves an MRR of 0.60, outperforming DeepCS, CodeHow, and UNIF by 82%, 62%, and 46%, respectively. Our proposed model is over 1.2k times faster than DeepCS. Moreover, CodeMatcher outperforms GitHub and Google search by 46% and 33%, respectively, in terms of MRR. We also observed that fusing the advantages of IR-based and DL-based models is promising, and that improving the quality of method naming helps code search, since method names play an important role in connecting queries and code.
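The abstract outlines a three-step pipeline: filter out unimportant query words, fuzzy-search an Elasticsearch index with the remaining words, and rerank candidates by how sequentially their tokens match those words. Below is a minimal, illustrative sketch of such a pipeline in Python. It is not the authors' implementation: the index name (code_methods), the tokens field, the stop-word heuristic for picking important words, and the use of the elasticsearch-py 8.x client are assumptions made purely for illustration.

```python
# Illustrative sketch only; index/field names and the stop-word list are assumptions.
from elasticsearch import Elasticsearch

STOP_WORDS = {"how", "to", "do", "i", "the", "a", "an", "in", "with", "using"}

def important_words(query: str) -> list[str]:
    """Step 1: keep only important query words (simple stop-word heuristic here;
    the paper collects richer metadata per word)."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def fuzzy_candidates(es: Elasticsearch, words: list[str], size: int = 50) -> list[dict]:
    """Step 2: fuzzy search over the Elasticsearch-indexed codebase.
    Assumes each indexed document stores the method's token sequence in a 'tokens' field."""
    query = {
        "bool": {
            "should": [
                {"match": {"tokens": {"query": w, "fuzziness": "AUTO"}}}
                for w in words
            ]
        }
    }
    resp = es.search(index="code_methods", query=query, size=size)
    return [hit["_source"] for hit in resp["hits"]["hits"]]

def sequential_score(code_tokens: list[str], words: list[str]) -> int:
    """Step 3: count how many important words are matched by code tokens
    in the same order as they appear in the query."""
    score, pos = 0, 0
    for w in words:
        for i in range(pos, len(code_tokens)):
            if w in code_tokens[i].lower():
                score, pos = score + 1, i + 1
                break
    return score

def search(es: Elasticsearch, query: str) -> list[dict]:
    words = important_words(query)
    candidates = fuzzy_candidates(es, words)
    return sorted(candidates,
                  key=lambda c: sequential_score(c["tokens"], words),
                  reverse=True)
```

The reranking step rewards candidates whose tokens cover the important query words in query order, which mirrors the sequential-semantics idea in the paper's title; the fuzzy retrieval step is what lets the approach reuse standard IR indexing for fast response times.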