Paper title

Sosed: a tool for finding similar software projects

Authors

Egor Bogomolov, Yaroslav Golubev, Artyom Lobanov, Vladimir Kovalenko, Timofey Bryksin

Abstract

In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute embeddings of sub-tokens in a dense space for 120,000 GitHub repositories in 200 languages. We then cluster the embeddings to identify groups of semantically similar sub-tokens that reflect topics in source code. A dataset of 9 million GitHub projects serves as the reference search base. To identify similar projects, we compare the distributions of clusters among their sub-tokens. The tool receives an arbitrary project as input, extracts sub-tokens in the 16 most popular programming languages, computes the project's cluster distribution, and finds the projects with the closest distributions in the search base. We labeled the sub-token clusters with short descriptions so that Sosed can produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of sub-tokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
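The comparison step described in the abstract can be illustrated with a minimal sketch. It is not Sosed's actual code: the function names, the toy sub-token-to-cluster mapping, and the choice of cosine similarity as the closeness measure are all assumptions made for illustration, since the abstract only says that the projects with the closest cluster distributions are returned.

```python
from collections import Counter
from math import sqrt


def cluster_distribution(subtokens, token_to_cluster, n_clusters):
    """Normalized histogram of cluster ids over a project's sub-tokens."""
    # Sub-tokens without a known cluster are skipped (an assumption of this sketch).
    counts = Counter(token_to_cluster[t] for t in subtokens if t in token_to_cluster)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_clusters
    return [counts.get(c, 0) / total for c in range(n_clusters)]


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def most_similar(query, search_base, k=1):
    """Names of the k projects in search_base whose distributions are closest to query."""
    ranked = sorted(
        search_base.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy data: two clusters, 0 = networking, 1 = numerical computing.
token_to_cluster = {
    "http": 0, "request": 0, "socket": 0,
    "matrix": 1, "tensor": 1, "gradient": 1,
}
search_base = {
    "web-server": cluster_distribution(["http", "request", "socket", "http"], token_to_cluster, 2),
    "ml-library": cluster_distribution(["matrix", "tensor", "gradient"], token_to_cluster, 2),
}
query = cluster_distribution(["tensor", "gradient", "http", "matrix"], token_to_cluster, 2)
print(most_similar(query, search_base))
```

In this toy setup the query project draws most of its sub-tokens from the numerical cluster, so it ranks closest to the hypothetical "ml-library" entry. In the real tool the mapping would come from clustering the fastText embeddings, and the search base would hold the precomputed distributions of the reference projects.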
