Paper Title

GitRanking: A Ranking of GitHub Topics for Software Classification using Active Sampling

Authors

Cezar Sas, Andrea Capiluppi, Claudio Di Sipio, Juri Di Rocco, Davide Di Ruscio

Abstract

GitHub is the world's largest host of source code, with more than 150M repositories. However, most of these repositories are not labeled or inadequately so, making it harder for users to find relevant projects. There have been various proposals for software application domain classification over the past years. However, these approaches lack a well-defined taxonomy that is hierarchical, grounded in a knowledge base, and free of irrelevant terms. This work proposes GitRanking, a framework for creating a classification ranked into discrete levels based on how general or specific their meaning is. We collected 121K topics from GitHub and considered $60\%$ of the most frequent ones for the ranking. GitRanking 1) uses active sampling to ensure a minimal number of required annotations; and 2) links each topic to Wikidata, reducing ambiguities and improving the reusability of the taxonomy. Our results show that developers, when annotating their projects, avoid using terms with a high degree of specificity. This makes the finding and discovery of their projects more challenging for other users. Furthermore, we show that GitRanking can effectively rank terms according to their general or specific meaning. This ranking would be an essential asset for developers to build upon, allowing them to complement their annotations with more precise topics. Finally, we show that GitRanking is a dynamically extensible method: it can currently accept further terms to be ranked with a minimum number of annotations ($\sim$ 15). This paper is the first collective attempt to build a ground-up taxonomy of software domains.
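
The idea of ranking topics from pairwise "which is more general?" annotations and then cutting the ranking into discrete levels can be illustrated with a small sketch. The abstract does not specify the rating model or the sampling strategy, so this example swaps in a plain Elo update and a simple "least-compared, closest-rated pair first" rule as stand-ins; the topic names, the simulated annotator, and all function names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def update_elo(ratings, general, specific, k=32.0):
    """Shift ratings after an annotator judged `general` broader than `specific`."""
    expected = 1.0 / (1.0 + 10 ** ((ratings[specific] - ratings[general]) / 400.0))
    delta = k * (1.0 - expected)
    ratings[general] += delta
    ratings[specific] -= delta


def next_pair(ratings, counts):
    """Pick the least-compared pair, breaking ties by the closest ratings."""
    return min(
        combinations(sorted(ratings), 2),
        key=lambda p: (counts[frozenset(p)], abs(ratings[p[0]] - ratings[p[1]])),
    )


def to_levels(ratings, n_levels):
    """Order topics from most general to most specific and split into equal-sized levels."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    size = -(-len(ordered) // n_levels)  # ceiling division
    return {topic: i // size for i, topic in enumerate(ordered)}


# Hypothetical ground truth, used only to simulate a consistent annotator.
TRUE_GENERALITY = {"programming": 4, "web": 3, "frontend": 2, "react": 1}

ratings = {topic: 1000.0 for topic in TRUE_GENERALITY}
counts = defaultdict(int)

for _ in range(30):  # annotation budget (assumed value)
    a, b = next_pair(ratings, counts)
    counts[frozenset((a, b))] += 1
    if TRUE_GENERALITY[a] > TRUE_GENERALITY[b]:
        update_elo(ratings, a, b)
    else:
        update_elo(ratings, b, a)

levels = to_levels(ratings, n_levels=2)
print(levels)
```

Level 0 holds the most general topics and higher level numbers hold more specific ones. A real pipeline would replace the simulated annotator with human judgments and would likely use a rating model that tracks uncertainty, so that a newly added term can be placed after only a few comparisons.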
