BuGL -- A Cross-Language Dataset for Bug Localization

Paper Title

BuGL -- A Cross-Language Dataset for Bug Localization

Authors

Muvva, Sandeep; Rao, A Eashaan; Chimalakonda, Sridhar

Abstract

Bug localization is the process of locating potentially error-prone files or methods from a given bug report and source code. There is extensive research on bug localization in the literature, focused on applying information retrieval techniques, machine learning/deep learning approaches, or both, to detect the location of bugs. The common premise for all approaches is the availability of a good dataset, which in this case is the standard benchmark dataset comprising 6 Java projects and, in some cases, more than 6 Java projects. The existing datasets do not include projects written in other programming languages, despite the need to investigate project-specific and cross-project bug localization. To the best of our knowledge, no existing dataset addresses this concern. In this paper, we present BuGL, a large-scale cross-language dataset. BuGL comprises more than 10,000 bug reports drawn from open-source projects written in four programming languages, namely C, C++, Java, and Python. The dataset includes information from bug reports and pull requests. BuGL aims to open up new research opportunities in the area of bug localization.
