Paper Title
TraceSim: A Method for Calculating Stack Trace Similarity
Paper Authors
Paper Abstract
Many contemporary software products have subsystems for automatic crash reporting. However, it is well known that the same bug can produce slightly different reports. To manage this problem, reports are usually grouped, often manually by developers. Manual triaging, however, becomes infeasible for products with large user bases, which has motivated many different approaches to automating this task. Moreover, because of the large volume of reports that must be processed properly, it is important to improve the quality of triaging: even a relatively small improvement could play a significant role in the overall accuracy of report bucketing. The majority of existing studies use some kind of stack trace similarity metric, based either on information retrieval techniques or on string matching methods. However, it should be stressed that the quality of triaging is still insufficient. In this paper, we describe TraceSim, a novel approach to this problem that combines TF-IDF, Levenshtein distance, and machine learning to construct a similarity metric. Our metric has been implemented inside an industrial-grade report triaging system. An evaluation on a manually labeled dataset shows significantly better results compared to baseline approaches.
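To make the combination of TF-IDF and Levenshtein distance more concrete, the following is a minimal illustrative sketch, not the TraceSim formula itself, which the abstract does not specify. It assumes that a stack trace is a list of frame names, that each edit operation costs the smoothed IDF weight of the frame involved, and that the distance is normalized into a similarity in [0, 1]. The function names, the example corpus, and the normalization are all hypothetical, and the machine learning component is omitted entirely.

```python
import math
from collections import Counter


def idf_weights(traces):
    """Return a function mapping a frame to its smoothed IDF weight.

    Assumption: frames that occur in many traces (e.g. "main") carry
    little information, so they get a low weight.
    """
    n = len(traces)
    doc_freq = Counter(frame for trace in traces for frame in set(trace))

    def weight(frame):
        # Frames never seen in the corpus get the highest weight (count 0).
        return math.log((1 + n) / (1 + doc_freq.get(frame, 0))) + 1.0

    return weight


def weighted_levenshtein(a, b, weight):
    """Levenshtein distance over frames with per-frame edit costs."""
    prev = [0.0]
    for frame_b in b:
        prev.append(prev[-1] + weight(frame_b))
    for frame_a in a:
        cur = [prev[0] + weight(frame_a)]
        for j, frame_b in enumerate(b, start=1):
            sub = 0.0 if frame_a == frame_b else weight(frame_a) + weight(frame_b)
            cur.append(min(
                prev[j] + weight(frame_a),     # delete frame_a
                cur[j - 1] + weight(frame_b),  # insert frame_b
                prev[j - 1] + sub,             # match or substitute
            ))
        prev = cur
    return prev[-1]


def similarity(a, b, weight):
    """Normalize the weighted distance into a similarity in [0, 1]."""
    total = sum(weight(f) for f in a) + sum(weight(f) for f in b)
    if total == 0:
        return 1.0  # two empty traces are treated as identical
    return 1.0 - weighted_levenshtein(a, b, weight) / total


# Hypothetical corpus of three stack traces, outermost frame first.
corpus = [
    ["main", "run", "parse", "readToken"],
    ["main", "run", "parse", "nextChar"],
    ["main", "run", "render", "drawFrame"],
]
weight = idf_weights(corpus)
s_same = similarity(corpus[0], corpus[0], weight)
s_close = similarity(corpus[0], corpus[1], weight)
s_far = similarity(corpus[0], corpus[2], weight)
```

In this sketch, two traces that differ only in their last frame score higher than two traces that share only the common `main` and `run` frames, because the shared rare frame `parse` counts for more than the ubiquitous ones. A learned component, as mentioned in the abstract, could then tune how such weights are computed or thresholded, but that step is not shown here.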