DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt Identification using Semi-Supervised Learning

Paper Title

DebtFree: Minimizing Labeling Cost in Self-Admitted Technical Debt Identification using Semi-Supervised Learning

Authors

Huy Tu, Tim Menzies

Abstract

Keeping track of and managing Self-Admitted Technical Debts (SATDs) is important for maintaining a healthy software project. The current active-learning SATD identification tool requires manual inspection of 24% of the test comments on average to reach 90% recall. Since only about 5% of all test comments are SATDs, human experts must read almost five times as many comments as there are SATDs, which shows how inefficient the tool is. In addition, human experts are still prone to error: 95% of the false-positive labels from previous work were actually true positives. To solve these problems, we propose DebtFree, a two-mode framework based on unsupervised learning for identifying SATDs. In mode 1, when the existing training data is unlabeled, DebtFree starts with an unsupervised learner that automatically pseudo-labels the programming comments in the training data. In mode 2, when labels are available for the training data, DebtFree starts with a pre-processor that identifies the comments in the test dataset that are highly likely to be SATDs. In both cases, our machine learning model is then employed to assist human experts in manually identifying the remaining SATDs. Our experiments on 10 software projects show that both modes yield a statistically significant improvement in effectiveness over the state-of-the-art automated and semi-automated models. Specifically, DebtFree can reduce the labeling effort by 99% in mode 1 (unlabeled training data) and by up to 63% in mode 2 (labeled training data), while improving the current active learner's F1 by almost 100% in relative terms.
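The reading-effort claim in the abstract can be checked with a little arithmetic. The sketch below is only an illustration of that calculation, not part of DebtFree: the function name `inspection_cost` and the corpus size of 10,000 comments are hypothetical, while the three default rates (24% inspected, 5% SATD, 90% recall) are the figures quoted in the abstract.

```python
def inspection_cost(n_comments, inspected_frac=0.24, satd_frac=0.05, recall=0.90):
    """Estimate how many comments a human reads per SATD.

    Defaults are the rates quoted in the abstract; n_comments is a
    hypothetical corpus size chosen only for illustration.
    """
    inspected = n_comments * inspected_frac   # comments read by the human
    satd_total = n_comments * satd_frac       # SATDs present in the test set
    satd_found = satd_total * recall          # SATDs recovered at the target recall
    return {
        "inspected": inspected,
        "satd_total": satd_total,
        "satd_found": satd_found,
        # comments read per SATD that exists ("almost five times")
        "read_per_satd": inspected / satd_total,
        # comments read per SATD actually found
        "read_per_found": inspected / satd_found,
    }


cost = inspection_cost(10_000)
print(f"comments read per existing SATD: {cost['read_per_satd']:.2f}")
print(f"comments read per SATD found:    {cost['read_per_found']:.2f}")
```

Under these rates the ratio of comments read to SATDs present is 0.24 / 0.05 = 4.8, which matches the "almost five times" figure, and the cost per SATD actually found is slightly higher because only 90% of them are recovered.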
