Title


Run, Forest, Run? On Randomization and Reproducibility in Predictive Software Engineering

Authors

Cynthia C. S. Liem and Annibale Panichella

Abstract

Machine learning (ML) has been widely used in the literature to automate software engineering tasks. However, ML outcomes may be sensitive to randomization in data sampling mechanisms and learning procedures. To understand whether and how researchers in SE address these threats, we surveyed 45 recent papers related to three predictive tasks: defect prediction (DP), predictive mutation testing (PMT), and code smell detection (CSD). We found that less than 50% of the surveyed papers address the threats related to randomized data sampling (via multiple repetitions); only 8% of the papers address the random nature of ML; and parameter values are rarely reported (only 18% of the papers). To assess the severity of these threats, we conducted an empirical study using 26 real-world datasets commonly considered for the three predictive tasks of interest, considering eight common supervised ML classifiers. We show that different data resamplings for 10-fold cross-validation lead to extreme variability in observed performance results. Furthermore, randomized ML methods also show non-negligible variability for different choices of random seeds. More worryingly, performance and variability are inconsistent for different implementations of the conceptually same ML method in different libraries, as also shown through multi-dataset pairwise comparison. To cope with these critical threats, we provide practical guidelines on how to validate, assess, and report the results of predictive methods.
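To illustrate the kind of variability the abstract describes, the minimal sketch below repeats 10-fold cross-validation under different random seeds and reports the spread of the observed scores. It is not the authors' replication package: the synthetic dataset, the random forest classifier, the F1 metric, the number of repetitions, and the use of scikit-learn are all illustrative assumptions.

    # Sketch: how the seed controlling data resampling and learner randomness
    # changes the performance observed under 10-fold cross-validation.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Placeholder imbalanced dataset standing in for a defect-prediction corpus.
    X, y = make_classification(n_samples=1000, n_features=20,
                               weights=[0.8, 0.2], random_state=0)

    mean_scores = []
    for seed in range(30):  # number of repetitions is an arbitrary choice here
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        clf = RandomForestClassifier(random_state=seed)  # also fixes the learner's own randomness
        scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
        mean_scores.append(scores.mean())

    print(f"mean F1 across seeds: {np.mean(mean_scores):.3f}")
    print(f"std  F1 across seeds: {np.std(mean_scores):.3f}")  # a non-trivial spread is why repetitions and seeds should be reported

Reporting the distribution over such repetitions (rather than a single run), together with the seeds and parameter values used, is in line with the guidelines the paper advocates.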
