On the impact of dataset size and class imbalance in evaluating machine-learning-based Windows malware detection techniques

Paper title

On the impact of dataset size and class imbalance in evaluating machine-learning-based windows malware detection techniques

Author

Illes, David

Abstract

The purpose of this project was to collect and analyse data on the comparability and real-life applicability of published results on Microsoft Windows malware detection, specifically the impact of dataset size and testing dataset imbalance on measured detector performance. Some researchers use smaller datasets, and if dataset size has a significant impact on performance, published results become difficult to compare. Researchers also tend to test on balanced datasets and to report accuracy as the metric. The former is not a true representation of reality, where benign samples significantly outnumber malware, and the latter approach is known to be problematic for imbalanced problems. The project identified two key objectives: to understand whether dataset size correlates with measured detector performance to an extent that prevents meaningful comparison of published results, and to understand whether approaches that report good performance in published research can be expected to perform well in a real-world deployment scenario. The results suggested that dataset size does correlate with measured detector performance to an extent that prevents meaningful comparison of published results, and that, without understanding the nature of the training set size-accuracy curve behind those results, conclusions about which approach is "better" shouldn't be made solely on the basis of accuracy scores. The results also suggested that high accuracy scores don't necessarily translate to high real-world performance.
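The concern about accuracy on balanced test sets can be illustrated with a small arithmetic sketch. The true-positive and false-positive rates and the malware fractions below are hypothetical values chosen for illustration, not figures from the paper. The sketch holds a detector's per-class error rates fixed and varies only the share of malware in the test population.

```python
# Illustrative sketch only: TPR, FPR and the class mixes are assumed
# values, not results reported in the paper.

def accuracy(tpr, fpr, malware_fraction):
    """Overall accuracy of a detector with fixed TPR/FPR at a given class mix."""
    benign_fraction = 1.0 - malware_fraction
    return tpr * malware_fraction + (1.0 - fpr) * benign_fraction


def precision(tpr, fpr, malware_fraction):
    """Share of raised alerts that are actually malware at a given class mix."""
    true_positives = tpr * malware_fraction
    false_positives = fpr * (1.0 - malware_fraction)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    TPR, FPR = 0.99, 0.01  # hypothetical detector
    for fraction in (0.5, 0.1, 0.01, 0.001):
        print(
            f"malware fraction {fraction:>6}: "
            f"accuracy {accuracy(TPR, FPR, fraction):.3f}, "
            f"precision {precision(TPR, FPR, fraction):.3f}"
        )
```

Under these assumed rates, accuracy stays the same at every class mix, while precision falls as benign samples come to dominate. This is one way a high accuracy score on a balanced test set can overstate how a detector would behave in deployment.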
