A Diagnostic Approach to Assess the Quality of Data Splitting in Machine Learning

Paper Title

A Diagnostic Approach to Assess the Quality of Data Splitting in Machine Learning

Authors

Jain, Eklavya; Neeraja, J.; Banerjee, Buddhananda; Ghosh, Palash

Abstract

In machine learning, a routine practice is to split the data into a training set and a test set. A proposed model is built on the training data, and its performance is then assessed on the test data. Usually, the data is split randomly into training and test sets on an ad hoc basis. This approach, which hinges on random splitting, often works well, but more often than not it fails to gauge the model's ability to generalize under perturbations of the training and test inputs. Experimentally, this sensitivity to randomness in the input data becomes apparent when a new iteration of a fixed pipeline, from model building to training and testing, is executed and an overly optimistic performance estimate is reported. Since the consistency of a model's performance depends predominantly on the data splitting, any conclusion about the robustness of the model is unreliable in such a scenario. We propose a diagnostic approach that quantitatively assesses the quality of a given split in terms of its true randomness and provides a basis for inferring model insensitivity towards the input data. We associate model robustness with random splitting using a self-defined, data-driven distance metric based on the Mahalanobis squared distance between a training set and its corresponding test set. The probability distribution of the distance metric is simulated using Monte Carlo simulations, and a threshold is calculated from a one-sided hypothesis test. We motivate and showcase the performance of the proposed approach using various real data sets. We also use the proposed method to compare the performance of existing data splitting methods.
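
The diagnostic outlined in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the distance metric precisely, so the sketch assumes it is the Mahalanobis squared distance between the training and test mean vectors, using the covariance of the full data set. The function names `split_distance` and `random_split_threshold`, the significance level `alpha=0.05`, and the number of simulations are all hypothetical choices made for the example.

```python
import numpy as np


def split_distance(X, test_idx, cov_inv):
    """Mahalanobis squared distance between train and test mean vectors.

    Assumption: the paper's metric may be defined differently; this is
    one plausible reading of "distance between a train set and its test set".
    """
    mask = np.zeros(len(X), dtype=bool)
    mask[test_idx] = True
    diff = X[~mask].mean(axis=0) - X[mask].mean(axis=0)
    return float(diff @ cov_inv @ diff)


def random_split_threshold(X, test_size=0.2, n_sim=2000, alpha=0.05, seed=0):
    """Monte Carlo estimate of the one-sided (upper-tail) threshold.

    Simulates n_sim uniformly random splits, computes the distance for
    each, and returns the (1 - alpha) quantile plus the inverse covariance.
    """
    rng = np.random.default_rng(seed)
    # Pseudo-inverse of the full-data covariance (hypothetical choice).
    cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(X, rowvar=False)))
    n = len(X)
    n_test = int(round(test_size * n))
    dists = np.empty(n_sim)
    for i in range(n_sim):
        test_idx = rng.choice(n, size=n_test, replace=False)
        dists[i] = split_distance(X, test_idx, cov_inv)
    return float(np.quantile(dists, 1 - alpha)), cov_inv


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))  # synthetic data, for illustration only
    threshold, cov_inv = random_split_threshold(X, test_size=0.2)

    # A deliberately non-random split: test set = rows with the largest
    # values of the first feature.
    biased_idx = np.argsort(X[:, 0])[-100:]
    d_biased = split_distance(X, biased_idx, cov_inv)
    print("threshold:", threshold)
    print("biased split distance:", d_biased)
    print("flagged as non-random:", d_biased > threshold)
```

Under these assumptions, a split whose distance exceeds the simulated threshold would be rejected as inconsistent with a truly random split at level `alpha`. The deliberately biased split in the usage example is meant to show the kind of case the diagnostic is designed to catch.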
