Paper Title

Real-time discriminant analysis in the presence of label and measurement noise

Authors

Iwein Vranckx, Jakob Raymaekers, Bart De Ketelaere, Peter J. Rousseeuw, Mia Hubert

Abstract

Quadratic discriminant analysis (QDA) is a widely used classification technique. Based on a training dataset, each class in the data is characterized by an estimate of its center and shape, which can then be used to assign unseen observations to one of the classes. The traditional QDA rule relies on the empirical mean and covariance matrix. Unfortunately, these estimators are sensitive to label and measurement noise, which often impairs the model's predictive ability. Robust estimators of location and scatter are resistant to this type of contamination. However, they have a prohibitive computational cost for large-scale industrial experiments. We present a novel QDA method based on a recent real-time robust algorithm. We additionally integrate an anomaly detection step to classify the most atypical observations into a separate class of outliers. Finally, we introduce the label bias plot, a graphical display to identify label and measurement noise in the training data. The performance of the proposed approach is illustrated in a simulation study with huge datasets, and on real datasets about diabetes and fruit.
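
To make the abstract's description concrete, here is a minimal sketch of the traditional QDA rule (empirical mean and covariance per class) with a simple outlier class added on top. It is not the authors' real-time robust method: the function names `fit_qda` and `predict_qda`, the synthetic data, and the outlier rule (flag a point when its squared Mahalanobis distance to every class exceeds a chi-square cutoff) are all assumptions made for illustration.

```python
import math
import numpy as np

def fit_qda(X, y):
    """Estimate prior, center, and covariance per class (classical, non-robust)."""
    params = {}
    for label in np.unique(y):
        Xc = X[y == label]
        params[label] = {
            "prior": len(Xc) / len(X),
            "center": Xc.mean(axis=0),        # empirical mean
            "cov": np.cov(Xc, rowvar=False),  # empirical covariance matrix
        }
    return params

def predict_qda(params, X, outlier_cutoff=None, outlier_label=-1):
    """Assign each row of X to the class with the highest quadratic score.

    If outlier_cutoff is given, rows whose squared Mahalanobis distance to
    every class exceeds it get outlier_label (an illustrative rule only).
    """
    labels = list(params)
    scores = np.empty((len(X), len(labels)))
    dists = np.empty_like(scores)
    for j, label in enumerate(labels):
        p = params[label]
        diff = X - p["center"]
        # Squared Mahalanobis distance of each row to this class center.
        d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(p["cov"]), diff)
        _, logdet = np.linalg.slogdet(p["cov"])
        scores[:, j] = -0.5 * d2 - 0.5 * logdet + math.log(p["prior"])
        dists[:, j] = d2
    pred = np.array(labels)[scores.argmax(axis=1)]
    if outlier_cutoff is not None:
        pred = np.where(dists.min(axis=1) > outlier_cutoff, outlier_label, pred)
    return pred

# Synthetic two-class data in 2 dimensions (assumed, not from the paper).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 1.0, size=(200, 2)),
    rng.normal([6, 6], 1.0, size=(200, 2)),
])
y = np.repeat([0, 1], 200)
params = fit_qda(X, y)

# 0.99 quantile of the chi-square distribution; this closed form holds
# only for 2 degrees of freedom, matching the 2-dimensional data above.
cutoff = -2.0 * math.log(1 - 0.99)

new_points = np.array([[0.2, -0.1], [5.8, 6.3], [40.0, -40.0]])
pred = predict_qda(params, new_points, outlier_cutoff=cutoff)
print(pred)
```

Because `fit_qda` uses the empirical mean and covariance, mislabeled or corrupted training rows would shift the centers and inflate the covariances, which is the sensitivity the abstract points out. Replacing those two estimates with robust ones is the natural place to plug in a different estimator.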
