Paper Title
Enhancement Encoding: A Novel Imbalanced Classification Approach via Encoding the Training Labels
Paper Authors
Paper Abstract
Class imbalance, also known as long-tailed distribution, is a common problem in machine-learning-based classification tasks. When it occurs, the minority data are overwhelmed by the majority, which poses a significant challenge for data science. To address the class imbalance problem, researchers have proposed many methods: some balance the data set itself (e.g., SMOTE), others refine the loss function (e.g., Focal Loss), and some have noticed that the value of labels influences class-imbalanced learning (Yang and Xu, "Rethinking the value of labels for improving class-imbalanced learning," NeurIPS 2020), but no one has yet changed the way the labels of the data are encoded. Today, the most prevalent technique for encoding labels is one-hot encoding, owing to its strong performance in the general setting. However, it is a poor choice for imbalanced data, because the classifier treats majority and minority samples equally. In this paper, we propose the enhancement encoding technique, which is specifically designed for imbalanced classification. Enhancement encoding combines re-weighting and cost-sensitivity, so it can reflect the difference between hard and easy (or minority and majority) classes. To reduce the number of validation samples and the computational cost, we also replace the confusion matrix with a novel soft-confusion matrix, which works better with a small validation set. In the experiments, we evaluate enhancement encoding with three different types of loss. The results show that enhancement encoding is very effective at improving the performance of networks trained on imbalanced data; in particular, performance on the minority classes is much better.
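The abstract does not spell out the construction, but a minimal sketch of the two ingredients it names might look as follows. The function names (`soft_confusion_matrix`, `enhancement_encode`) and the specific weighting rule `1 + (1 - recall)**alpha` are illustrative assumptions, not the paper's definitions; the only grounded ideas are (a) accumulating softmax probabilities instead of hard argmax predictions, so a small validation set suffices, and (b) scaling the one-hot target so that hard/minority classes receive a larger magnitude.

```python
import numpy as np

def soft_confusion_matrix(probs, labels, num_classes):
    """Hypothetical soft-confusion matrix: instead of counting hard
    argmax predictions, accumulate full softmax probability vectors,
    so every validation sample contributes a whole distribution.

    probs:  (N, C) softmax outputs on a small validation set
    labels: (N,)   integer ground-truth classes
    """
    M = np.zeros((num_classes, num_classes))
    for p, y in zip(probs, labels):
        M[y] += p                      # row y collects probability mass
    # normalize each row into a distribution over predicted classes
    row_sums = M.sum(axis=1, keepdims=True)
    return M / np.maximum(row_sums, 1e-12)

def enhancement_encode(labels, M, num_classes, alpha=1.0):
    """Hypothetical enhancement encoding: re-weight the one-hot target
    by how poorly the class is currently recognized (cost-sensitive),
    so hard/minority classes get a larger target magnitude.
    """
    recall = np.diag(M)                      # per-class soft recall
    weights = 1.0 + (1.0 - recall) ** alpha  # harder class -> larger weight
    onehot = np.eye(num_classes)[labels]
    return onehot * weights[labels][:, None]

# Example: 3 classes, a tiny validation set
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 2])
M = soft_confusion_matrix(probs, labels, num_classes=3)
targets = enhancement_encode(labels, M, num_classes=3)
```

Under these assumptions, a class with perfect soft recall keeps a weight of 1 (plain one-hot), while a poorly recognized minority class gets an amplified target, which is one way to realize the cost-sensitive behavior the abstract describes.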