Paper Title

Feature Selection from High-Dimensional Data with Very Low Sample Size: A Cautionary Tale

Paper Authors

Kuncheva, Ludmila I., Matthews, Clare E., Arnaiz-González, Álvar, Rodríguez, Juan J.

Paper Abstract

In classification problems, the purpose of feature selection is to identify a small, highly discriminative subset of the original feature set. In many applications, the dataset may have thousands of features and only a few dozen samples (sometimes termed 'wide'). This study is a cautionary tale demonstrating why feature selection in such cases may lead to undesirable results. To highlight the sample size issue, we derive the required sample size for declaring two features different. Using an example, we illustrate the heavy dependency between feature set and classifier, which poses a problem for classifier-agnostic feature selection methods. However, the choice of a good selector-classifier pair is hampered by the low correlation between estimated and true error rate, as illustrated by another example. While previous studies raising similar issues validate their message with mostly synthetic data, here we carried out an experiment with 20 real datasets. We created an exaggerated scenario whereby we cut a very small portion of the data (10 instances per class) for feature selection and used the rest of the data for testing. The results reinforce the caution and suggest that it may be better to refrain from feature selection on very wide datasets rather than return misleading output to the user.
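The "exaggerated scenario" described above can be sketched as follows. This is a minimal illustration, not the paper's actual protocol: the synthetic data, t-score feature ranking, and nearest-mean classifier are all assumptions chosen for brevity; the paper uses 20 real datasets and several selector-classifier pairs. The sketch keeps only the key element: a tiny stratified cut (10 instances per class) drives both feature selection and training, and the remaining data is used for testing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "wide"-style data: 60 samples, 1000 features, 2 classes.
# Only the first 10 features carry a weak class signal.
n_per_class, n_features = 30, 1000
X = rng.normal(size=(2 * n_per_class, n_features))
y = np.repeat([0, 1], n_per_class)
X[y == 1, :10] += 0.5

# Stratified cut: 10 instances per class for selection + training,
# the rest held out for testing (the paper's exaggerated split).
train_idx = np.concatenate([np.flatnonzero(y == c)[:10] for c in (0, 1)])
test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
Xtr, ytr = X[train_idx], y[train_idx]
Xte, yte = X[test_idx], y[test_idx]

# Rank features by absolute two-sample t-score on the tiny split
# and keep the 10 "best" ones.
g0, g1 = Xtr[ytr == 0], Xtr[ytr == 1]
t = (g0.mean(0) - g1.mean(0)) / np.sqrt(
    g0.var(0, ddof=1) / len(g0) + g1.var(0, ddof=1) / len(g1)
)
top = np.argsort(-np.abs(t))[:10]

# Nearest-mean classifier on the selected features.
mu0 = g0[:, top].mean(0)
mu1 = g1[:, top].mean(0)
d0 = ((Xte[:, top] - mu0) ** 2).sum(1)
d1 = ((Xte[:, top] - mu1) ** 2).sum(1)
pred = (d1 < d0).astype(int)
acc = float(np.mean(pred == yte))
print(f"test accuracy with selected features: {acc:.2f}")
```

With only 20 training samples against 1000 features, the t-scores are noisy and many of the "selected" features are spurious, which is precisely the failure mode the paper cautions against.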
