Paper Title
Distance Correlation Sure Independence Screening for Accelerated Feature Selection in Parkinson's Disease Vocal Data
Paper Authors
Paper Abstract
With the abundance of machine learning methods available and the temptation to use them all in an ensemble, a model-agnostic method of feature selection is incredibly alluring. Principal component analysis was developed in 1901 and has been a strong contender in this role ever since, but it is ultimately an unsupervised method. It offers no guarantee that the selected features have good predictive power, because it does not know what is being predicted. To this end, Peng et al. developed the minimum redundancy-maximum relevance (mRMR) method in 2005. It uses the mutual information not only between predictors but also between each predictor and the response. Estimating mutual information and entropy tends to be expensive and problematic, which leads to excessive processing times even for a dataset that is approximately 750 by 750 in a Leave-One-Subject-Out jackknife setting. To remedy this, we use a method from 2012 called Distance Correlation Sure Independence Screening (DC-SIS), which uses the distance correlation measure of Székely et al. to select the features that have the greatest dependence with the response. We show that, on Parkinson's Disease vocal diagnosis data, this method produces results statistically indistinguishable from those of mRMR selection while running 90 times faster.
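The screening idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard biased (V-statistic) sample distance correlation of Székely et al., computed from double-centered pairwise distance matrices, and ranks each feature by its distance correlation with the response. The function names (`distance_correlation`, `dc_sis`) and the cutoff `d` are hypothetical and chosen only for illustration.

```python
import numpy as np


def _centered_distances(v):
    """Pairwise Euclidean distance matrix of the samples, double-centered."""
    v = np.asarray(v, dtype=float)
    v = v.reshape(v.shape[0], -1)  # treat a 1-D input as n samples of 1 feature
    diff = v[:, None, :] - v[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Subtract row and column means, then add back the grand mean.
    return (dist
            - dist.mean(axis=0, keepdims=True)
            - dist.mean(axis=1, keepdims=True)
            + dist.mean())


def distance_correlation(x, y):
    """Biased sample distance correlation between x and y (value in [0, 1])."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = (a * b).mean()                           # squared distance covariance
    denom = np.sqrt((a * a).mean() * (b * b).mean())  # product of distance std devs
    if denom <= 0.0:
        return 0.0  # a constant variable has no measurable dependence
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def dc_sis(X, y, d):
    """Return indices of the d features with the largest distance correlation
    with the response, plus the score of every feature."""
    X = np.asarray(X, dtype=float)
    scores = np.array([distance_correlation(X[:, j], y)
                       for j in range(X.shape[1])])
    top = np.argsort(scores)[::-1][:d]  # highest scores first
    return top, scores
```

A call such as `top, scores = dc_sis(X, y, d=20)` would return the column indices to keep before fitting any downstream model. Each feature is scored independently of the others, which is what makes the screening cheap relative to methods that also estimate pairwise redundancy between predictors; the trade-off is that this sketch makes no attempt to remove redundant features. Each score needs two n-by-n distance matrices, so memory grows quadratically with the number of samples.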