Paper Title
A Rigorous Machine Learning Analysis Pipeline for Biomedical Binary Classification: Application in Pancreatic Cancer Nested Case-control Studies with Implications for Bias Assessments
Authors
Abstract
Machine learning (ML) offers a collection of powerful approaches for detecting and modeling associations, often applied to data having a large number of features and/or complex associations. Currently, there are many tools to facilitate implementing custom ML analyses (e.g., scikit-learn). Interest is also increasing in automated ML packages, which can make it easier for non-experts to apply ML and have the potential to improve model performance. ML permeates most subfields of biomedical research with varying levels of rigor and correct usage. The tremendous opportunities offered by ML are frequently offset by the challenge of assembling comprehensive analysis pipelines and by the ease of ML misuse. In this work, we have laid out and assembled a complete, rigorous ML analysis pipeline focused on binary classification (i.e., case/control prediction), and applied this pipeline to both simulated and real-world data. At a high level, this 'automated' but customizable pipeline includes a) exploratory analysis, b) data cleaning and transformation, c) feature selection, d) model training with 9 established ML algorithms, each with hyperparameter optimization, and e) thorough evaluation, including appropriate metrics, statistical analyses, and novel visualizations. This pipeline organizes the many subtle complexities of ML pipeline assembly to illustrate best practices to avoid bias and ensure reproducibility. Additionally, this pipeline is the first to compare established ML algorithms to 'ExSTraCS', a rule-based ML algorithm with the unique capability of interpretably modeling heterogeneous patterns of association. While designed to be widely applicable, we apply this pipeline to an epidemiological investigation of established and newly identified risk factors for pancreatic cancer to evaluate how different sources of bias might be handled by ML algorithms.
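The pipeline stages listed in the abstract (transformation, feature selection, hyperparameter optimization, and evaluation) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation: the simulated data, the two example algorithms, the parameter grids, the `f_classif` selector, and the nested cross-validation layout are all assumptions chosen for brevity, and ExSTraCS is not included. The point it illustrates is that scaling and feature selection are fit inside each training fold, which is one common way to avoid leaking test information into the model.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Simulated binary (case/control) data; a hypothetical stand-in for a real cohort.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Two example algorithms with small, illustrative hyperparameter grids
# (the paper evaluates 9 algorithms; these grids are assumptions).
candidates = {
    "logistic_regression": (
        LogisticRegression(max_iter=1000),
        {"model__C": [0.1, 1.0, 10.0]},
    ),
    "decision_tree": (
        DecisionTreeClassifier(random_state=0),
        {"model__max_depth": [3, 5, None]},
    ),
}

# Inner folds tune hyperparameters; outer folds estimate performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for name, (model, grid) in candidates.items():
    # Scaling and feature selection sit inside the pipeline, so they are
    # re-fit on the training portion of every fold only.
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif)),
        ("model", model),
    ])
    param_grid = {"select__k": [5, 10], **grid}
    search = GridSearchCV(pipe, param_grid, scoring="balanced_accuracy", cv=inner_cv)
    # Each outer fold runs its own hyperparameter search on its training split.
    results[name] = cross_val_score(
        search, X, y, scoring="balanced_accuracy", cv=outer_cv
    )

for name, scores in results.items():
    print(f"{name}: mean balanced accuracy = {scores.mean():.3f}")
```

Balanced accuracy is used here because case/control data are often imbalanced; any other metric accepted by scikit-learn's `scoring` parameter could be substituted.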