Paper title
Detect and Correct Bias in Multi-Site Neuroimaging Datasets
Paper authors
Paper abstract
The desire to train complex machine learning algorithms and to increase the statistical power in association studies drives neuroimaging research to use ever-larger datasets. The most obvious way to increase sample size is by pooling scans from independent studies. However, simple pooling is often ill-advised as selection, measurement, and confounding biases may creep in and yield spurious correlations. In this work, we combine 35,320 magnetic resonance images of the brain from 17 studies to examine bias in neuroimaging. In the first experiment, Name That Dataset, we provide empirical evidence for the presence of bias by showing that scans can be correctly assigned to their respective dataset with 71.5% accuracy. Given such evidence, we take a closer look at confounding bias, which is often viewed as the main shortcoming in observational studies. In practice, we neither know all potential confounders nor do we have data on them. Hence, we model confounders as unknown, latent variables. Kolmogorov complexity is then used to decide whether the confounded or the causal model provides the simplest factorization of the graphical model. Finally, we present methods for dataset harmonization and study their ability to remove bias in imaging features. In particular, we propose an extension of the recently introduced ComBat algorithm to control for global variation across image features, inspired by adjusting for population stratification in genetics. Our results demonstrate that harmonization can reduce dataset-specific information in image features. Further, confounding bias can be reduced and even turned into a causal relationship. However, harmonization also requires caution as it can easily remove relevant subject-specific information. Code is available at https://github.com/ai-med/Dataset-Bias.
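As a rough illustration of the two ideas in the abstract, the sketch below runs a toy "Name That Dataset" test on synthetic features and then applies a plain per-site location and scale adjustment. It is not the authors' method: the function names, the nearest-centroid classifier, and the simulated site effects are all assumptions made for this example, and the adjustment omits the empirical Bayes shrinkage and covariate preservation that ComBat uses, as well as the extension proposed in the paper.

```python
import numpy as np


def simulate_sites(n_per_site=200, n_features=5, n_sites=3, seed=0):
    """Synthetic features with a site-specific additive shift and scale (assumed toy model)."""
    rng = np.random.default_rng(seed)
    X, site = [], []
    for s in range(n_sites):
        shift = rng.normal(0.0, 2.0, size=n_features)
        scale = rng.uniform(0.5, 1.5, size=n_features)
        X.append(shift + scale * rng.normal(size=(n_per_site, n_features)))
        site.append(np.full(n_per_site, s))
    return np.vstack(X), np.concatenate(site)


def name_that_dataset(X, site, seed=0):
    """Held-out accuracy of a nearest-centroid classifier that predicts the site label."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half = len(X) // 2
    train, test = idx[:half], idx[half:]
    labels = np.unique(site)
    # One centroid per site, estimated on the training half only.
    centroids = np.stack(
        [X[train][site[train] == s].mean(axis=0) for s in labels]
    )
    # Squared Euclidean distance from every test sample to every centroid.
    dists = ((X[test][:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = labels[dists.argmin(axis=1)]
    return float((pred == site[test]).mean())


def location_scale_harmonize(X, site):
    """Map each site's per-feature mean and std onto the pooled mean and std."""
    X_h = np.empty_like(X, dtype=float)
    pooled_mean, pooled_std = X.mean(axis=0), X.std(axis=0)
    for s in np.unique(site):
        mask = site == s
        z = (X[mask] - X[mask].mean(axis=0)) / X[mask].std(axis=0)
        X_h[mask] = pooled_mean + pooled_std * z
    return X_h


X, site = simulate_sites()
acc_before = name_that_dataset(X, site)
acc_after = name_that_dataset(location_scale_harmonize(X, site), site)
print(f"site accuracy before: {acc_before:.3f}, after: {acc_after:.3f}")
```

In this toy setting the site label should become much harder to predict after the adjustment. The same adjustment would also erase genuine between-site differences in the subjects themselves (for example, if one site scanned older participants), which is the kind of loss of subject-specific information the abstract warns about.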