Title

Avoiding Biased Clinical Machine Learning Model Performance Estimates in the Presence of Label Selection

Authors

Conor K. Corbin, Michael Baiocchi, Jonathan H. Chen

Abstract

When evaluating the performance of clinical machine learning models, one must consider the deployment population. When the population of patients with observed labels is only a subset of the deployment population (label selection), standard model performance estimates on the observed population may be misleading. In this study, we describe three classes of label selection and simulate five causally distinct scenarios to assess how particular selection mechanisms bias a suite of commonly reported binary machine learning model performance metrics. Simulations reveal that when selection is affected by observed features, naive estimates of model discrimination may be misleading. When selection is affected by labels, naive estimates of calibration fail to reflect reality. We borrow traditional weighting estimators from the causal inference literature and find that when selection probabilities are properly specified, they recover full-population estimates. We then tackle the real-world task of monitoring the performance of deployed machine learning models whose interactions with clinicians feed back into and affect the selection mechanism of the labels. We train three machine learning models to flag low-yield laboratory diagnostics and simulate their intended consequence of reducing wasteful laboratory utilization. We find that naive estimates of AUROC on the observed population undershoot actual performance by up to 20%. Such a disparity could be large enough to lead to the wrongful termination of a successful clinical decision support tool. We propose an altered deployment procedure, one that combines injected randomization with traditional weighted estimates, and find that it recovers true model performance.
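To make the weighting idea concrete, below is a minimal sketch (not the authors' code) of an inverse-probability-weighted AUROC estimate under feature-dependent label selection. The simulated population, variable names, and logistic selection model are illustrative assumptions introduced here for demonstration; the key requirement from the abstract is that the selection probabilities be correctly specified.

```python
# Minimal sketch (illustrative, not the authors' code): recovering a
# full-population AUROC estimate under feature-dependent label selection
# via inverse-probability weighting. The simulation setup and all names
# below are assumptions for demonstration only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Simulate a full deployment population: one feature x drives both the
# true label y and the model's risk score.
n = 50_000
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
score = 1.0 / (1.0 + np.exp(-(x + rng.normal(scale=0.5, size=n))))

# Feature-dependent label selection: patients with larger x are more
# likely to have an observed label (e.g., more likely to be tested).
p_select = 1.0 / (1.0 + np.exp(-(2.0 * x - 1.0)))
selected = rng.binomial(1, p_select).astype(bool)

# Naive estimate: AUROC computed only on the labeled subpopulation.
auc_naive = roc_auc_score(y[selected], score[selected])

# Weighted estimate: reweight each labeled patient by 1 / P(selected),
# assuming the selection probabilities are correctly specified.
weights = 1.0 / p_select[selected]
auc_weighted = roc_auc_score(
    y[selected], score[selected], sample_weight=weights
)

# Full-population reference, available here only because y is simulated.
auc_full = roc_auc_score(y, score)

print(f"naive AUROC:    {auc_naive:.3f}")
print(f"weighted AUROC: {auc_weighted:.3f}")
print(f"full AUROC:     {auc_full:.3f}")
```

One plausible reading of the abstract's altered deployment procedure: when clinician interactions with a deployed model feed back into label selection, the true selection probabilities are unknown, and injecting randomization into deployment makes each patient's probability of receiving an observed label known by construction, so the same weighting can then be applied.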
