Paper Title
Predicting Survival Outcomes in the Presence of Unlabeled Data
Authors
Abstract
Many clinical studies require the follow-up of patients over time. This is challenging: apart from frequently observed drop-out, there are often organizational and financial challenges, which can reduce data collection and, in turn, complicate subsequent analyses. In contrast, plenty of baseline data is often available for patients with similar characteristics and background information, e.g., patients who fall outside the study time window. In this article, we investigate whether including such unlabeled data instances helps to predict survival times accurately. In other words, we introduce a third level of supervision in the context of survival analysis: apart from fully observed and censored instances, we also include unlabeled instances. We propose three approaches to deal with this novel setting and provide an empirical comparison over fifteen real-life clinical and gene expression survival datasets. Our results demonstrate that all approaches are able to increase predictive performance on independent test data. We also show that integrating the partial supervision provided by censored data in a semi-supervised wrapper approach generally provides the best results, often achieving large improvements compared to not using unlabeled data.
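To make the three levels of supervision concrete, the following is a minimal sketch of how such a dataset could be represented. It is an illustration only, not the authors' implementation: the `Instance` class, the `supervision_level` helper, and the feature and time values are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """One patient record (hypothetical layout, not taken from the paper)."""
    features: List[float]           # baseline covariates
    time: Optional[float] = None    # follow-up time; None if there was no follow-up
    event: Optional[bool] = None    # True = event observed, False = censored


def supervision_level(inst: Instance) -> str:
    """Classify an instance as 'observed', 'censored', or 'unlabeled'."""
    if inst.time is None or inst.event is None:
        return "unlabeled"          # baseline data only
    return "observed" if inst.event else "censored"


# Made-up example cohort with one instance per supervision level.
cohort = [
    Instance([63.0, 1.0], time=14.2, event=True),   # event seen at t = 14.2
    Instance([58.0, 0.0], time=30.0, event=False),  # still event-free at t = 30.0
    Instance([71.0, 1.0]),                          # no follow-up at all
]

counts = Counter(supervision_level(inst) for inst in cohort)
print(dict(counts))
```

A censored instance still carries partial information (the event had not occurred by `time`), which is the partial supervision the abstract refers to. An unlabeled instance contributes only its features.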