Paper title

sEHR-CE: Language modelling of structured EHR data for efficient and generalizable patient cohort expansion

Authors

Munoz-Farre, Anna; Rose, Harry; Cakiroglu, Sera Aylin

Abstract

Electronic health records (EHR) offer unprecedented opportunities for in-depth clinical phenotyping and prediction of clinical outcomes. Combining multiple data sources is crucial to generate a complete picture of disease prevalence, incidence and trajectories. The standard approach to combining clinical data involves collating clinical terms across different terminology systems using curated maps, which are often inaccurate and/or incomplete. Here, we propose sEHR-CE, a novel framework based on transformers to enable integrated phenotyping and analyses of heterogeneous clinical datasets without relying on these mappings. We unify clinical terminologies using textual descriptors of concepts, and represent individuals' EHR as sections of text. We then fine-tune pre-trained language models to predict disease phenotypes more accurately than non-text and single terminology approaches. We validate our approach using primary and secondary care data from the UK Biobank, a large-scale research study. Finally, we illustrate in a type 2 diabetes use case how sEHR-CE identifies individuals without diagnosis that share clinical characteristics with patients.
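
The idea of unifying terminologies through textual descriptors and representing a record as text can be illustrated with a minimal sketch. This is not the authors' implementation: the lookup table `DESCRIPTORS`, the function `ehr_to_text`, the `primary_care` terminology label, the codes `PC001` and `PC002`, and the comma separator are all assumptions made for illustration. Each coded event is replaced by its description, so that codes from different systems end up in one shared text space that a pre-trained language model could then be fine-tuned on.

```python
from datetime import date

# Toy descriptor lookup keyed by (terminology, code).
# Illustrative only: a real system would load full terminology tables.
# The "primary_care" codes PC001/PC002 are hypothetical placeholders.
DESCRIPTORS = {
    ("icd10", "E11"): "non-insulin-dependent diabetes mellitus",
    ("icd10", "I10"): "essential (primary) hypertension",
    ("primary_care", "PC001"): "type 2 diabetes mellitus",
    ("primary_care", "PC002"): "raised blood pressure reading",
}


def ehr_to_text(events, sep=", "):
    """Turn coded events into one text string.

    events: iterable of (event_date, terminology, code) tuples.
    Events are ordered by date, codes without a known descriptor are
    skipped, and immediately repeated descriptors are collapsed.
    """
    ordered = sorted(events, key=lambda event: event[0])
    phrases = []
    for _, terminology, code in ordered:
        description = DESCRIPTORS.get((terminology, code))
        if description is None:
            continue  # no descriptor available for this code
        if not phrases or phrases[-1] != description:
            phrases.append(description)
    return sep.join(phrases)


if __name__ == "__main__":
    record = [
        (date(2012, 5, 1), "icd10", "E11"),
        (date(2010, 1, 3), "primary_care", "PC002"),
        (date(2011, 7, 9), "icd10", "ZZZ"),  # unknown code, dropped
    ]
    print(ehr_to_text(record))
```

Because the output is ordinary text, no curated cross-terminology mapping is needed in this sketch: two codes from different systems that describe the same concept simply yield similar wording.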
