Paper title
Predicting Multiple ICD-10 Codes from Brazilian-Portuguese Clinical Notes
Paper authors
Paper abstract
ICD coding from electronic clinical records is a manual, time-consuming and expensive process. Code assignment is nevertheless an important task for billing and for database organization. Many works have studied automated ICD coding from free text using machine learning techniques, but most use records in English, especially from the public MIMIC-III dataset. This work presents results for a dataset of Brazilian Portuguese clinical notes. We develop and optimize a Logistic Regression model, a Convolutional Neural Network (CNN), a Gated Recurrent Unit Neural Network and a CNN with Attention (CNN-Att) for the prediction of diagnosis ICD codes. We also report our results on the MIMIC-III dataset, which outperform previous work among models of the same families, as well as the state of the art. Compared to MIMIC-III, the Brazilian Portuguese dataset contains far fewer words per document when only discharge summaries are used. We experiment with concatenating additional documents available in this dataset, which substantially improves performance. The CNN-Att model achieves the best results on both datasets, with a micro-averaged F1 score of 0.537 on MIMIC-III and 0.485 on our dataset with additional documents.
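The micro-averaged F1 score reported in the abstract pools true positives, false positives and false negatives over every (document, code) pair before computing a single F1 value. The sketch below illustrates that general definition for multi-label predictions. It is not the authors' evaluation code: the function name `micro_f1`, the set-based input format and the toy ICD-10 codes are assumptions made for illustration only.

```python
from typing import Sequence, Set


def micro_f1(true_codes: Sequence[Set[str]], pred_codes: Sequence[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label predictions.

    Counts are pooled across all documents and all codes, so frequent
    codes weigh more heavily than rare ones.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_codes, pred_codes):
        tp += len(truth & pred)   # codes predicted and actually assigned
        fp += len(pred - truth)   # codes predicted but not assigned
        fn += len(truth - pred)   # codes assigned but not predicted
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


# Toy data (hypothetical documents, not taken from either dataset).
y_true = [{"I10", "E11"}, {"J18"}, {"I10"}]
y_pred = [{"I10"}, {"J18", "E11"}, {"N39"}]
score = micro_f1(y_true, y_pred)
print(f"micro-F1 = {score:.3f}")
```

In this toy case the pooled counts are 2 true positives, 2 false positives and 2 false negatives, which gives a micro-averaged F1 of 0.5.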