A Systematic Search over Deep Convolutional Neural Network Architectures for Screening Chest Radiographs

Paper Title

A Systematic Search over Deep Convolutional Neural Network Architectures for Screening Chest Radiographs

Authors

Arka Mitra, Arunava Chakravarty, Nirmalya Ghosh, Tandra Sarkar, Ramanathan Sethuraman, Debdoot Sheet

Abstract

Chest radiographs are primarily employed for screening pulmonary and cardio-/thoracic conditions. Because they are acquired at primary healthcare centers, they require an on-premise reporting radiologist, which is a challenge in low- and middle-income countries. This has inspired the development of machine-learning-based automation of the screening process. While recent efforts demonstrate a performance benchmark using an ensemble of deep convolutional neural networks (CNNs), our systematic search over multiple standard CNN architectures identified single candidate CNN models whose classification performance is on par with that of ensembles. Over 63 experiments spanning 400 hours, executed on an 11.3 FP32 TensorTFLOPS compute system, we found the Xception and ResNet-18 architectures to be consistent performers in identifying co-existing disease conditions, with an average AUC of 0.87 across nine pathologies. We assess the reliability of the models by examining their saliency maps, generated using the randomized input sampling for explanation (RISE) method, and qualitatively validating them against manual annotations locally sourced from an experienced radiologist. We also draw a critical note on the limitations of the publicly available CheXpert dataset, primarily the disparity in class distribution between the training and testing sets and the lack of sufficient samples for a few classes, which hampers quantitative reporting for those classes.
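The abstract names RISE as the saliency method but does not describe its configuration. As a rough illustration only, the core idea (score many randomly masked copies of the input, then average the masks weighted by the scores) can be sketched as below. This is not the authors' implementation: the function name `rise_saliency`, its parameters, and the toy scoring function are all hypothetical, and nearest-neighbour mask upsampling is used here as a simplification of the bilinear upsampling with random shifts used in the original RISE method.

```python
import numpy as np

def rise_saliency(image, score_fn, n_masks=500, grid=4, p_keep=0.5, seed=0):
    """Simplified RISE-style saliency for a 2-D image (illustrative only).

    score_fn is any black-box callable mapping an image to a scalar score,
    e.g. a classifier's probability for one pathology (assumed interface).
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    # Size of each coarse grid cell after upsampling (ceiling division).
    cell_h, cell_w = -(-h // grid), -(-w // grid)
    saliency = np.zeros((h, w), dtype=float)
    for _ in range(n_masks):
        # Coarse binary mask: each cell is kept with probability p_keep.
        coarse = (rng.random((grid, grid)) < p_keep).astype(float)
        # Nearest-neighbour upsampling, cropped to the image size.
        mask = np.kron(coarse, np.ones((cell_h, cell_w)))[:h, :w]
        # Weight the mask by the score obtained on the masked input.
        saliency += score_fn(image * mask) * mask
    # Normalise by the expected number of times a pixel is kept.
    return saliency / (n_masks * p_keep)

# Toy check: a "model" that only looks at the patch image[2:4, 2:4].
image = np.ones((8, 8))
toy_score = lambda x: float(x[2:4, 2:4].mean())
sal = rise_saliency(image, toy_score)
print(np.round(sal, 2))
```

In this toy setup the saliency is highest over the patch the scoring function actually reads, which is the behaviour one would hope to see when such maps are compared against a radiologist's annotations.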
