Paper title
The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms
Paper authors
Paper abstract
Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. However, there is currently no validated database of cough sounds with which to train such ML models. The COUGHVID dataset provides over 20,000 crowdsourced cough recordings representing a wide range of subject ages, genders, geographic locations, and COVID-19 statuses. First, we filtered the dataset using our open-sourced cough detection algorithm. Second, experienced pulmonologists labeled more than 2,000 recordings to diagnose medical abnormalities present in the coughs, thereby contributing one of the largest expert-labeled cough datasets in existence that can be used for a plethora of cough audio classification tasks. Finally, we ensured that coughs labeled as symptomatic and COVID-19 originate from countries with high infection rates, and that their expert labels are consistent. As a result, the COUGHVID dataset contributes a wealth of cough recordings for training ML models to address the world's most urgent health crises.