Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations

Paper title

Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations

Authors

Chen, Qingyu, Allot, Alexis, Leaman, Robert, Doğan, Rezarta Islamaj, Du, Jingcheng, Fang, Li, Wang, Kai, Xu, Shuo, Zhang, Yuefu, Bagherzadeh, Parsa, Bergler, Sabine, Bhatnagar, Aakash, Bhavsar, Nidhir, Chang, Yung-Chun, Lin, Sheng-Jie, Tang, Wentai, Zhang, Hongtong, Tavchioski, Ilija, Pollak, Senja, Tian, Shubo, Zhang, Jinfeng, Otmakhova, Yulia, Yepes, Antonio Jimeno, Dong, Hang, Wu, Honghan, Dufour, Richard, Labrak, Yanis, Chatterjee, Niladri, Tandon, Kushagri, Laleye, Fréjus, Rakotoson, Loïc, Chersoni, Emmanuele, Gu, Jinghang, Friedrich, Annemarie, Pujari, Subhash Chandra, Chizhikova, Mariia, Sivadasan, Naveen, Lu, Zhiyong

Abstract

The COVID-19 pandemic has been severely impacting global society since December 2019. Extensive research has been undertaken to understand the characteristics of the virus and to design vaccines and drugs. The related findings have been reported in biomedical literature at a rate of about 10,000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200,000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g., Diagnosis and Treatment) to the articles in LitCovid. Despite the continuing advances in biomedical text mining methods, few have been dedicated to topic annotations in COVID-19 literature. To close the gap, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30,000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multilabel classification datasets in biomedical scientific literature. Nineteen teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro F1-score, micro F1-score, and instance-based F1-score, respectively. The level of participation and results demonstrate a successful track and help close the gap between dataset curation and method development. The dataset is publicly available via https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/biocreative/ for benchmarking and further development.
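
The three scores reported in the abstract average F1 in different ways, and a small worked example may make the difference concrete. The sketch below is illustrative only and is not the track's official evaluation script: the helper names (`f1`, `multilabel_f1`), the four toy articles, and the `Prevention` label are assumptions made for the example, as is the convention of scoring an undefined F1 (zero denominator) as 0.0.

```python
from typing import List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when undefined (an assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(
    gold: List[Set[str]], pred: List[Set[str]], labels: List[str]
) -> Tuple[float, float, float]:
    """Return (macro, micro, instance-based) F1 for multi-label predictions."""
    # Per-label [tp, fp, fn] counts accumulated over all articles.
    counts = {lab: [0, 0, 0] for lab in labels}
    for g, p in zip(gold, pred):
        for lab in labels:
            if lab in g and lab in p:
                counts[lab][0] += 1
            elif lab in p:
                counts[lab][1] += 1
            elif lab in g:
                counts[lab][2] += 1

    # Macro: compute F1 per label, then average over labels.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)

    # Micro: pool the counts over all labels, then compute a single F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)

    # Instance-based: compute F1 per article, then average over articles.
    instance = sum(
        f1(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)
    ) / len(gold)
    return macro, micro, instance


# Toy data (hypothetical): four articles, three topic labels.
labels = ["Diagnosis", "Treatment", "Prevention"]
gold = [{"Diagnosis", "Treatment"}, {"Treatment"}, {"Prevention"}, {"Treatment"}]
pred = [{"Diagnosis"}, {"Treatment", "Prevention"}, {"Prevention"}, {"Treatment"}]

macro, micro, instance = multilabel_f1(gold, pred, labels)
print(f"macro={macro:.4f} micro={micro:.4f} instance={instance:.4f}")
```

On this toy data the three averages disagree (roughly 0.82, 0.80, and 0.83), because macro F1 weights every label equally, micro F1 weights every label assignment equally, and instance-based F1 weights every article equally.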
