Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification

Paper Title

Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification

Authors

Ali, Sarwan; Sahoo, Bikram; Zelikovskiy, Alexander; Chen, Pin-Yu; Patterson, Murray

Abstract

The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches, which offer an alternative way to extract such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
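
To make the idea of perturbing a genome sequence concrete, the following is a minimal sketch, not the method used in the paper. It assumes a uniform per-base substitution rate, whereas real Illumina and PacBio error profiles are position-dependent and also include insertions and deletions. The names `perturb_sequence`, `hamming_distance`, and the `error_rate` parameter are hypothetical and chosen for illustration only.

```python
import random

NUCLEOTIDES = "ACGT"


def perturb_sequence(seq, error_rate, seed=None):
    """Return a copy of `seq` with random substitution errors.

    Assumption: each A/C/G/T base is independently replaced by a
    different base with probability `error_rate`. Other symbols
    (for example 'N') are left unchanged.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0, 1]")
    rng = random.Random(seed)  # local generator, so runs are reproducible
    out = []
    for base in seq:
        if base in NUCLEOTIDES and rng.random() < error_rate:
            # Pick one of the three other bases uniformly at random.
            alternatives = [b for b in NUCLEOTIDES if b != base]
            out.append(rng.choice(alternatives))
        else:
            out.append(base)
    return "".join(out)


def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    original = "ATGGTTCAGNNACGT"  # toy sequence, not real SARS-CoV-2 data
    noisy = perturb_sequence(original, error_rate=0.1, seed=0)
    print(original)
    print(noisy)
    print("substitutions:", hamming_distance(original, noisy))
```

In a robustness benchmark of this kind, a classifier trained on clean sequences would be evaluated on the perturbed copies at increasing values of `error_rate`, and the drop in accuracy compared across embedding methods.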
