Paper title
MutaGAN: A Seq2seq GAN Framework to Predict Mutations of Evolving Protein Populations
Paper authors
Paper abstract
The ability to predict the evolution of a pathogen would significantly improve the ability to control, prevent, and treat disease. Despite significant progress in other problem spaces, deep learning has yet to contribute to the problem of predicting mutations in evolving populations. To address this gap, we developed a novel machine learning framework using generative adversarial networks (GANs) with recurrent neural networks (RNNs) to accurately predict genetic mutations and evolution of future biological populations. Using a generalized time-reversible phylogenetic model of protein evolution with bootstrapped maximum likelihood tree estimation, we trained a sequence-to-sequence generator within an adversarial framework, named MutaGAN, to generate complete protein sequences augmented with possible mutations of future virus populations. Influenza virus sequences were identified as an ideal test case for this deep learning framework because influenza is a significant human pathogen with new strains emerging annually, and because global surveillance efforts have generated a large amount of publicly available data from the National Center for Biotechnology Information's (NCBI) Influenza Virus Resource (IVR). MutaGAN generated "child" sequences from a given "parent" protein sequence with a median Levenshtein distance of 2.00 amino acids. Additionally, the generator was able to augment the majority of parent proteins with at least one mutation identified within the global influenza virus population. These results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting, with implications for broad utility in evolutionary prediction for any protein population.
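The abstract reports a median Levenshtein distance between parent and generated child sequences. As a minimal sketch of how such a metric could be computed (not the authors' evaluation code), the following uses a standard dynamic-programming edit distance over amino-acid strings. The function names `levenshtein` and `median_parent_child_distance` and the toy sequences are hypothetical and chosen only for illustration.

```python
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def median_parent_child_distance(pairs):
    """Median edit distance over (parent, child) sequence pairs."""
    return median(levenshtein(parent, child) for parent, child in pairs)


# Hypothetical toy sequences, not real influenza data.
toy_pairs = [
    ("MKTIIALSYIF", "MKTIIALSYIF"),  # identical
    ("MKTIIALSYIF", "MKTIVALSYIF"),  # one substitution
    ("MKTIIALSYIF", "MKAIIALSHIF"),  # two substitutions
    ("MKTIIALSYIF", "MKTIIALSYI"),   # one deletion
    ("MKTIIALSYIF", "MRTIVALSYLF"),  # three substitutions
]

if __name__ == "__main__":
    print(median_parent_child_distance(toy_pairs))
```

Because Levenshtein distance counts insertions and deletions as well as substitutions, it remains well defined when a generated sequence differs in length from its parent, which a position-by-position Hamming distance would not handle.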