Learning Geometrically Disentangled Representations of Protein Folding Simulations

Paper Title

Learning Geometrically Disentangled Representations of Protein Folding Simulations

Authors

Tatro, N. Joseph; Das, Payel; Chen, Pin-Yu; Chenthamarakshan, Vijil; Lai, Rongjie

Abstract

Massive molecular simulations of drug-target proteins have been used as a tool to understand disease mechanisms and develop therapeutics. This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein, e.g., the SARS-CoV-2 Spike protein, obtained from computationally expensive molecular simulations. Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules, as well as efficient generation of protein conformations that can serve as a complement to a molecular simulation engine. Specifically, we present a geometric autoencoder framework to learn separate latent space encodings of the intrinsic and extrinsic geometries of the protein structure. For this purpose, the proposed Protein Geometric AutoEncoder (ProGAE) model is trained on the protein contact map and the orientation of the backbone bonds of the protein. Using ProGAE latent embeddings, we reconstruct and generate the conformational ensemble of a protein at or near the experimental resolution, while gaining better interpretability and controllability in terms of protein structure generation from the learned latent space. Additionally, ProGAE models are transferable to a different state of the same protein or to a new protein of different size, where only the dense layer decoding from the latent representation needs to be retrained. Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations, charting the path toward scalable and improved approaches for analyzing and enhancing high-cost simulations of drug-target proteins.
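
The distinction the abstract draws between intrinsic and extrinsic geometry can be illustrated with a small sketch. This is not the ProGAE model or the authors' code; it only shows, on a toy four-atom backbone, that a contact map is unchanged by a rigid rotation (intrinsic) while backbone bond orientations change with it (extrinsic). The function names, the 8.0 distance cutoff, and the toy coordinates are all assumptions made for illustration.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Binary matrix: 1 where two atoms lie within `cutoff` of each other.

    The cutoff value is an illustrative assumption, not taken from the paper.
    """
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= cutoff else 0
             for j in range(n)]
            for i in range(n)]

def bond_orientations(coords):
    """Unit vectors pointing along each consecutive backbone bond."""
    directions = []
    for a, b in zip(coords, coords[1:]):
        length = math.dist(a, b)
        directions.append(tuple((q - p) / length for p, q in zip(a, b)))
    return directions

def rotate_z(coords, angle):
    """Rigidly rotate all points about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]

# Hypothetical toy backbone (coordinates chosen only for the example).
backbone = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 9.0)]
rotated = rotate_z(backbone, math.pi / 2)

# Intrinsic feature: identical before and after the rotation.
same_contacts = contact_map(backbone) == contact_map(rotated)

# Extrinsic feature: the first bond direction turns from the x-axis to the y-axis.
first_before = bond_orientations(backbone)[0]
first_after = bond_orientations(rotated)[0]
```

Here `same_contacts` is true because pairwise distances do not depend on how the structure is oriented in space, whereas `first_before` and `first_after` differ. Features of these two kinds are what the abstract describes as being encoded in separate latent spaces.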
