Deep Learning of Protein Structural Classes: Any Evidence for an 'Urfold'?

Title

Deep Learning of Protein Structural Classes: Any Evidence for an 'Urfold'?

Authors

Menuka Jaiswal, Saad Saleem, Yonghyeon Kweon, Eli J. Draizen, Stella Veretnik, Cameron Mura, Philip E. Bourne

Abstract

Recent computational advances in the accurate prediction of protein three-dimensional (3D) structures from amino acid sequences now present a unique opportunity to decipher the interrelationships between proteins. This task entails--but is not equivalent to--a problem of 3D structure comparison and classification. Historically, protein domain classification has been a largely manual and subjective activity, relying upon various heuristics. Databases such as CATH represent significant steps towards a more systematic (and automatable) approach, yet there still remains much room for the development of more scalable and quantitative classification methods, grounded in machine learning. We suspect that re-examining these relationships via a Deep Learning (DL) approach may entail a large-scale restructuring of classification schemes, improved with respect to the interpretability of distant relationships between proteins. Here, we describe our training of DL models on protein domain structures (and their associated physicochemical properties) in order to evaluate classification properties at CATH's "homologous superfamily" (SF) level. To achieve this, we have devised and applied an extension of image-classification methods and image segmentation techniques, utilizing a convolutional autoencoder model architecture. Our DL architecture allows models to learn structural features that, in a sense, 'define' different homologous SFs. We evaluate and quantify pairwise 'distances' between SFs by building one model per SF and comparing the loss functions of the models. Hierarchical clustering on these distance matrices provides a new view of protein interrelationships--a view that extends beyond simple structural/geometric similarity, and towards the realm of structure/function properties.
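The last two steps of the abstract (pairwise 'distances' derived from per-superfamily model losses, followed by hierarchical clustering) can be illustrated with a small sketch. The abstract does not say how losses are turned into distances, so the normalisation below (excess loss over each model's own baseline, averaged in both directions) is purely an assumption, as are the function names, the placeholder labels `SF_A`, `SF_B`, `SF_C`, and the toy loss values, which are invented for illustration and are not results from the paper. Average linkage is likewise just one possible clustering choice.

```python
from itertools import combinations


def loss_to_distance(loss):
    """Turn a cross-loss table into a symmetric distance matrix.

    Assumption: loss[i][j] is the mean reconstruction loss of the model
    trained on superfamily i when evaluated on domains of superfamily j.
    The symmetrisation used here is illustrative, not the paper's method.
    """
    n = len(loss)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        excess_ij = loss[i][j] - loss[i][i]  # how much worse model i does on j
        excess_ji = loss[j][i] - loss[j][j]  # how much worse model j does on i
        d = max(0.0, 0.5 * (excess_ij + excess_ji))
        dist[i][j] = dist[j][i] = d
    return dist


def average_linkage(dist, labels):
    """Naive average-linkage agglomerative clustering.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a sorted tuple of labels.
    """
    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Mean of all pairwise distances between members of a and b.
            d = sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((
            tuple(sorted(labels[i] for i in a)),
            tuple(sorted(labels[i] for i in b)),
            d,
        ))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


if __name__ == "__main__":
    # Toy, made-up numbers: rows are models, columns are evaluation sets.
    labels = ["SF_A", "SF_B", "SF_C"]
    toy_loss = [
        [0.10, 0.14, 0.50],
        [0.16, 0.12, 0.48],
        [0.55, 0.52, 0.20],
    ]
    for a, b, d in average_linkage(loss_to_distance(toy_loss), labels):
        print(a, "+", b, "at distance", round(d, 4))
```

In this toy setting, the two superfamilies whose models reconstruct each other's domains almost as well as their own are merged first. For realistic numbers of superfamilies, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would be the more practical choice than this quadratic-per-step loop.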
