Multi-Scale Representation Learning on Proteins

Paper Title

Multi-Scale Representation Learning on Proteins

Authors

Vignesh Ram Somnath, Charlotte Bunne, Andreas Krause

Abstract

Proteins are fundamental biological entities mediating key roles in cellular function and disease. This paper introduces a multi-scale graph construction of a protein -- HoloProt -- connecting surface to structure and sequence. The surface captures coarser details of the protein, while sequence as primary component and structure -- comprising secondary and tertiary components -- capture finer details. Our graph encoder then learns a multi-scale representation by allowing each level to integrate the encoding from level(s) below with the graph at that level. We test the learned representation on different tasks, (i.) ligand binding affinity (regression), and (ii.) protein function prediction (classification). On the regression task, contrary to previous methods, our model performs consistently and reliably across different dataset splits, outperforming all baselines on most splits. On the classification task, it achieves a performance close to the top-performing model while using 10x fewer parameters. To improve the memory efficiency of our construction, we segment the multiplex protein surface manifold into molecular superpixels and substitute the surface with these superpixels at little to no performance loss.
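
The bottom-up encoding described in the abstract, where each level integrates the encoding from the level below with its own graph, can be illustrated with a toy sketch. This is not the HoloProt implementation. It assumes a plain mean-aggregation message-passing layer, random features and weights, and a hypothetical assignment matrix `assign` that maps residues to surface nodes; all function and variable names are illustrative only.

```python
import numpy as np

def message_pass(features, adjacency, weight):
    """One round of mean-aggregation message passing (illustrative layer, an assumption)."""
    # Add self-loops so each node keeps its own features.
    adj = adjacency + np.eye(adjacency.shape[0])
    # Row-normalise so each node averages over itself and its neighbours.
    adj = adj / adj.sum(axis=1, keepdims=True)
    return np.tanh(adj @ features @ weight)

def multiscale_encode(res_feat, res_adj, surf_feat, surf_adj, assign, w_res, w_surf):
    """Encode the residue level first, then feed its encoding into the surface level."""
    # Lower level: encode the residue graph (sequence and structure).
    res_enc = message_pass(res_feat, res_adj, w_res)
    # Lift: each surface node averages the encodings of its assigned residues.
    counts = np.maximum(assign.sum(axis=1, keepdims=True), 1)
    lifted = (assign @ res_enc) / counts
    # Upper level: surface features concatenated with the lifted encodings.
    surf_in = np.concatenate([surf_feat, lifted], axis=1)
    surf_enc = message_pass(surf_in, surf_adj, w_surf)
    # Graph-level representation: mean over surface nodes.
    return surf_enc.mean(axis=0)

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
n_res, n_surf, d_res, d_surf, d_hid = 6, 3, 4, 5, 8

# Residue graph: a simple chain, standing in for sequence/structure edges.
res_adj = np.zeros((n_res, n_res))
for i in range(n_res - 1):
    res_adj[i, i + 1] = res_adj[i + 1, i] = 1.0

# Surface graph: fully connected over three surface nodes.
surf_adj = np.ones((n_surf, n_surf)) - np.eye(n_surf)

# Hypothetical assignment: residues 0-1 -> node 0, 2-3 -> node 1, 4-5 -> node 2.
assign = np.zeros((n_surf, n_res))
assign[np.arange(n_res) // 2, np.arange(n_res)] = 1.0

res_feat = rng.normal(size=(n_res, d_res))
surf_feat = rng.normal(size=(n_surf, d_surf))
w_res = rng.normal(size=(d_res, d_hid))
w_surf = rng.normal(size=(d_surf + d_hid, d_hid))

embedding = multiscale_encode(res_feat, res_adj, surf_feat, surf_adj,
                              assign, w_res, w_surf)
print(embedding.shape)
```

The resulting vector could then be passed to a regression or classification head for the two tasks named in the abstract. Replacing the surface nodes with a smaller set of segments (the "molecular superpixels") would, in this sketch, simply mean using fewer rows in `assign`, `surf_feat`, and `surf_adj`.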
