Paper Title

Pre-training Protein Language Models with Label-Agnostic Binding Pairs Enhances Performance in Downstream Tasks

Authors

Modestas Filipavicius, Matteo Manica, Joris Cadow, Maria Rodriguez Martinez

Abstract

Less than 1% of protein sequences are structurally and functionally annotated. The Natural Language Processing (NLP) community has recently embraced self-supervised learning as a powerful approach to learning representations from unlabeled text, in large part due to attention-based, context-aware Transformer models. In this work, we present a modification to the RoBERTa model: during pre-training, the model is fed a mixture of binding and non-binding protein sequences (from the STRING database). However, the sequence pairs carry no label indicating their binding status, as the model relies solely on the Masked Language Modeling (MLM) objective during pre-training. After fine-tuning, this approach surpasses models trained on single protein sequences on protein-protein binding prediction, TCR-epitope binding prediction, cellular localization, and remote homology classification tasks. We suggest that the Transformer's attention mechanism contributes to protein binding site discovery. Furthermore, we compress protein sequences by 64% with a Byte Pair Encoding (BPE) vocabulary of 10K subwords, each around 3-4 amino acids long. Finally, to expand the model input space to even larger proteins and multi-protein assemblies, we pre-train Longformer models that support 2,048 tokens. Further work on token-level classification for secondary structure prediction is needed. Code available at: https://github.com/PaccMann/paccmann_proteomics
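
The pre-training recipe described in the abstract (label-agnostic sequence pairs, an MLM-only objective, and a BPE subword vocabulary) can be illustrated with a short sketch. The snippet below is not the authors' released code (see the linked repository); it assumes a hypothetical BPE tokenizer trained on amino-acid sequences and shows how a binding pair could be encoded as a single input and trained with masked language modeling as the only signal.

```python
# Illustrative sketch (not the authors' implementation): RoBERTa MLM pre-training
# on paired protein sequences. Pairs are joined via the tokenizer's separator and
# carry no binding/non-binding label; masking is the only training objective.
from typing import Optional

from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Hypothetical path: a BPE tokenizer with ~10K subwords trained on protein sequences.
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/protein_bpe_tokenizer")

def encode_pair(seq_a: str, seq_b: Optional[str] = None, max_len: int = 512):
    """Tokenize a single sequence, or a label-agnostic pair as one joint input."""
    if seq_b is None:
        return tokenizer(seq_a, truncation=True, max_length=max_len)
    return tokenizer(seq_a, seq_b, truncation=True, max_length=max_len)

config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    max_position_embeddings=514,  # standard RoBERTa; a Longformer variant would extend this to 2,048+
)
model = RobertaForMaskedLM(config)

# 15% of tokens are masked at random; the binding status of a pair is never provided.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# A training dataset would yield encode_pair(...) outputs for a mixture of binding
# and non-binding pairs; the collator and model can then be passed to a standard
# transformers Trainer for MLM pre-training.
```

For proteins or multi-protein assemblies that exceed RoBERTa's input length, the paper swaps in Longformer models supporting 2,048 tokens; the same pairing and masking setup applies, only the model configuration changes.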
