Paper Title
BatmanNet: Bi-branch Masked Graph Transformer Autoencoder for Molecular Representation
Paper Authors
Paper Abstract
Although substantial efforts have been made using graph neural networks (GNNs) for AI-driven drug discovery (AIDD), effective molecular representation learning remains an open challenge, especially when labeled molecules are scarce. Recent studies suggest that large GNN models pre-trained by self-supervised learning on unlabeled datasets transfer better to downstream molecular property prediction tasks. However, the approaches in these studies require multiple complex self-supervised tasks and large-scale datasets, making them time-consuming, computationally expensive, and difficult to pre-train end-to-end. Here, we design a simple yet effective self-supervised strategy to simultaneously learn local and global information about molecules, and further propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations. BatmanNet features two tailored, complementary, and asymmetric graph autoencoders that reconstruct the missing nodes and edges, respectively, from a masked molecular graph. With this design, BatmanNet effectively captures the underlying structural and semantic information of molecules, thus improving the quality of molecular representations. BatmanNet achieves state-of-the-art results on 13 benchmark datasets across multiple drug discovery tasks, including molecular property prediction, drug-drug interaction, and drug-target interaction, demonstrating its great potential and superiority in molecular representation learning.
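For intuition only, below is a minimal, hypothetical PyTorch sketch of the bi-branch masked-autoencoder idea described in the abstract: a fraction of node and edge features is masked, the visible graph is encoded, and two lightweight decoders separately reconstruct the masked nodes and edges. All module names, dimensions, and architectural details here are illustrative assumptions, not the authors' BatmanNet implementation.

```python
# Illustrative sketch of a bi-branch masked graph autoencoder.
# All design choices (shared encoder, dimensions, mask ratio) are assumptions
# for demonstration and do not reproduce the paper's architecture.
import torch
import torch.nn as nn


class BiBranchMaskedGraphAE(nn.Module):
    """Mask a fraction of node/edge features, encode the visible graph,
    and reconstruct the masked entries with two small decoders."""

    def __init__(self, node_dim=32, edge_dim=16, hidden=64, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.node_proj = nn.Linear(node_dim, hidden)
        self.edge_proj = nn.Linear(edge_dim, hidden)
        # Transformer-style encoder over node + edge tokens (assumed shared).
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Two lightweight, asymmetric decoders: one branch per reconstruction target.
        self.node_decoder = nn.Linear(hidden, node_dim)
        self.edge_decoder = nn.Linear(hidden, edge_dim)

    def _mask(self, x):
        """Zero out a random subset of rows; return masked features and the hidden-row mask."""
        keep = torch.rand(x.size(0), device=x.device) > self.mask_ratio
        masked = x.clone()
        masked[~keep] = 0.0
        return masked, ~keep

    def forward(self, node_feats, edge_feats):
        # node_feats: (num_nodes, node_dim); edge_feats: (num_edges, edge_dim)
        masked_nodes, node_hidden = self._mask(node_feats)
        masked_edges, edge_hidden = self._mask(edge_feats)
        tokens = torch.cat([self.node_proj(masked_nodes),
                            self.edge_proj(masked_edges)], dim=0).unsqueeze(0)
        z = self.encoder(tokens).squeeze(0)
        z_nodes, z_edges = z[: node_feats.size(0)], z[node_feats.size(0):]
        # Each branch reconstructs only its own masked entries.
        node_loss = nn.functional.mse_loss(self.node_decoder(z_nodes)[node_hidden],
                                           node_feats[node_hidden])
        edge_loss = nn.functional.mse_loss(self.edge_decoder(z_edges)[edge_hidden],
                                           edge_feats[edge_hidden])
        return node_loss + edge_loss


# Toy usage: one random "molecular graph" with 10 atoms and 20 directed bonds.
model = BiBranchMaskedGraphAE()
loss = model(torch.randn(10, 32), torch.randn(20, 16))
loss.backward()
```

In this toy setup the two reconstruction losses are simply summed; how the actual model weights, routes, and decodes the node and edge branches is specific to the paper and not reproduced here.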