Paper Title


BARTSmiles: Generative Masked Language Models for Molecular Representations

Authors

Gayane Chilingaryan, Hovhannes Tamoyan, Ani Tevosyan, Nelly Babayan, Lusine Khondkaryan, Karen Hambardzumyan, Zaven Navoyan, Hrant Khachatrian, Armen Aghajanyan

Abstract


We discover a robust self-supervised strategy tailored towards molecular representations for generative masked language models through a series of tailored, in-depth ablations. Using this pre-training strategy, we train BARTSmiles, a BART-like model with an order of magnitude more compute than previous self-supervised molecular representations. In-depth evaluations show that BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new state-of-the-art on 11 tasks. We then quantitatively show that when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. For example, by selecting seven neurons from a frozen BARTSmiles, we can obtain a model with performance within two percentage points of the fully fine-tuned model on the ClinTox task. Lastly, we show that standard attribution interpretability methods, when applied to BARTSmiles, highlight certain substructures that chemists use to explain specific properties of molecules. The code and the pretrained model are publicly available.
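To make the "few-neuron probe" claim in the abstract concrete, below is a minimal, hypothetical sketch of the idea: given frozen encoder features for each molecule, select a handful of individual neurons that correlate with the label and fit a tiny linear probe on just those dimensions. The feature matrix here is synthetic stand-in data, and the univariate selection criterion (`f_classif`) and pooling of BARTSmiles encoder states into one vector per SMILES string are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch: probe a frozen representation with only seven neurons.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Stand-in for frozen BARTSmiles features: 1000 "molecules" x 1024 dimensions.
X = rng.normal(size=(1000, 1024))
# Stand-in binary labels (e.g., a toxicity endpoint such as ClinTox),
# generated so that only a few feature dimensions are actually informative.
w = rng.normal(size=1024) * (rng.random(1024) < 0.01)
y = (X @ w + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Pick seven neurons by a univariate score, then fit a linear probe on them.
selector = SelectKBest(f_classif, k=7).fit(X_tr, y_tr)
probe = LogisticRegression(max_iter=1000).fit(selector.transform(X_tr), y_tr)

scores = probe.predict_proba(selector.transform(X_te))[:, 1]
print("ROC-AUC of the 7-neuron probe:", round(roc_auc_score(y_te, scores), 3))
```

In practice the feature matrix would come from running the frozen BARTSmiles encoder over SMILES strings and pooling token states; the point of the sketch is only that a linear readout of a few fixed neurons can already carry much of the task signal.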
