Paper title
Seq2Mol: Automatic design of de novo molecules conditioned by the target protein sequences through deep neural networks
Paper authors
Paper abstract
De novo molecular design has recently benefited from generative deep neural networks. Current approaches aim to generate molecules that either resemble the molecules of the training set in their properties or are optimized with respect to specific physicochemical properties. None of these methods generates molecules specific to a target protein. Here, we introduce a method that is conditioned on the protein target sequence to generate de novo molecules relevant to the target. We use an implementation adapted from Google's "Show and Tell" image caption generation method to generate SMILES strings of molecules from protein sequence embeddings, which are contextualized embedding vectors produced by the deep bidirectional language model ELMo. Using reinforcement learning, the trained model is further optimized through augmented episodic likelihood to increase the diversity of the generated compounds compared to the training set. We used the model to generate compounds for two major drug target families, GPCRs and tyrosine kinases. The generated compounds are structurally different from the training set, while also being more similar to compounds known to bind to the two families of drug targets than a random set of molecules is. The compounds further display reasonable synthesizability and drug-likeness scores.
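The abstract names augmented episodic likelihood but does not state the objective. The following is a minimal sketch, assuming the formulation commonly used for this technique in SMILES generators: the log-likelihood of a sampled sequence under a fixed prior model is shifted by a scaled score, and the trainable agent is penalized by the squared difference between that target and its own log-likelihood. The function names, the `sigma` value, and the score passed in are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def sequence_log_likelihood(token_probs: Sequence[float]) -> float:
    """Sum of log-probabilities of the tokens in one sampled SMILES string."""
    return sum(math.log(p) for p in token_probs)


def augmented_log_likelihood(prior_ll: float, score: float, sigma: float) -> float:
    """Prior log-likelihood shifted by a scaled score (assumed form)."""
    return prior_ll + sigma * score


def episodic_loss(agent_ll: float, prior_ll: float, score: float,
                  sigma: float = 10.0) -> float:
    """Squared gap between the augmented target and the agent's log-likelihood.

    sigma is a hypothetical weighting; the abstract gives no value.
    """
    target = augmented_log_likelihood(prior_ll, score, sigma)
    return (target - agent_ll) ** 2


# Hypothetical numbers for one sampled molecule.
loss_matched = episodic_loss(agent_ll=-20.0, prior_ll=-25.0, score=0.5)
loss_unmatched = episodic_loss(agent_ll=-30.0, prior_ll=-25.0, score=1.0)
```

In this sketch the loss is zero when the agent already assigns the sequence the augmented likelihood, and it grows as the agent under-weights a highly scored sequence. A score that rewards dissimilarity to the training set would be one way to push toward the diversity goal stated above, though the abstract does not say which score was used.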