Paper title
Characterizing the Latent Space of Molecular Deep Generative Models with Persistent Homology Metrics
Authors
Abstract
Deep generative models are increasingly integral to the in silico molecule design pipeline. They have the dual goals of learning the chemical and structural features that make candidate molecules viable while remaining flexible enough to generate novel designs. Specifically, variational autoencoders (VAEs) are generative models in which an encoder-decoder network pair is trained to reconstruct the training data distribution in such a way that the latent space of the encoder network is smooth. Novel candidates can therefore be found by sampling from this latent space. However, the space of architectures and hyperparameters is vast, and choosing the best combination for in silico discovery has important implications for downstream success. It is therefore important to develop a principled methodology for assessing how well a given generative model learns salient molecular features. In this work, we propose a method for measuring how well the latent space of a deep generative model encodes structural and chemical features of molecular datasets by correlating latent space metrics with metrics from the field of topological data analysis (TDA). We apply our evaluation methodology to a VAE trained on SMILES strings and show that 3D topology information is consistently encoded throughout the latent space of the model.
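The abstract does not say which latent space metrics or which TDA metrics are correlated, so the following is only a minimal sketch of the general idea under explicit assumptions, not the authors' method. It assumes Euclidean distance between latent vectors, 0-dimensional persistent homology of a Vietoris-Rips filtration built on each molecule's 3D atom coordinates, a crude distance between sorted death times (not a Wasserstein or bottleneck distance), and Spearman rank correlation. All function names and the toy data are hypothetical.

```python
import itertools
import math


def h0_death_times(points):
    """Death times of H0 classes in a Vietoris-Rips filtration.

    These are the edge lengths at which connected components merge,
    i.e. the edge lengths of a minimum spanning tree (Kruskal + union-find).
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(length)
    return deaths  # n - 1 values in ascending order


def topo_distance(deaths_a, deaths_b):
    """Assumed surrogate: L2 distance between sorted, zero-padded death times."""
    a = sorted(deaths_a, reverse=True)
    b = sorted(deaths_b, reverse=True)
    size = max(len(a), len(b))
    a += [0.0] * (size - len(a))
    b += [0.0] * (size - len(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (undefined if either input is constant)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def latent_topology_correlation(latents, conformers):
    """Correlate pairwise latent distances with pairwise topological distances."""
    diagrams = [h0_death_times(c) for c in conformers]
    pairs = list(itertools.combinations(range(len(latents)), 2))
    latent_d = [math.dist(latents[i], latents[j]) for i, j in pairs]
    topo_d = [topo_distance(diagrams[i], diagrams[j]) for i, j in pairs]
    return spearman(latent_d, topo_d)


# Toy data (hypothetical): one "molecule" rescaled, with the scale as latent code.
base = [(0, 0, 0), (1, 0, 0), (1, 2, 0), (4, 2, 1)]
scales = [1.0, 2.0, 4.0, 8.0]
latents = [(s,) for s in scales]
conformers = [[tuple(s * c for c in p) for p in base] for s in scales]
print(latent_topology_correlation(latents, conformers))
```

In practice the latent vectors would come from the trained VAE encoder and the coordinates from 3D conformers of the encoded molecules. A dedicated TDA library would normally replace the hand-written H0 routine, since it can also compute higher-dimensional homology and standard diagram distances. A correlation near 1 would suggest that molecules close in latent space also have similar topological summaries.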