Paper title
Hydration free energies from kernel-based machine learning: Compound-database bias
Paper authors
Paper abstract
We consider the prediction of a basic thermodynamic property---hydration free energies---across a large subset of the chemical space of small organic molecules. Our in silico study is based on computer simulations at the atomistic level with implicit solvent. We report on a kernel-based machine learning approach that is inspired by recent work in learning electronic properties, but differs in key aspects: The representation is averaged over several conformers to account for the statistical ensemble. We also include an atomic-decomposition ansatz, which we show offers significant added transferability compared to molecular learning. Finally, we explore the existence of severe biases from databases of experimental compounds. By performing a combination of dimensionality reduction and cross-learning models, we show that the rate of learning depends significantly on the breadth and variety of the training dataset. Our study highlights the dangers of fitting machine-learning models to databases of narrow chemical range.
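As an illustration of the conformer-averaging idea mentioned in the abstract, the following minimal sketch averages per-conformer descriptor vectors into one molecular representation and fits a kernel ridge regression on it. The abstract does not specify the descriptor, the kernel, or the hyperparameters, so the Laplacian kernel, the values of `sigma` and `lam`, the random toy descriptors, and all function names here are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np

def average_over_conformers(conformer_descriptors):
    """Average per-conformer descriptor vectors into one molecular vector."""
    return np.mean(np.asarray(conformer_descriptors, dtype=float), axis=0)

def laplacian_kernel(A, B, sigma):
    """Laplacian kernel exp(-||a - b||_1 / sigma) for all row pairs (assumed kernel)."""
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-l1 / sigma)

def fit_krr(X_train, y_train, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    K = laplacian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def predict_krr(X_train, alpha, X_new, sigma):
    """Predict targets for new molecules from the fitted weights."""
    return laplacian_kernel(X_new, X_train, sigma) @ alpha

# Toy data: 20 hypothetical molecules, 5 conformers each, 4-dimensional descriptors.
rng = np.random.default_rng(0)
conformers = rng.normal(size=(20, 5, 4))
X = np.stack([average_over_conformers(c) for c in conformers])
y = rng.normal(size=20)  # stand-in for hydration free energies

alpha = fit_krr(X, y, sigma=2.0, lam=1e-8)
y_fit = predict_krr(X, alpha, X, sigma=2.0)
```

With a near-zero regularizer the model interpolates its training targets, so `y_fit` closely matches `y`; assessing the transferability discussed in the abstract would require held-out molecules, ideally drawn from a different compound database.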