Title
Unraveling Key Elements Underlying Molecular Property Prediction: A Systematic Study
Authors
Abstract
Artificial intelligence (AI) has been widely applied in drug discovery, with molecular property prediction as a major task. Despite booming techniques in molecular representation learning, the key elements underlying molecular property prediction remain largely unexplored, which impedes further advances in the field. Herein, we conduct an extensive evaluation of representative models using various representations on the MoleculeNet datasets, a suite of opioid-related datasets, and two additional activity datasets from the literature. To investigate predictive power in the low-data and high-data regimes, we also assemble a series of descriptor datasets of varying sizes to evaluate the models. In total, we trained 62,820 models: 50,220 on fixed representations, 4,200 on SMILES sequences, and 8,400 on molecular graphs. Based on extensive experimentation and rigorous comparison, we show that representation learning models exhibit limited performance in molecular property prediction on most datasets. In addition, multiple key elements underlying molecular property prediction can affect the evaluation results. Furthermore, we show that activity cliffs can significantly impact model prediction. Finally, we explore potential reasons why representation learning models can fail and show that dataset size is essential for representation learning models to excel.