Paper Title
Ensemble Spectral Prediction (ESP) Model for Metabolite Annotation
Paper Authors
Abstract
A key challenge in metabolomics is annotating the spectra measured from a biological sample with chemical identities. Currently, only a small fraction of measurements can be assigned identities. Two complementary computational approaches have emerged to address the annotation problem: mapping candidate molecules to spectra, and mapping query spectra to molecular candidates. In both cases, the candidate molecule whose spectrum best explains the query spectrum is recommended as the target molecule. Although candidate ranking is fundamental to both approaches, no prior work has used a rank learning task to determine the target molecule. We propose a novel machine learning model, Ensemble Spectral Prediction (ESP), for metabolite annotation. ESP builds on prior neural network-based annotation models that use multilayer perceptron (MLP) networks and Graph Neural Networks (GNNs). Based on the ranking results of the MLP-based and GNN-based models, ESP learns a weighting of the outputs of the MLP and GNN spectral predictors to generate a spectral prediction for a query molecule. Importantly, the training data is stratified by molecular formula so that candidate sets are available during model training. Further, the baseline MLP and GNN models are enhanced by modeling peak dependencies through a multi-head attention mechanism and by multi-tasking on spectral topic distributions. ESP improves average rank by 41% and 30% over the MLP and GNN baselines, respectively, a substantial performance gain over state-of-the-art neural network approaches. We show that annotation performance, for ESP and other models, depends strongly on the number of molecules in the candidate set and on their similarity to the target molecule.
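As a rough illustration of the idea described in the abstract, the sketch below blends two predicted spectra and ranks candidates against a query spectrum. It is not the authors' implementation: the scalar weight `w`, the use of cosine similarity as the matching score, and all function names are assumptions made for this example. In ESP the weighting is learned from ranking results, whereas here it is simply passed in.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ensemble_spectrum(mlp_pred, gnn_pred, w):
    """Blend two predicted spectra bin by bin.

    Assumption: a single scalar weight w in [0, 1]; the paper's learned
    weighting may take a different form.
    """
    return [w * m + (1 - w) * g for m, g in zip(mlp_pred, gnn_pred)]


def rank_of_target(query, candidate_preds, target_idx):
    """Rank of the target among candidates (1 is best).

    Each candidate is scored by the similarity of its predicted spectrum
    to the query spectrum; the rank is one plus the number of candidates
    that score strictly higher than the target.
    """
    scores = [cosine(query, pred) for pred in candidate_preds]
    return 1 + sum(s > scores[target_idx] for s in scores)


# Hypothetical toy data: two candidates, spectra binned into two bins.
query = [1.0, 0.0]
mlp_preds = [[1.0, 0.0], [0.0, 1.0]]
gnn_preds = [[0.8, 0.2], [0.1, 0.9]]
blended = [ensemble_spectrum(m, g, 0.5) for m, g in zip(mlp_preds, gnn_preds)]
target_rank = rank_of_target(query, blended, target_idx=0)
```

Averaging `rank_of_target` over many query spectra gives an average-rank metric of the kind reported in the abstract, where lower is better.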