Paper title
Impact of phylogeny on structural contact inference from protein sequence data
Paper authors
Paper abstract
Local and global inference methods have been developed to infer structural contacts from multiple sequence alignments of homologous proteins. They rely on correlations in amino-acid usage at contacting sites. Because homologous proteins share a common ancestry, their sequences also feature phylogenetic correlations, which can impair contact inference. We investigate this effect by generating controlled synthetic data from a minimal model where the importance of contacts and of phylogeny can be tuned. We demonstrate that global inference methods, specifically Potts models, are more resilient to phylogenetic correlations than local methods, based on covariance or mutual information. This holds whether or not phylogenetic corrections are used, and may explain the success of global methods. We analyse the roles of selection strength and of phylogenetic relatedness. We show that sites that mutate early in the phylogeny yield false positive contacts. We consider natural data and realistic synthetic data, and our findings generalise to these cases. Our results highlight the impact of phylogeny on contact prediction from protein sequences and illustrate the interplay between the rich structure of biological data and inference.
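To make the notion of a local method concrete, the following is a minimal sketch of scoring pairs of alignment columns by mutual information. It is an illustration only, not the pipeline used in the paper: the function names, the toy alignment, and the absence of pseudocounts, sequence reweighting, and any phylogenetic correction are all assumptions made here for brevity.

```python
import math
from collections import Counter
from itertools import combinations


def column_mutual_information(msa, i, j):
    """Mutual information (in nats) between columns i and j of an alignment.

    `msa` is a list of equal-length strings. Frequencies are raw counts,
    with no pseudocount and no sequence weighting (an assumption of this sketch).
    """
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written in terms of raw counts
        ratio = count * n / (counts_i[a] * counts_j[b])
        mi += p_ab * math.log(ratio)
    return mi


def rank_pairs(msa):
    """Return all column pairs (i, j), i < j, sorted by decreasing mutual information."""
    length = len(msa[0])
    scores = {
        (i, j): column_mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy alignment: columns 0 and 1 covary perfectly (A with K, G with R).
toy_msa = ["AKLE", "AKVD", "GRLE", "GRVD", "AKLD", "GRVE"]
top_pair = rank_pairs(toy_msa)[0]
```

In this toy alignment the top-ranked pair is the perfectly covarying pair of columns. On real alignments, shared ancestry also inflates such pairwise scores, which is the effect the abstract describes; a global method such as a Potts model instead fits all pairwise couplings jointly.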