Sparse generative modeling via parameter-reduction of Boltzmann machines: application to protein-sequence families

Paper title

Sparse generative modeling via parameter-reduction of Boltzmann machines: application to protein-sequence families

Authors

Barrat-Charlaix, Pierre; Muntoni, Anna Paola; Shimagaki, Kai; Weigt, Martin; Zamponi, Francesco

Abstract

Boltzmann machines (BM) are widely used as generative models. For example, pairwise Potts models (PM), which are instances of the BM class, provide accurate statistical models of families of evolutionarily related protein sequences. Their parameters are the local fields, which describe site-specific patterns of amino-acid conservation, and the two-site couplings, which mirror the coevolution between pairs of sites. This coevolution reflects structural and functional constraints acting on protein sequences during evolution. The most conservative choice to describe the coevolution signal is to include all possible two-site couplings in the PM. This choice, typical of what is known as Direct Coupling Analysis, has been successful in predicting residue contacts in the three-dimensional structure, in predicting mutational effects, and in generating new functional sequences. However, the resulting PM suffers from important over-fitting effects: many couplings are small, noisy and hardly interpretable, and the PM is close to a critical point, meaning that it is highly sensitive to small parameter perturbations. In this work, we introduce a general parameter-reduction procedure for BMs, via a controlled iterative decimation of the less statistically significant couplings, identified by an information-based criterion that selects either weak or statistically unsupported couplings. For several protein families, our procedure allows one to remove more than $90\%$ of the PM couplings while preserving the predictive and generative properties of the original dense PM. The resulting model is far away from criticality and hence more robust to noise.
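The iterative decimation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`decimation_step`, `decimate_to_sparsity`), the `fraction` and `target_sparsity` parameters, and the use of the absolute coupling value as the score are all assumptions made for illustration. In particular, `np.abs` only stands in for the "weak coupling" half of the information-based criterion, the symmetry of Potts couplings is ignored, and the re-fitting of the remaining parameters that a real procedure would presumably perform between steps is omitted.

```python
import numpy as np


def decimation_step(couplings, scores, active, fraction=0.1):
    """Zero out the lowest-scoring `fraction` of the currently active couplings.

    couplings, scores: float arrays of identical shape.
    active: boolean array of the same shape, True where a coupling is still kept.
    Returns new (couplings, active) arrays; the inputs are not modified.
    """
    active_idx = np.flatnonzero(active)              # flat indices of kept couplings
    n_remove = int(np.floor(fraction * active_idx.size))
    new_couplings = couplings.copy()
    new_active = active.copy()
    if n_remove == 0:
        return new_couplings, new_active
    # Sort only the active couplings by score, lowest first.
    order = np.argsort(scores.ravel()[active_idx], kind="stable")
    drop = active_idx[order[:n_remove]]
    new_couplings.flat[drop] = 0.0
    new_active.flat[drop] = False
    return new_couplings, new_active


def decimate_to_sparsity(couplings, score_fn, target_sparsity=0.9, fraction=0.1):
    """Repeat decimation steps until at least `target_sparsity` of couplings are removed.

    score_fn is a hypothetical stand-in for the information-based criterion:
    it maps the current coupling array to one score per coupling.
    """
    current = couplings.copy()
    active = np.ones(couplings.shape, dtype=bool)
    while 1.0 - active.mean() < target_sparsity:
        scores = score_fn(current)
        current, new_active = decimation_step(current, scores, active, fraction)
        if new_active.sum() == active.sum():
            break                                    # nothing left to remove at this fraction
        active = new_active
        # A real procedure would re-fit the remaining fields and couplings here.
    return current, active


# Usage on random toy couplings with shape (L, L, q, q); the sizes are arbitrary.
rng = np.random.default_rng(0)
toy_couplings = rng.normal(size=(6, 6, 4, 4))
sparse_couplings, kept = decimate_to_sparsity(toy_couplings, np.abs)
print("fraction removed:", 1.0 - kept.mean())
```

Removing a small fraction per step, rather than thresholding once, mirrors the "controlled iterative" character of the procedure: each step leaves room to re-estimate the model before the next batch of couplings is scored.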
