Paper title

Modelling phylogeny in 16S rRNA gene sequencing datasets using string kernels

Authors

Jonathan Ish-Horowicz, Sarah Filippi

Abstract

Bacterial community composition is measured using 16S rRNA (ribosomal ribonucleic acid) gene sequencing, for which one of the defining characteristics is the phylogenetic relationships that exist between variables. Here, we demonstrate the utility of modelling these relationships in two statistical tasks (the two-sample test and host trait prediction) by employing string kernels originally proposed in natural language processing. We show via simulation studies that a kernel two-sample test using the proposed kernels, which explicitly model phylogenetic relationships, is powerful while also being sensitive to the phylogenetic scale of the difference between the two populations. We also demonstrate how the proposed kernels can be used with Gaussian processes to improve predictive performance in host trait prediction. Our method is implemented in the Python package StringPhylo (available at github.com/jonathanishhorowicz/stringphylo).
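To make the idea of a kernel two-sample test with a string kernel concrete, the following is a minimal sketch and not the StringPhylo API. It assumes a k-mer spectrum kernel as one example of a string kernel (the abstract does not say which kernels are used), assumes that sample-level similarity is obtained by weighting the sequence kernel with per-taxon abundances, and uses a biased maximum mean discrepancy (MMD) estimate with a permutation null. All function names and the toy inputs are hypothetical.

```python
from collections import Counter

import numpy as np


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seqs, k=3):
    """Spectrum kernel: inner product of k-mer count vectors.

    Assumed here as one example of a string kernel; returns an
    (n_taxa, n_taxa) similarity matrix between the sequences.
    """
    counts = [kmer_counts(s, k) for s in seqs]
    n = len(seqs)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            value = sum(c * counts[j][kmer] for kmer, c in counts[i].items())
            S[i, j] = S[j, i] = value
    return S


def sample_kernel(X, S):
    """Kernel between samples: abundances weighted by sequence similarity.

    X is an (n_samples, n_taxa) abundance matrix. This construction is an
    assumption of the sketch, not taken from the abstract.
    """
    return X @ S @ X.T


def mmd2(K, n_x):
    """Biased squared-MMD estimate; the first n_x rows of K are group 1."""
    k_xx = K[:n_x, :n_x]
    k_yy = K[n_x:, n_x:]
    k_xy = K[:n_x, n_x:]
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


def mmd_permutation_test(K, n_x, n_perm=500, seed=0):
    """Permutation p-value for the MMD statistic under shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = mmd2(K, n_x)
    n = K.shape[0]
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        if mmd2(K[np.ix_(p, p)], n_x) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy sequences and abundances, invented purely for illustration.
    toy_seqs = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
    S = spectrum_kernel(toy_seqs, k=3)
    rng = np.random.default_rng(1)
    group_1 = rng.dirichlet([5.0, 5.0, 1.0], size=20)
    group_2 = rng.dirichlet([1.0, 1.0, 5.0], size=20)
    K = sample_kernel(np.vstack([group_1, group_2]), S)
    stat, p_value = mmd_permutation_test(K, n_x=20)
    print(f"MMD^2 = {stat:.4f}, permutation p-value = {p_value:.4f}")
```

Because the sequence kernel assigns high similarity to taxa with similar 16S sequences, a shift in abundance between closely related taxa changes the sample kernel less than a shift between distant taxa. This is one way a test statistic can depend on the phylogenetic scale of a difference, as the abstract describes.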
