Paper Title
Finer Metagenomic Reconstruction via Biodiversity Optimization
Paper Authors
Paper Abstract
When analyzing communities of microorganisms from their sequenced DNA, an important task is taxonomic profiling: enumerating the presence and relative abundance of all organisms, or merely of all taxa, contained in the sample. This task can be tackled via compressive-sensing-based approaches, which favor communities featuring the fewest organisms among those consistent with the observed DNA data. Despite their successes, these parsimonious approaches sometimes conflict with biological realism by overlooking organism similarities. Here, we leverage a recently developed notion of biological diversity that simultaneously accounts for organism similarities and retains the optimization strategy underlying compressive-sensing-based approaches. We demonstrate that minimizing biological diversity still produces sparse taxonomic profiles and we experimentally validate superiority to existing compressive-sensing-based approaches. Despite showing that the objective function is almost never convex and often concave, generally yielding NP-hard problems, we exhibit ways of representing organism similarities for which minimizing diversity can be performed via a sequence of linear programs guaranteed to decrease diversity. Better yet, when biological similarity is quantified by $k$-mer co-occurrence (a popular notion in bioinformatics), minimizing diversity actually reduces to one linear program that can utilize multiple $k$-mer sizes to enhance performance. In proof-of-concept experiments, we verify that the latter procedure can lead to significant gains when taxonomically profiling a metagenomic sample, both in terms of reconstruction accuracy and computational performance. Reproducible code is available at https://github.com/dkoslicki/MinimizeBiologicalDiversity.
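The abstract does not state the diversity formula, so the following is only a minimal sketch under an assumption: that the "recently developed notion of biological diversity" is a similarity-sensitive diversity of order $q$ of the form $D_q^Z(p) = \big(\sum_i p_i (Zp)_i^{q-1}\big)^{1/(1-q)}$, where $p$ is the relative-abundance vector and $Z$ is a pairwise organism-similarity matrix with unit diagonal. The function name `diversity` and the example matrices are hypothetical and are not taken from the linked repository. The sketch illustrates why ignoring similarities can be unrealistic: with no similarity information, order-0 diversity simply counts the organisms present (the sparsity that parsimonious methods minimize), whereas two nearly identical organisms contribute little more than one.

```python
import math


def diversity(p, Z, q=0.0):
    """Similarity-sensitive diversity of order q (assumed form, see lead-in).

    p : list of relative abundances (non-negative, summing to 1)
    Z : square similarity matrix as nested lists, with Z[i][i] == 1
    """
    n = len(p)
    # (Zp)_i: average similarity of organism i to the rest of the community
    Zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    # Only organisms that are actually present contribute
    support = [i for i in range(n) if p[i] > 0]
    if abs(q - 1.0) < 1e-12:
        # Limiting case q -> 1
        return math.exp(-sum(p[i] * math.log(Zp[i]) for i in support))
    total = sum(p[i] * Zp[i] ** (q - 1.0) for i in support)
    return total ** (1.0 / (1.0 - q))


# Hypothetical toy inputs, for illustration only
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
near_clones = [[1.0, 0.9],
               [0.9, 1.0]]

if __name__ == "__main__":
    # No similarity information: order-0 diversity counts the organisms present
    print(diversity([0.5, 0.5, 0.0], identity, q=0.0))
    # Two highly similar organisms: diversity stays close to 1
    print(diversity([0.5, 0.5], near_clones, q=0.0))
```

With the identity matrix the first call yields 2, the number of organisms with non-zero abundance; with the 0.9-similar pair the second call yields roughly 1.05, reflecting that the two organisms are almost interchangeable.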