论文标题
用于系统发育分析的有效算法库
Library of efficient algorithms for phylogenetic analysis
论文作者
论文摘要
通常通过系统发育分析来推断物种之间的进化关系,该分析提供了由测序序列的特定区域构建的等位基因剖面计算出的系统发育树,并将其抽象为分类索引。随着人们和商品的越来越多的交流,流行病变得越来越重要,结合特定国家数据集的信息现在可以揭示未知的传播模式。系统发育分析工作流程主要由四个连续的步骤组成:距离计算,距离校正,推理算法和局部优化步骤。那里有许多系统发育工具,但是大多数仅实施其中的一些步骤,并且仅具有一个单一的目的,例如一种类型的算法。这些问题是它们通常很难使用和集成,因为它们每个都有自己的API。该项目导致了一个库,该库实现了工作流的四个步骤,并且具有高度性能,可扩展,可重复使用和便携,同时提供了常见的API和文档。它也与其他工具不同,从某种意义上说,它能够在用户想要的时候停止并恢复工作流程,并且它是不断扩展的,而不仅仅是一个单一的目的。该库上进行的时间基准表明,其算法的实现符合其理论时间的复杂性。同时,内存基准显示了NJ算法的实现遵循线性内存复杂性,而MST和GCP算法的实现遵循二次内存复杂性。
Evolutionary relationships between species are usually inferred through phylogenetic analysis, which provides phylogenetic trees computed from allelic profiles built by sequencing specific regions of the sequences and abstracting them to categorical indexes. With growing exchanges of people and merchandise, epidemics have become increasingly important, and combining information of country-specific datasets can now reveal unknown spreading patterns. The phylogenetic analysis workflow is mainly composed of four consecutive steps, the distance calculation, distance correction, inference algorithm, and local optimization steps. There are many phylogenetic tools out there, however most implement only some of these steps and serve only one single purpose, such as one type of algorithm. Another problem with these is that they are often hard to use and integrate, since each of them has its own API. This project resulted a library that implements the four steps of the workflow, and is highly performant, extensible, reusable, and portable, while providing common APIs and documentation. It also differs from other tools in the sense that, it is able to stop and resume the workflow whenever the user desires, and it was built to be continuously extended and not just serve a single purpose. The time benchmarks conducted on this library show that its implementations of the algorithms conform to their theoretical time complexity. Meanwhile, the memory benchmarks showcase that the implementations of the NJ algorithms follow a linear memory complexity, while the implementations of the MST and GCP algorithms follow a quadratic memory complexity.