Paper Title

Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly

Authors

Giulia Guidi, Gabriel Raulet, Daniel Rokhsar, Leonid Oliker, Katherine Yelick, Aydin Buluc

Abstract

De novo genome assembly, i.e., reconstructing the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is increasing computational demand and calls for scalable, high-performance approaches. In this work, we present a novel distributed-memory algorithm that, starting from a string graph representation of the genome and using sparse matrices, generates the contig set, i.e., overlapping sequences that form a map representing a region of a chromosome. Using a matrix abstraction, we mask branches in the string graph and compute connected components to group genomic sequences that belong to the same linear chain (i.e., contig). Then, we perform multiway number partitioning to minimize load imbalance in local assembly, i.e., the concatenation of sequences from a given contig. Based on the assignment obtained by partitioning, we compute the induced subgraph function to redistribute sequences among processes, resulting in a set of local sparse matrices. Finally, we traverse each matrix using depth-first search to concatenate sequences. Our algorithm shows good scaling, with parallel efficiency up to 80% on 128 nodes, while yielding uniform genome coverage and promising results in terms of assembly quality. Our contig generation algorithm localizes the assembly process to significantly reduce the amount of computation spent on this step. Our work is a step forward for efficient de novo long-read assembly of large genomes in distributed memory.
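The four stages named in the abstract (branch masking, connected components, multiway number partitioning, depth-first traversal) can be illustrated with a small serial sketch. This is not the authors' distributed sparse-matrix implementation. It assumes an undirected adjacency-set graph in place of a sparse matrix, treats any vertex with more than two neighbours as a branch, uses the number of reads per contig as the partitioning weight, and substitutes a greedy longest-processing-time heuristic for the partitioner. All function names are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency-set graph from (read, read) overlap pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def mask_branches(adj):
    """Assumption: a branch is any vertex with more than two neighbours.
    Removing those vertices leaves only linear chains (and isolated reads)."""
    keep = {v for v, nbrs in adj.items() if len(nbrs) <= 2}
    return {v: {u for u in adj[v] if u in keep} for v in keep}


def connected_components(adj):
    """Group reads into components; each component is one candidate contig."""
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        comps.append(sorted(comp))
    return comps


def partition_lpt(comps, nprocs):
    """Greedy heuristic for multiway number partitioning: assign the largest
    remaining contig to the currently least-loaded process."""
    loads = [0] * nprocs
    owned = [[] for _ in range(nprocs)]
    for comp in sorted(comps, key=len, reverse=True):
        p = loads.index(min(loads))
        owned[p].append(comp)
        loads[p] += len(comp)
    return owned, loads


def walk_chain(adj, comp):
    """Depth-first traversal from a chain endpoint, giving the read order
    in which sequences would be concatenated."""
    ends = [v for v in comp if len(adj[v]) <= 1]
    start = ends[0] if ends else comp[0]
    seen, stack, order = {start}, [start], []
    while stack:
        v = stack.pop()
        order.append(v)
        for u in sorted(adj[v]):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return order


# Toy string graph: read 3 overlaps reads 2, 4 and 5, so it is a branch point.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (3, 5), (5, 6)]
chains = mask_branches(build_graph(edges))
contigs = connected_components(chains)
owned, loads = partition_lpt(contigs, nprocs=2)
orders = [[walk_chain(chains, c) for c in mine] for mine in owned]
```

In the toy graph, masking read 3 splits the graph into three linear pieces, which the partitioner then spreads over two hypothetical processes. In a distributed setting the per-process lists in `owned` would correspond to the local subgraphs obtained after redistribution.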
