Paper Title
Pareto Optimal Compression of Genomic Dictionaries, with or without Random Access in Main Memory
Paper Authors
Paper Abstract
Motivation: A Genomic Dictionary, i.e., the set of k-mers appearing in a genome, is a fundamental source of genomic information: collecting it is the first step in strategic computational methods ranging from assembly to sequence comparison and phylogeny. Unfortunately, it is costly to store. This has motivated recent studies on the compression of such k-mer sets. However, this area lacks the maturity of genomic compression: it has no homogeneous and methodologically sound experimental foundation that allows a fair comparison of the relative merits of the available solutions and that also takes into account the rich choice of compression methods that can be used.
Results: We provide such a foundation here, supporting it with an extensive set of experiments that use reference datasets and a carefully selected set of representative data compressors. Our results highlight the spectrum of compressor choices available in terms of Pareto Optimality of compression vs. post-processing, the latter being important when the Dictionary needs to be decompressed many times. In addition to useful indications, not available elsewhere, for researchers interested in storing k-mer dictionaries in compressed form, this study provides a software system that can be readily used to explore the Pareto Optimal solutions available for a given Dictionary.
Availability: The software system is available at https://github.com/GenGrim76/Pareto-Optimal-GDC, together with user manuals and installation instructions.
Contact: [email protected]
Supplementary information: Additional data are available in the Supplementary Material.
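To make the two central notions of the abstract concrete, the following is a minimal Python sketch, not the authors' software system. It builds a Genomic Dictionary as the set of k-mers of a toy sequence, then selects the Pareto Optimal candidates among a list of (compressed size, decompression time) measurements. The names `kmer_dictionary`, `Candidate`, and `pareto_front`, the compressor labels, and all numbers are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Set


class Candidate(NamedTuple):
    """One hypothetical compressor measurement on a given dictionary."""
    name: str
    size_bytes: int            # compressed size (smaller is better)
    decompress_seconds: float  # post-processing time (smaller is better)


def kmer_dictionary(genome: str, k: int) -> Set[str]:
    """Return the set of distinct k-mers appearing in `genome`."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both criteria and better on one."""
    no_worse = (a.size_bytes <= b.size_bytes
                and a.decompress_seconds <= b.decompress_seconds)
    strictly_better = (a.size_bytes < b.size_bytes
                       or a.decompress_seconds < b.decompress_seconds)
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    front = [c for c in candidates
             if not any(dominates(o, c) for o in candidates)]
    return sorted(front, key=lambda c: c.size_bytes)


# Made-up measurements, for illustration only (not results from the paper).
measurements = [
    Candidate("compressor_A", 100, 5.0),  # smallest output, slowest to decode
    Candidate("compressor_B", 150, 2.0),
    Candidate("compressor_C", 160, 3.0),  # dominated by compressor_B
    Candidate("compressor_D", 300, 1.0),  # largest output, fastest to decode
]

toy_dictionary = kmer_dictionary("ACGTACGT", 3)
front_names = [c.name for c in pareto_front(measurements)]
print(sorted(toy_dictionary))
print(front_names)
```

In this sketch a candidate is discarded only when another one is at least as good on both criteria and strictly better on one, so every remaining candidate represents a different trade-off between storage cost and decompression time.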