论文标题

工程更快的双阵列Aho-Corasick Automata

Engineering faster double-array Aho-Corasick automata

论文作者

Kanda, Shunsuke, Akabe, Koichi, Oda, Yusuke

论文摘要

字符串中的多种模式匹配是文本处理应用程序(例如正则表达式或令牌化)中的一个基本问题。本文研究了双阵列AHO-Corasick Automata(DAACS)的有效实现,即快速执行多种模式匹配的数据结构。通过仔细设计数据结构,可以改善DAAC的实际性能,到目前为止,已经提出了许多实施技术。 DAACS中的一个问题是他们的想法没有汇总。由于无法进行全面的描述和实验分析,因此工程师在实施有效的DAAC方面面临困难。 在本文中,我们审查了DAACS的实施技术,并对它们进行了全面描述。我们还提出了几种新技术,以进一步改进。我们通过现实世界数据集进行了详尽的实验,并揭示了在DAAC中实现更高性能的最佳技术组合。最好的组合与最受欢迎的DAAC库中使用的组合不同,这表明它们的性能可以进一步提高。根据我们的实验分析,我们开发了一个新的Rust库,用于使用DAACS(名为Daachorse)的快速多种模式匹配,在https://github.com/daac-tools/daachorse上为开放源软件。实验表明,Daachorse的表现优于其他Ac-Automaton实现,这表明它是许多应用中多种模式匹配的快速替代性。

Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions or tokenization. This paper studies efficient implementations of double-array Aho-Corasick automata (DAACs), data structures for quickly performing the multiple pattern matching. The practical performance of DAACs is improved by carefully designing the data structure, and many implementation techniques have been proposed thus far. A problem in DAACs is that their ideas are not aggregated. Since comprehensive descriptions and experimental analyses are unavailable, engineers face difficulties in implementing an efficient DAAC. In this paper, we review implementation techniques for DAACs and provide a comprehensive description of them. We also propose several new techniques for further improvement. We conduct exhaustive experiments through real-world datasets and reveal the best combination of techniques to achieve a higher performance in DAACs. The best combination is different from those used in the most popular libraries of DAACs, which demonstrates that their performance can be further enhanced. On the basis of our experimental analysis, we developed a new Rust library for fast multiple pattern matching using DAACs, named Daachorse, as open-source software at https://github.com/daac-tools/daachorse. Experiments demonstrate that Daachorse outperforms other AC-automaton implementations, indicating its suitability as a fast alternative for multiple pattern matching in many applications.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源