Paper Title

gMeta: Template-based Regular Expression Generation over Noisy Examples

Authors

Shujun Wang, Yongqiang Tian, and Dengcheng He

Abstract

Regular expressions (regexes) are widely used in different fields of computer science, such as programming languages, string processing, and databases. However, existing tools for synthesizing or repairing regexes always assume that the input examples are faultless. In real industrial scenarios, this assumption does not entirely hold. This paper therefore presents a simple but effective template-based approach to generating regular expressions over noisy examples. Specifically, we present a data model (i.e., MetaParam) that extracts features of strings in order to cluster all examples. Then, we propose a practical dynamic thresholding scheme that filters out anomalous examples by detecting knee points on CDF graphs. Finally, we design a template-based algorithm that translates a finite set of positive examples into a regular expression and is efficient, interpretable, and extensible. We performed an experimental evaluation on four different extraction tasks applied to real-world datasets and obtained promising results in terms of F-measure. Moreover, gMeta achieves excellent results in real industrial scenarios.
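
The three stages named in the abstract (feature-based clustering, knee-point filtering, template-based generation) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define MetaParam, the thresholding rule, or the templates, so everything below is an assumption made for illustration. Here the "feature" is simply the sequence of character classes in a string, the knee is taken as the point where the cumulative share of examples rises furthest above the diagonal, and the template is one character class per position with a run-length quantifier. All function names are hypothetical.

```python
import re
import string
from collections import defaultdict
from itertools import groupby


def char_class(ch):
    """Map a character to a coarse regex class (a stand-in feature, not MetaParam)."""
    if ch in string.digits:
        return r"\d"
    if ch in string.ascii_letters:
        return "[A-Za-z]"
    return re.escape(ch)


def runs(s):
    """Collapse a string into (class, run_length) pairs."""
    return [(cls, len(list(grp))) for cls, grp in groupby(s, key=char_class)]


def cluster_by_shape(examples):
    """Group examples whose class sequences match, ignoring run lengths."""
    clusters = defaultdict(list)
    for s in examples:
        clusters[tuple(cls for cls, _ in runs(s))].append(s)
    return sorted(clusters.values(), key=len, reverse=True)


def knee_index(sizes):
    """Index where the cumulative share of examples is furthest above the diagonal."""
    n, total = len(sizes), sum(sizes)
    best, best_gap, cum = 0, None, 0
    for i, size in enumerate(sizes):
        cum += size
        gap = cum * n - (i + 1) * total  # (y - x) scaled by n * total, kept as an integer
        if best_gap is None or gap >= best_gap:  # ties go to the later index
            best, best_gap = i, gap
    return best


def cluster_regex(members):
    """Fill the template: one class per position, quantified by observed run lengths."""
    parts = []
    for column in zip(*(runs(s) for s in members)):
        cls = column[0][0]
        lengths = [length for _, length in column]
        lo, hi = min(lengths), max(lengths)
        if lo == hi == 1:
            parts.append(cls)
        elif lo == hi:
            parts.append(f"{cls}{{{lo}}}")
        else:
            parts.append(f"{cls}{{{lo},{hi}}}")
    return "".join(parts)


def generate_regex(examples):
    """Cluster, keep clusters up to the knee, and join one pattern per kept cluster."""
    clusters = cluster_by_shape(examples)
    kept = clusters[: knee_index([len(c) for c in clusters]) + 1]
    return "|".join(f"(?:{cluster_regex(c)})" for c in kept)


# Four date-like strings plus two noisy entries ("n/a" and "??").
examples = ["2021-03-04", "1999-12-31", "2020-1-5", "n/a", "2022-07-15", "??"]
pattern = generate_regex(examples)
print(pattern)
print(bool(re.fullmatch(pattern, "2023-11-09")))
```

In this toy input the date-shaped cluster holds four of the six examples, so the knee falls after the first cluster and the two singleton clusters are discarded as noise. When all clusters are the same size the tie-break keeps every cluster, which is a deliberate choice in this sketch rather than something taken from the paper.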
