Paper Title

Benchmarking Multimodal Regex Synthesis with Complex Structures

Authors

Xi Ye, Qiaochu Chen, Isil Dillig, Greg Durrett

Abstract

Existing datasets for regular expression (regex) generation from natural language are limited in complexity; compared to regex tasks that users post on StackOverflow, the regexes in these datasets are simple, and the language used to describe them is not diverse. We introduce StructuredRegex, a new regex synthesis dataset differing from prior ones in three aspects. First, to obtain structurally complex and realistic regexes, we generate the regexes using a probabilistic grammar with pre-defined macros observed from real-world StackOverflow posts. Second, to obtain linguistically diverse natural language descriptions, we show crowdworkers abstract depictions of the underlying regex and ask them to describe the pattern they see, rather than having them paraphrase synthetic language. Third, we augment each regex example with a collection of strings that are and are not matched by the ground truth regex, similar to how real users give examples. Our quantitative and qualitative analysis demonstrates the advantages of StructuredRegex over prior datasets. Further experimental results using various multimodal synthesis techniques highlight the challenge presented by our dataset, including non-local constraints and multi-modal inputs.
