Paper Title

SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset

Authors

Saihao Huang, Lijie Wang, Zhenghua Li, Zeyang Liu, Chenhui Dou, Fukang Yan, Xinyan Xiao, Hua Wu, Min Zhang

Abstract

As the first session-level Chinese dataset, CHASE contains two separate parts, i.e., 2,003 sessions manually constructed from scratch (CHASE-C), and 3,456 sessions translated from English SParC (CHASE-T). We find the two parts are highly discrepant and incompatible as training and evaluation data. In this work, we present SeSQL, yet another large-scale session-level text-to-SQL dataset in Chinese, consisting of 5,028 sessions all manually constructed from scratch. In order to guarantee data quality, we adopt an iterative annotation workflow to facilitate intense and timely review of previous-round natural language (NL) questions and SQL queries. Moreover, by completing all context-dependent NL questions, we obtain 27,012 context-independent question/SQL pairs, allowing SeSQL to be used as the largest dataset for single-round multi-DB text-to-SQL parsing. We conduct benchmark session-level text-to-SQL parsing experiments on SeSQL by employing three competitive session-level parsers, and present detailed analysis.
