SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset

Title

SeSQL: Yet Another Large-scale Session-level Chinese Text-to-SQL Dataset

Authors

Saihao Huang, Lijie Wang, Zhenghua Li, Zeyang Liu, Chenhui Dou, Fukang Yan, Xinyan Xiao, Hua Wu, Min Zhang

Abstract

As the first session-level Chinese dataset, CHASE contains two separate parts, i.e., 2,003 sessions manually constructed from scratch (CHASE-C), and 3,456 sessions translated from English SParC (CHASE-T). We find the two parts are highly discrepant and incompatible as training and evaluation data. In this work, we present SeSQL, yet another large-scale session-level text-to-SQL dataset in Chinese, consisting of 5,028 sessions all manually constructed from scratch. In order to guarantee data quality, we adopt an iterative annotation workflow to facilitate intense and in-time review of previous-round natural language (NL) questions and SQL queries. Moreover, by completing all context-dependent NL questions, we obtain 27,012 context-independent question/SQL pairs, allowing SeSQL to be used as the largest dataset for single-round multi-DB text-to-SQL parsing. We conduct benchmark session-level text-to-SQL parsing experiments on SeSQL by employing three competitive session-level parsers, and present detailed analysis.
