Paper title
Augmenting Multi-Turn Text-to-SQL Datasets with Self-Play
Authors
Abstract
The task of context-dependent text-to-SQL aims to convert multi-turn user utterances to formal SQL queries. This is a challenging task because training data is scarce, which makes it hard both to learn complex contextual dependencies and to generalize to unseen databases. In this paper, we explore augmenting the training datasets using self-play, which leverages contextual information to synthesize new interactions and adapt the model to new databases. We first design a SQL-to-text model conditioned on a sampled goal query, which represents a user's intent; this model then converses with a text-to-SQL semantic parser to generate new interactions. We then filter the synthesized interactions and retrain the models with the augmented data. We find that self-play improves the accuracy of a strong baseline on SParC and CoSQL, two widely used cross-domain text-to-SQL datasets. Our analysis shows that self-play simulates various conversational thematic relations, enhances cross-domain generalization, and improves beam search.
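The self-play loop described in the abstract can be pictured with a minimal sketch. It is illustrative only and not the authors' implementation: the names `self_play`, `keep`, `augment`, `toy_user_model`, and `toy_parser` are hypothetical. The stopping rule (stop once the parser's prediction matches the goal query) and the filter (keep an interaction only if its final prediction equals the goal, by exact string match) are assumptions, since the abstract does not specify either. The two models are stand-in functions, not trained networks.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (user utterance, SQL predicted by the parser)


@dataclass
class Interaction:
    goal_sql: str                       # sampled goal query (the user's intent)
    turns: List[Turn] = field(default_factory=list)


def self_play(goal_sql: str,
              user_model: Callable[[str, List[Turn]], str],
              parser: Callable[[str, List[Turn]], str],
              max_turns: int = 4) -> Interaction:
    """Let a SQL-to-text user model converse with a text-to-SQL parser."""
    interaction = Interaction(goal_sql)
    for _ in range(max_turns):
        # The user model sees the goal query and the dialogue so far.
        utterance = user_model(goal_sql, interaction.turns)
        # The parser sees only the utterance and the dialogue so far.
        predicted_sql = parser(utterance, interaction.turns)
        interaction.turns.append((utterance, predicted_sql))
        if predicted_sql == goal_sql:   # assumed stopping rule
            break
    return interaction


def keep(interaction: Interaction) -> bool:
    """Assumed filter: the last predicted query must equal the goal query."""
    return bool(interaction.turns) and interaction.turns[-1][1] == interaction.goal_sql


def augment(goal_queries: List[str], user_model, parser,
            num_samples: int, seed: int = 0) -> List[Interaction]:
    """Sample goal queries, run self-play, and keep the filtered interactions."""
    rng = random.Random(seed)
    synthesized = [self_play(rng.choice(goal_queries), user_model, parser)
                   for _ in range(num_samples)]
    return [x for x in synthesized if keep(x)]


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_user_model(goal_sql: str, history: List[Turn]) -> str:
    # First turn asks for a simpler query; later turns ask for the goal.
    if not history:
        return "show: SELECT name FROM singer"
    return "show: " + goal_sql


def toy_parser(utterance: str, history: List[Turn]) -> str:
    # Strips the "show: " prefix to recover the SQL text.
    return utterance.split(": ", 1)[1]
```

In a real setting, the interactions returned by `augment` would be added to the training set before the parser is retrained, as the abstract describes.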