Title


Using Synthetic Data for Conversational Response Generation in Low-resource Settings

Authors

Gabriel Louis Tan, Adrian Paule Ty, Schuyler Ng, Denzel Adrian Co, Jan Christian Blaise Cruz, Charibeth Cheng

Abstract


Response generation is a task in natural language processing (NLP) where a model is trained to respond to human statements. Conversational response generators take this one step further with the ability to respond within the context of previous responses. While there are existing techniques for training such models, they all require an abundance of conversational data, which is not always available for low-resource languages. In this research, we make three contributions. First, we release the first Filipino conversational dataset collected from a popular Philippine online forum, which we name the PEx Conversations Dataset. Second, we introduce a data augmentation (DA) methodology for Filipino data that employs a Tagalog RoBERTa model to increase the size of the existing corpora. Lastly, we publish the first Filipino conversational response generator capable of generating responses related to the previous 3 responses. With the supplementary synthetic data, we were able to improve the performance of the response generator by up to 12.2% in BERTScore, 10.7% in perplexity, and 11.7% in content word usage as compared to training with zero synthetic data.
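The abstract describes augmenting a corpus with a masked language model: mask some tokens in a real sentence and let a Tagalog RoBERTa propose replacements, producing synthetic variants. Below is a minimal, hedged sketch of that idea; the paper's actual procedure, masking rate, and model are not specified here, and `predict_mask` is a stand-in stub for a real fill-mask call (e.g. a HuggingFace `fill-mask` pipeline over a Tagalog RoBERTa checkpoint), so the sketch runs without downloading a model.

```python
import random

def predict_mask(tokens, idx):
    """Stand-in for a real MLM fill-mask call on a Tagalog RoBERTa.
    A real implementation would mask tokens[idx] and return the model's
    top-scoring replacement; here we return the original token so the
    sketch runs offline."""
    return tokens[idx]

def augment(sentence, mask_rate=0.15, seed=0):
    """Produce one synthetic variant of `sentence` by replacing a random
    subset of tokens with MLM predictions (mask_rate is an assumed value,
    not taken from the paper)."""
    rng = random.Random(seed)
    tokens = sentence.split()
    out = list(tokens)
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            out[i] = predict_mask(tokens, i)  # MLM fills the masked slot
    return " ".join(out)

# Each call with a different seed yields another candidate synthetic sentence.
print(augment("kumusta ka na ngayon"))
```

In practice one would generate several variants per source sentence and add them to the training corpus alongside the original conversations.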
