Paper Title

Neural Data-to-Text Generation Based on Small Datasets: Comparing the Added Value of Two Semi-Supervised Learning Approaches on Top of a Large Language Model

Paper Authors

Chris van der Lee, Thiago Castro Ferreira, Chris Emmery, Travis Wiltshire, Emiel Krahmer

Paper Abstract

This study discusses the effect of semi-supervised learning in combination with pretrained language models for data-to-text generation. It is not known whether semi-supervised learning is still helpful when a large-scale language model is also supplemented. This study aims to answer this question by comparing a data-to-text system supplemented only with a language model against two data-to-text systems that are additionally enriched by a data augmentation or a pseudo-labeling semi-supervised learning approach. Results show that semi-supervised learning yields higher scores on diversity metrics. In terms of output quality, extending the training set of a data-to-text system with a language model via the pseudo-labeling approach did increase text quality scores, but the data augmentation approach yielded scores similar to those of the system without training set extension. These results indicate that semi-supervised learning approaches can bolster output quality and diversity, even when a language model is also present.
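To make the pseudo-labeling approach mentioned in the abstract concrete: the core idea is to take a model trained on the small labeled set, use it to generate "silver" target texts for unlabeled data records, and append the resulting pairs to the training set. The sketch below is a minimal illustration of that idea only, not the authors' pipeline; the T5 model, the `pseudo_label` function name, and the linearized record format are assumptions for illustration.

```python
# Minimal pseudo-labeling sketch for data-to-text generation (illustrative
# only; not the paper's implementation). Assumes a Hugging Face seq2seq model.
from transformers import T5ForConditionalGeneration, T5Tokenizer


def pseudo_label(model, tokenizer, unlabeled_records, max_length=64):
    """Use a trained data-to-text model to generate 'silver' target texts
    for unlabeled data records, which can then extend the training set."""
    silver_pairs = []
    for record in unlabeled_records:
        inputs = tokenizer(record, return_tensors="pt")
        output_ids = model.generate(inputs.input_ids, max_length=max_length)
        text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        silver_pairs.append({"data": record, "text": text})
    return silver_pairs


if __name__ == "__main__":
    # Placeholder for a model already fine-tuned on the small labeled set.
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Linearized data records without reference texts (toy example).
    unlabeled = ["name: The Eagle | food: Italian | area: city centre"]

    # The resulting (data, silver text) pairs would be appended to the
    # original training set before continuing training.
    print(pseudo_label(model, tokenizer, unlabeled))
```

The data augmentation approach compared in the paper differs in that it enriches the existing labeled pairs rather than labeling new inputs; the contrast above is only meant to situate the pseudo-labeling step.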
