Paper Title

Predictions For Pre-training Language Models

Paper Authors

Guo, Tong

Paper Abstract

Language model pre-training has proven to be useful in many language understanding tasks. In this paper, we investigate whether it is still helpful to add a self-training method to the pre-training step and the fine-tuning step. Towards this goal, we propose a learning framework that makes the best use of unlabeled data on both low-resource and high-resource labeled datasets. In industry NLP applications, we have large amounts of data produced by users or customers, and our learning framework is built on this large amount of unlabeled data. First, we use the model fine-tuned on the manually labeled dataset to predict pseudo labels for the user-generated unlabeled data. Then we use the pseudo labels to supervise task-specific training on the large amount of user-generated data. We treat this task-specific training step on pseudo labels as a pre-training step for the next fine-tuning step. Finally, we fine-tune on the manually labeled dataset, starting from this pre-trained model. In this work, we first show empirically that our method solidly improves performance by 3.6% when the manually labeled fine-tuning dataset is relatively small. We then show that our method still improves performance by a further 0.2% when the manually labeled fine-tuning dataset is relatively large. We argue that our method makes the best use of the unlabeled data and is superior to either pre-training or self-training alone.
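The three-step procedure described in the abstract (fine-tune on labeled data, pre-train on pseudo labels, fine-tune again) can be illustrated in code. The following is a minimal sketch under stated assumptions: a placeholder MLP classifier stands in for a pre-trained language model, random toy tensors stand in for the paper's datasets, and the `train` helper and all hyperparameters are hypothetical, not the authors' actual setup.

```python
# Minimal sketch of the framework: (1) fine-tune on the manually labeled set,
# (2) predict pseudo labels for the large unlabeled set and train on them as a
# pre-training step, (3) fine-tune again on the manually labeled set.
# Model, data, and hyperparameters are placeholders for illustration only.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train(model, loader, epochs=1, lr=2e-5):
    # Generic supervised training loop used for every step.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

# Placeholder encoder standing in for a pre-trained language model.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2))

# Toy data: a small manually labeled set and a large unlabeled set.
labeled_x, labeled_y = torch.randn(64, 128), torch.randint(0, 2, (64,))
unlabeled_x = torch.randn(1024, 128)

# Step 1: fine-tune on the manually labeled dataset.
train(model, DataLoader(TensorDataset(labeled_x, labeled_y), batch_size=16))

# Step 2: predict pseudo labels for the unlabeled data, then use them to
# supervise task-specific training, treated as a pre-training step.
model.eval()
with torch.no_grad():
    pseudo_y = model(unlabeled_x).argmax(dim=-1)
train(model, DataLoader(TensorDataset(unlabeled_x, pseudo_y), batch_size=16))

# Step 3: fine-tune again on the manually labeled dataset.
train(model, DataLoader(TensorDataset(labeled_x, labeled_y), batch_size=16))
```

In practice the placeholder classifier would be a pre-trained language model fine-tuned for the specific task, and the pseudo-labeled training in step 2 would run over the full user-generated corpus before the final fine-tuning pass.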
