Paper Title
Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation with GPT2
Paper Authors
Paper Abstract
The semantics of a text is manifested not only by what is read, but also by what is not read. In this paper, we study how implicit "not read" information, such as end-of-paragraph (\eop) and end-of-sequence (\eos) tokens, affects the quality of text generation. Specifically, we find that the pre-trained language model GPT2 can generate better continuations by learning to generate \eop in the fine-tuning stage. Experimental results on English story generation show that \eop can lead to higher BLEU scores and lower \eos perplexity. We also conduct experiments on a self-collected Chinese essay dataset with Chinese-GPT2, a character-level LM trained without \eop or \eos during pre-training. Experimental results show that Chinese-GPT2 can generate better essay endings with \eop.
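To make the fine-tuning setup concrete, the preprocessing the abstract describes amounts to inserting explicit \eop and \eos markers into the raw training text. A minimal sketch, assuming paragraphs are separated by blank lines; the literal token strings `<eop>` and `<eos>` and the helper name are illustrative, not from the paper:

```python
# Hypothetical preprocessing sketch: append an end-of-paragraph marker to each
# paragraph and an end-of-sequence marker to the whole document, so the LM can
# learn to generate these structural tokens during fine-tuning.

EOP = "<eop>"  # assumed surface form of the end-of-paragraph token
EOS = "<eos>"  # assumed surface form of the end-of-sequence token

def add_structure_tokens(text: str) -> str:
    """Mark each blank-line-separated paragraph with EOP and the sequence with EOS."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    marked = [p + " " + EOP for p in paragraphs]
    return " ".join(marked) + " " + EOS

story = "Once upon a time.\n\nThe end."
print(add_structure_tokens(story))
# Once upon a time. <eop> The end. <eop> <eos>
```

In practice these marker strings would be registered as special tokens in the model's tokenizer before fine-tuning, so each maps to a single vocabulary entry rather than being split into subwords.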