Paper Title
Self-Repetition in Abstractive Neural Summarizers
Paper Authors
Paper Abstract
We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5, and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for different inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language, is associated with a higher rate of self-repetition. In a qualitative analysis, we find that systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus-level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.
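As a concrete illustration of the metric, the sketch below counts n-grams of length four or longer that recur across a system's output summaries. This is a minimal Python sketch of the measure as described in the abstract, not the authors' code: whitespace tokenization, the cap max_n on n-gram length (the paper specifies only "four or longer"), and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import chain

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def self_repetition_count(summaries, min_n=4, max_n=10):
    """Count distinct n-grams (min_n <= n <= max_n) that appear in
    more than one output summary of the same system.

    `summaries` is a list of output strings from one system.
    The upper bound `max_n` is an assumption for tractability;
    the paper's definition is simply "length four or longer".
    """
    # Collect the *set* of n-grams per summary, so that repetition
    # within a single summary does not count as self-repetition.
    doc_freq = Counter()
    for text in summaries:
        tokens = text.split()
        grams = set(chain.from_iterable(
            ngrams(tokens, n) for n in range(min_n, max_n + 1)))
        doc_freq.update(grams)
    # Self-repeated n-grams are those shared by two or more outputs.
    return sum(1 for count in doc_freq.values() if count >= 2)

if __name__ == "__main__":
    outputs = [
        "click here to read the full story on our website today",
        "the senate passed the bill click here to read the full story",
        "officials said the vote was close",
    ]
    # The formulaic phrase shared by the first two outputs contributes
    # every 4-gram through 7-gram it contains.
    print(self_repetition_count(outputs))
```

Under this formulation, boilerplate such as the hypothetical "click here to read the full story" above surfaces immediately, which matches the paper's qualitative finding that repeated artefacts include ads, disclaimers, and domain-typical formulaic phrases.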