Paper Title
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries
Paper Authors
Paper Abstract
Practical applications of abstractive summarization models are limited by frequent factual inconsistencies with respect to their input. Existing automatic evaluation metrics for summarization are largely insensitive to such errors. We propose an automatic evaluation protocol called QAGS (pronounced "kags") that is designed to identify factual inconsistencies in a generated summary. QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source. To evaluate QAGS, we collect human judgments of factual consistency on model-generated summaries for the CNN/DailyMail (Hermann et al., 2015) and XSUM (Narayan et al., 2018) summarization datasets. QAGS has substantially higher correlations with these judgments than other automatic evaluation metrics. In addition, QAGS offers a natural form of interpretability: the answers and questions generated while computing QAGS indicate which tokens of a summary are inconsistent and why. We believe QAGS is a promising tool for automatically generating usable and factually consistent text.
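To make the protocol described in the abstract concrete, the following is a minimal sketch of how a QAGS-style score could be computed, assuming a question-generation model and a question-answering model are available as plain callables. The names generate_questions, answer, token_f1, and qags_score are illustrative placeholders, not the authors' released code; the answer comparison here uses SQuAD-style token-level F1, one straightforward choice for measuring answer overlap.

```python
from collections import Counter
from typing import Callable, List


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 overlap between two answer strings (SQuAD-style)."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def qags_score(
    summary: str,
    source: str,
    generate_questions: Callable[[str], List[str]],  # hypothetical QG model: summary -> questions
    answer: Callable[[str, str], str],               # hypothetical QA model: (question, context) -> answer
) -> float:
    """Ask questions about the summary, answer them against both the
    summary and the source, and average the answer overlap."""
    questions = generate_questions(summary)
    if not questions:
        return 0.0
    scores = [
        token_f1(answer(q, summary), answer(q, source))
        for q in questions
    ]
    return sum(scores) / len(scores)
```

In this sketch, a low per-question overlap points to the specific summary span whose answer disagrees with the source, which is the interpretability property the abstract mentions.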