Paper Title
How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics
Paper Authors
Paper Abstract
Though generative dialogue modeling is widely seen as a language modeling task, the task demands that an agent have a complex natural language understanding of its input text to carry out a meaningful interaction with a user. The automatic metrics in use evaluate the quality of the generated text as a proxy for the agent's holistic interaction. Such metrics have previously been shown not to correlate with human judgement. In this work, we observe that human evaluation of dialogue agents can be inconclusive due to the lack of sufficient information for appropriate evaluation. The automatic metrics are deterministic yet shallow, and human evaluation can be relevant yet inconclusive. To bridge this gap in evaluation, we propose designing a set of probing tasks to evaluate dialogue models. The hand-crafted tasks are aimed at quantitatively evaluating a generative dialogue model's understanding, beyond token-level evaluation of the generated text. The probing tasks are deterministic like automatic metrics and require human judgement in their design, benefiting from the best of both worlds. With experiments on probe tasks, we observe that, unlike RNN-based architectures, the Transformer model may not be learning to comprehend the input text, despite its generated text having higher overlap with the target text.
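To make the contrast concrete, the sketch below shows the general shape of a probe-task evaluation: a simple linear probe is trained on top of a frozen dialogue model's representations, and the probe's accuracy, rather than token overlap with a reference response, serves as the evaluation signal. This is a minimal, hypothetical sketch; the encoder stub, the probe labels, and the toy data are illustrative stand-ins, not the paper's actual models or tasks.

```python
# Minimal sketch of a probe-task evaluation on frozen representations.
# encode_utterance is a placeholder for a trained dialogue model's encoder;
# here a fixed random projection of character counts stands in so the
# example runs end to end.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((128, 32))  # fixed projection, used in place of a real encoder

def encode_utterance(text: str) -> np.ndarray:
    # Stand-in for the frozen dialogue model's utterance representation.
    counts = np.zeros(128)
    for ch in text.lower():
        counts[ord(ch) % 128] += 1
    return counts @ PROJ

# Toy probe task: predict a dialogue-level attribute (e.g. which persona is
# speaking) from the representation alone. The probe's held-out accuracy is
# the deterministic evaluation score.
utterances = ["i love hiking and my two dogs",
              "i work as a nurse at night",
              "my dogs come hiking with me",
              "night shifts at the hospital are long"] * 25
labels = [0, 1, 0, 1] * 25

X = np.stack([encode_utterance(u) for u in utterances])
y = np.array(labels)

split = 80  # simple train/test split for the toy data
probe = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
print("probe-task accuracy:", accuracy_score(y[split:], probe.predict(X[split:])))
```

In practice the representations would come from the generative dialogue model under evaluation (RNN- or Transformer-based), and the same probe protocol applied to each model makes their scores directly comparable, which token-level overlap metrics do not provide.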