Towards Fair Evaluation of Dialogue State Tracking by Flexible Incorporation of Turn-level Performances

Paper Title

Towards Fair Evaluation of Dialogue State Tracking by Flexible Incorporation of Turn-level Performances

Authors

Suvodip Dey, Ramamohan Kummara, Maunendra Sankar Desarkar

Abstract

Dialogue State Tracking (DST) is primarily evaluated using Joint Goal Accuracy (JGA), defined as the fraction of turns where the ground-truth dialogue state exactly matches the prediction. In DST, the dialogue state, or belief state, for a given turn generally contains all the intents expressed by the user up to that turn. Because the belief state is cumulative, it is difficult to obtain a correct prediction once a misprediction has occurred. Thus, although JGA is a useful metric, it can be harsh at times and underestimate the true potential of a DST model. Moreover, an improvement in JGA can sometimes decrease the performance of turn-level (non-cumulative) belief state prediction due to inconsistencies in the annotations. Using JGA as the only metric for model selection may therefore not be ideal in all scenarios. In this work, we discuss various evaluation metrics used for DST along with their shortcomings. To address the existing issues, we propose a new evaluation metric named Flexible Goal Accuracy (FGA). FGA is a generalized version of JGA. Unlike JGA, however, it gives penalized rewards to mispredictions that are locally correct, i.e., mispredictions whose root cause lies in an earlier turn. By doing so, FGA flexibly considers the performance of both cumulative and turn-level prediction and provides better insight than the existing metrics. We also show that FGA is a better discriminator of DST model performance.
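To make the abstract's contrast between cumulative and turn-level evaluation concrete, the following is a minimal sketch, not the authors' implementation. It computes JGA as defined above, plus a simplified turn-level exact-match score. The names `joint_goal_accuracy`, `turn_updates`, and `turn_level_accuracy`, the slot names, and the rule for deriving a turn's update (slots added or changed relative to the previous turn, ignoring deletions) are all assumptions made for illustration. The FGA formula itself is not given in the abstract and is not reproduced here.

```python
from typing import Dict, List

# A belief state maps slot names to values (hypothetical representation).
State = Dict[str, str]


def joint_goal_accuracy(gold: List[State], pred: List[State]) -> float:
    """Fraction of turns whose predicted state exactly matches the gold state."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of turns")
    if not gold:
        return 0.0
    matches = sum(g == p for g, p in zip(gold, pred))
    return matches / len(gold)


def turn_updates(states: List[State]) -> List[State]:
    """Non-cumulative view: slots added or changed at each turn.

    Simplifying assumption: slot deletions are ignored.
    """
    updates: List[State] = []
    prev: State = {}
    for state in states:
        updates.append({k: v for k, v in state.items() if prev.get(k) != v})
        prev = state
    return updates


def turn_level_accuracy(gold: List[State], pred: List[State]) -> float:
    """Exact-match accuracy computed on per-turn updates instead of full states."""
    return joint_goal_accuracy(turn_updates(gold), turn_updates(pred))


# One dialogue with three turns; the only real error is in the first turn.
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
pred = [
    {"hotel-area": "south"},  # misprediction
    {"hotel-area": "south", "hotel-stars": "4"},
    {"hotel-area": "south", "hotel-stars": "4", "hotel-parking": "yes"},
]

print(joint_goal_accuracy(gold, pred))  # every cumulative state carries the early error
print(turn_level_accuracy(gold, pred))  # later turns are locally correct
```

In this toy dialogue the single early error makes every cumulative state wrong, so JGA is zero, while two of the three turn-level updates are correct. These locally correct mispredictions are the cases to which, per the abstract, FGA assigns penalized rewards.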

Paper Title