Paper title
Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI
Authors
Abstract
Evaluating the faithfulness of an explanation is desirable for many reasons, such as trust, interpretability, and diagnosing the sources of a model's errors. In this work, which focuses on the NLI task, we introduce the methodology of Faithfulness-through-Counterfactuals, which first generates a counterfactual hypothesis based on the logical predicates expressed in the explanation, and then evaluates whether the model's prediction on the counterfactual is consistent with that expressed logic (i.e., whether the new formula is \textit{logically satisfiable}). In contrast to existing approaches, this does not require any explanations for training a separate verification model. We first validate the efficacy of automatic counterfactual hypothesis generation, leveraging the few-shot priming paradigm. Next, we show that our proposed metric distinguishes between human-model agreement and disagreement on new counterfactual input. In addition, we conduct a sensitivity analysis to validate that our metric is sensitive to unfaithful explanations.