Paper Title

Evaluation of Semantic Answer Similarity Metrics

Authors

Farida Mustafazade, Peter F. Ebbinghaus

Abstract


There are several issues with the existing general machine translation and natural language generation evaluation metrics, and question-answering (QA) systems are indifferent in that context. To build robust QA systems, we need equivalently robust evaluation systems to verify whether model predictions to questions are similar to ground-truth annotations. The ability to compare similarity based on semantics, as opposed to pure string overlap, is important for comparing models fairly and for indicating more realistic acceptance criteria in real-life applications. We build upon the first paper, to our knowledge, that uses transformer-based model metrics to assess semantic answer similarity, and we achieve higher correlations with human judgement in the case of no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To the best of our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.
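To illustrate the failure mode the abstract describes, here is a minimal sketch (not code from the paper) of the SQuAD-style token-overlap F1 metric commonly used to score QA predictions. Co-referent name strings with no lexical overlap, exactly the kind of pair in the proposed dataset, receive a score of zero under such a metric even though they denote the same person:

```python
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """Classic lexical-overlap F1 between whitespace tokens (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Multiset intersection counts how many tokens are shared.
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Co-referent names with no shared tokens score 0.0 lexically:
print(token_f1("JFK", "John F. Kennedy"))            # 0.0
# Partial overlap yields a partial score:
print(token_f1("John Kennedy", "John F. Kennedy"))   # 0.8
```

A semantic metric of the kind the paper proposes (e.g. a bi-encoder scored by embedding similarity, or BERTScore) would instead assign a high similarity to the first pair, which is why it correlates better with human judgement when lexical overlap is absent.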
