Paper Title

Pointwise Paraphrase Appraisal is Potentially Problematic

Authors

Hannah Chen, Yangfeng Ji, David Evans

Abstract

The prevailing approach for training and evaluating paraphrase identification models is constructed as a binary classification problem: the model is given a pair of sentences, and is judged by how accurately it classifies pairs as either paraphrases or non-paraphrases. This pointwise evaluation method does not align well with the objective of most real-world applications, so the goal of our work is to understand how models that perform well under pointwise evaluation may fail in practice and to find better methods for evaluating paraphrase identification models. As a first step towards that goal, we show that although the standard way of fine-tuning BERT for paraphrase identification, pairing two sentences as one sequence, results in a model with state-of-the-art performance, that model may perform poorly on simple tasks like identifying pairs of two identical sentences. Moreover, we show that these models may even assign a pair of randomly-selected sentences a higher paraphrase score than a pair of identical ones.
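The probe the abstract describes is easy to reproduce. Below is a minimal sketch (not the authors' code) that encodes two sentences as a single BERT input sequence and compares the paraphrase score of an identical pair against a random pair; the checkpoint name and example sentences are assumptions for illustration, and any BERT-style classifier fine-tuned on a paraphrase dataset such as MRPC would work.

```python
# Sketch of the identical-pair vs. random-pair probe from the abstract.
# Assumes an MRPC-fine-tuned BERT checkpoint; not the authors' implementation.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-base-cased-finetuned-mrpc"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def paraphrase_score(s1: str, s2: str) -> float:
    """Encode the pair as one [CLS] s1 [SEP] s2 [SEP] sequence and
    return the softmax probability of the 'paraphrase' label
    (label index 1 under the MRPC convention)."""
    inputs = tokenizer(s1, s2, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

sentence = "The quick brown fox jumps over the lazy dog."
random_other = "Stock markets closed slightly lower on Friday."

# Per the paper's finding, the second score can exceed the first.
print("identical pair:", paraphrase_score(sentence, sentence))
print("random pair:   ", paraphrase_score(sentence, random_other))
```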
