Paper Title


An Application of Pseudo-Log-Likelihoods to Natural Language Scoring

Authors

Darren Abramson, Ali Emami

Abstract


Language models built using semi-supervised machine learning on large corpora of natural language have very quickly enveloped the fields of natural language generation and understanding. In this paper we apply a zero-shot approach, independently developed by a number of researchers and now gaining recognition, as a significant alternative to fine-tuning for evaluation on common sense tasks. A language model with relatively few parameters and training steps compared to a more recent language model (T5) can outperform it on a recent large data set (TimeDial), while displaying robustness in its performance across a similar class of language tasks. Surprisingly, this result is achieved by using a hyperparameter-free zero-shot method with the smaller model, compared to fine-tuning the larger model. We argue that the robustness of the smaller model ought to be understood in terms of compositionality, in a sense that we draw from recent literature on a class of similar models. We identify a practical cost for our method and model: high GPU-time for natural language evaluation. The zero-shot measurement technique that produces remarkable stability, both for ALBERT and other BERT variants, is an application of pseudo-log-likelihoods to masked language models for the relative measurement of probability for substitution alternatives in forced choice language tasks such as the Winograd Schema Challenge, Winogrande, and others. One contribution of this paper is to bring together a number of similar, but independent strands of research. We produce some absolute state-of-the-art results for common sense reasoning in binary choice tasks, performing better than any published result in the literature, including fine-tuned efforts. We show a remarkable consistency of the model's performance under adversarial settings, which we argue is best explained by the model's compositionality of representations.
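The pseudo-log-likelihood (PLL) scoring the abstract describes sums, over every token position, the log-probability a masked language model assigns to the true token when that position is masked; the forced-choice answer is the candidate sentence with the higher PLL. The minimal sketch below illustrates that scoring loop only. It does not use ALBERT or BERT: the masked-token probability comes from a toy unigram model over a tiny made-up corpus (an assumption purely for illustration), whereas a real MLM would condition on the full masked context.

```python
import math
from collections import Counter

# Toy stand-in for a masked language model such as ALBERT/BERT.
# Assumption for illustration: token probabilities come from unigram
# frequencies over this tiny corpus rather than from a trained MLM.
CORPUS = ("the trophy does not fit in the suitcase because it is too big "
          "the trophy does not fit in the suitcase because the trophy is too big").split()
COUNTS = Counter(CORPUS)
TOTAL = sum(COUNTS.values())


def masked_token_logprob(tokens, position):
    """Log P(tokens[position] | tokens with that position masked).

    A real MLM scores the masked position conditioned on the rest of
    the sentence; this unigram stand-in ignores the context entirely.
    Laplace smoothing keeps unseen tokens at nonzero probability.
    """
    token = tokens[position]
    return math.log((COUNTS[token] + 1) / (TOTAL + len(COUNTS)))


def pseudo_log_likelihood(sentence):
    """PLL(s) = sum over positions t of log P(w_t | s with w_t masked)."""
    tokens = sentence.lower().split()
    return sum(masked_token_logprob(tokens, t) for t in range(len(tokens)))


def forced_choice(candidates):
    """Pick the substitution alternative with the highest PLL,
    as in Winograd-style binary choice tasks."""
    return max(candidates, key=pseudo_log_likelihood)
```

With a real MLM the only change is inside `masked_token_logprob`: replace the token at `position` with the mask token, run one forward pass, and read off the log-probability of the original token, so scoring a sentence costs one model call per token. That per-token forward pass is the high GPU-time cost the abstract identifies.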
