Paper Title
RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms
Paper Authors
Paper Abstract
Pre-trained language models (PTLMs) have achieved impressive performance on commonsense inference benchmarks, but their ability to employ commonsense to make robust inferences, which is crucial for effective communication with humans, is debated. In pursuit of advancing fluid human-AI communication, we propose a new challenge, RICA: Robust Inference capability based on Commonsense Axioms, which evaluates robust commonsense inference despite textual perturbations. To generate data for this challenge, we develop a systematic and scalable procedure using commonsense knowledge bases and probe PTLMs across two different evaluation settings. Extensive experiments on our generated probe sets, comprising more than 10k statements, show that PTLMs perform no better than random guessing in the zero-shot setting, are heavily impacted by statistical biases, and are not robust to perturbation attacks. We also find that fine-tuning on similar statements offers limited gains, as PTLMs still fail to generalize to unseen inferences. Our new large-scale benchmark exposes a significant gap between PTLMs and human-level language understanding and offers a new challenge for PTLMs to demonstrate commonsense.
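As a rough illustration of the zero-shot probing the abstract describes, the sketch below scores a masked commonsense statement with an off-the-shelf masked language model and checks whether it prefers the correct completion over a contrastive alternative. This is a minimal sketch, not the authors' released code: the probe sentence, the "more"/"less" candidate pair, and the choice of roberta-base are illustrative assumptions.

```python
# Minimal zero-shot masked-word probe (illustrative; not the paper's code).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "roberta-base"  # assumption: any masked LM would do here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# Hypothetical RICA-style probe: novel entities plus a commonsense axiom.
statement = f"A is taller than B, so A is {tokenizer.mask_token} likely to reach the shelf."
inputs = tokenizer(statement, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and compare the two candidate completions.
mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
candidates = [" more", " less"]  # leading space matters for RoBERTa's BPE vocabulary
candidate_ids = [tokenizer.convert_tokens_to_ids(tokenizer.tokenize(c)[0]) for c in candidates]
scores = logits[0, mask_index, :].squeeze(0)[candidate_ids]

prediction = candidates[int(scores.argmax())].strip()
print(f"Model prefers '{prediction}' (expected: 'more')")
```

Under this setup, a model that reasons robustly should keep preferring the correct word even when the statement is perturbed (e.g., paraphrased or negated), which is the kind of consistency the benchmark tests.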