Paper Title
Knowledge-in-Context: Towards Knowledgeable Semi-Parametric Language Models
Paper Authors
Paper Abstract
Fully-parametric language models generally require a huge number of model parameters to store the necessary knowledge for solving multiple natural language tasks in zero/few-shot settings. In addition, it is hard to adapt to evolving world knowledge without costly model re-training. In this paper, we develop a novel semi-parametric language model architecture, Knowledge-in-Context (KiC), which empowers a parametric text-to-text language model with a knowledge-rich external memory. Specifically, the external memory contains six different types of knowledge: entity, dictionary, commonsense, event, script, and causality knowledge. For each input instance, the KiC model adaptively selects a knowledge type and retrieves the most helpful pieces of knowledge. The input instance along with its knowledge augmentation is fed into a text-to-text model (e.g., T5) to generate the output answer, where both the input and the output are in natural language form after prompting. Interestingly, we find that KiC can be identified as a special mixture-of-experts (MoE) model, where the knowledge selector plays the role of a router that determines the sequence-to-expert assignment in MoE. This key observation inspires us to develop a novel algorithm for training KiC with an instance-adaptive knowledge selector. As a knowledge-rich semi-parametric language model, KiC only needs a much smaller parametric part to achieve superior zero-shot performance on unseen tasks. By evaluating on 40+ different tasks, we show that KiC-Large with 770M parameters outperforms large language models (LMs) that are 4-39x larger by a large margin. We also demonstrate that KiC exhibits emergent abilities at a much smaller model scale compared to fully-parametric models.
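To make the pipeline described in the abstract concrete, below is a minimal illustrative sketch of the KiC flow: an instance-adaptive knowledge selector (playing the MoE-style router role) picks one of the six knowledge types, a retriever pulls the most relevant snippet from that part of the external memory, and the knowledge-augmented natural-language prompt is assembled for a text-to-text backbone such as T5. The toy memory contents, the lexical-overlap scoring heuristic, and all function names are hypothetical stand-ins for exposition; the paper's actual selector and retriever are learned components, not rule-based ones.

```python
# Illustrative sketch of the KiC pipeline (assumptions, not the paper's code):
# 1) select a knowledge type per instance, 2) retrieve the most helpful piece,
# 3) build the knowledge-augmented prompt for a text-to-text model (e.g., T5).

KNOWLEDGE_TYPES = ["entity", "dictionary", "commonsense", "event", "script", "causality"]

# Toy external memory: one list of knowledge snippets per knowledge type.
EXTERNAL_MEMORY = {
    "entity": ["T5 is a text-to-text transformer language model."],
    "dictionary": ["semi-parametric: combining model parameters with an external memory."],
    "commonsense": ["People consult references when they lack knowledge."],
    "event": ["A model is re-trained after its knowledge becomes outdated."],
    "script": ["To answer a question: read it, recall relevant facts, compose an answer."],
    "causality": ["Storing knowledge externally reduces the parameters a model needs."],
}


def lexical_overlap(a: str, b: str) -> int:
    """Crude relevance score: number of shared lowercase tokens."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def select_knowledge_type(instance: str) -> str:
    """Instance-adaptive selector (the MoE-style router). In KiC this is a
    learned component trained jointly with the model; here it is approximated
    by the knowledge type with the best lexical-overlap score."""
    return max(
        KNOWLEDGE_TYPES,
        key=lambda kt: max(lexical_overlap(instance, s) for s in EXTERNAL_MEMORY[kt]),
    )


def retrieve(instance: str, ktype: str) -> str:
    """Retrieve the most helpful snippet of the selected knowledge type."""
    return max(EXTERNAL_MEMORY[ktype], key=lambda s: lexical_overlap(instance, s))


def build_prompt(instance: str) -> str:
    """Assemble the knowledge-augmented natural-language input that would be
    fed to the text-to-text backbone's generation step."""
    ktype = select_knowledge_type(instance)
    knowledge = retrieve(instance, ktype)
    return f"knowledge ({ktype}): {knowledge}\nquestion: {instance}\nanswer:"


if __name__ == "__main__":
    print(build_prompt("Why does a semi-parametric model need fewer parameters?"))
```

In the actual KiC training recipe, the selector's routing decision is optimized together with the backbone, which is what licenses the mixture-of-experts interpretation; the heuristic router above only mirrors the interface of that component.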