Paper Title
Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space
Paper Authors
Paper Abstract
Transformer-based language models (LMs) are at the core of modern NLP, but their internal prediction construction process is opaque and largely not understood. In this work, we make a substantial step towards unveiling this underlying prediction process by reverse-engineering the operation of the feed-forward network (FFN) layers, one of the building blocks of transformer models. We view the token representation as a changing distribution over the vocabulary, and the output from each FFN layer as an additive update to that distribution. Then, we analyze the FFN updates in the vocabulary space, showing that each update can be decomposed into sub-updates corresponding to single FFN parameter vectors, each promoting concepts that are often human-interpretable. We then leverage these findings for controlling LM predictions, where we reduce the toxicity of GPT2 by almost 50%, and for improving computation efficiency with a simple early exit rule, saving 20% of computation on average.
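To make the decomposition concrete, the following is a minimal sketch (not the authors' released code) of the core operation the abstract describes: projecting a single FFN parameter ("value") vector of a GPT-2 layer into the vocabulary space through the output embedding, and inspecting which tokens that sub-update promotes. The layer index, vector index, and tensor attribute names are illustrative and assume the Hugging Face transformers implementation of GPT-2.

# Sketch: project one FFN value vector into the vocabulary space (GPT-2, Hugging Face).
# Assumes the usual HF tensor layout; layer/dim choices below are arbitrary examples.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

layer, dim = 10, 42  # which FFN layer and which parameter vector to inspect (illustrative)

# In HF GPT-2, mlp.c_proj.weight has shape (d_ff, d_model); row `dim` is the value
# vector whose scaled copy the FFN adds to the residual stream (the sub-update).
value_vec = model.transformer.h[layer].mlp.c_proj.weight[dim]  # shape: (d_model,)
unembed = model.lm_head.weight                                  # shape: (vocab_size, d_model)

with torch.no_grad():
    # Project the sub-update onto the vocabulary and take the most-promoted tokens.
    logits = unembed @ value_vec
    top = torch.topk(logits, k=10)

print([tokenizer.decode([int(t)]) for t in top.indices])

If the value vector encodes a human-interpretable concept, the printed top tokens tend to share a clear theme; scanning many (layer, dim) pairs this way is one cheap way to reproduce the flavor of the analysis the abstract refers to.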