Paper Title
Positional Artefacts Propagate Through Masked Language Model Embeddings
Paper Authors
Paper Abstract
In this work, we demonstrate that the contextualized word vectors derived from pretrained masked language model-based encoders share a common, perhaps undesirable pattern across layers. Namely, we find cases of persistent outlier neurons within BERT and RoBERTa's hidden state vectors that consistently bear the smallest or largest values in said vectors. In an attempt to investigate the source of this information, we introduce a neuron-level analysis method, which reveals that the outliers are closely related to information captured by positional embeddings. We also pre-train the RoBERTa-base models from scratch and find that the outliers disappear without using positional embeddings. These outliers, we find, are the major cause of anisotropy of encoders' raw vector spaces, and clipping them leads to increased similarity across vectors. We demonstrate this in practice by showing that clipped vectors can more accurately distinguish word senses, as well as lead to better sentence embeddings when mean pooling. In three supervised tasks, we find that clipping does not affect the performance.
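The abstract does not spell out the clipping operation. As a rough illustration only, the sketch below assumes "clipping" means truncating each hidden-state value to a percentile-based range before mean pooling the token vectors into a sentence embedding. The NumPy stand-in for encoder hidden states, the percentile thresholds, and the outlier index are hypothetical choices for this demo, not details taken from the paper.

import numpy as np

def clip_and_pool(hidden_states, low_pct=1.0, high_pct=99.0):
    # hidden_states: array of shape (num_tokens, hidden_dim), e.g. one encoder
    # layer's output for a single sentence.
    # Truncate extreme neuron values to an assumed percentile range, then
    # mean-pool over tokens to obtain a single sentence vector.
    low, high = np.percentile(hidden_states, [low_pct, high_pct])
    clipped = np.clip(hidden_states, low, high)
    return clipped.mean(axis=0)

# Stand-in for RoBERTa-base hidden states: 12 tokens, 768 dimensions.
rng = np.random.default_rng(0)
states = rng.normal(size=(12, 768))
states[:, 42] = 30.0  # inject an artificial outlier neuron (index chosen arbitrarily)
sentence_vec = clip_and_pool(states)
print(sentence_vec.shape)  # (768,)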