Paper Title
VAST: The Valence-Assessing Semantics Test for Contextualizing Language Models
Paper Authors
Paper Abstract
VAST, the Valence-Assessing Semantics Test, is a novel intrinsic evaluation task for contextualized word embeddings (CWEs). VAST uses valence, the association of a word with pleasantness, to measure the correspondence of word-level LM semantics with widely used human judgments, and examines the effects of contextualization, tokenization, and LM-specific geometry. Because prior research has found that CWEs from GPT-2 perform poorly on other intrinsic evaluations, we select GPT-2 as our primary subject, and include results showing that VAST is useful for 7 other LMs, and can be used in 7 languages. GPT-2 results show that the semantics of a word incorporate the semantics of context in layers closer to model output, such that VAST scores diverge between our contextual settings, ranging from Pearson's rho of .55 to .77 in layer 11. We also show that multiply tokenized words are not semantically encoded until layer 8, where they achieve Pearson's rho of .46, indicating the presence of an encoding process for multiply tokenized words which differs from that of singly tokenized words, for which rho is highest in layer 0. We find that a few neurons with values having greater magnitude than the rest mask word-level semantics in GPT-2's top layer, but that word-level semantics can be recovered by nullifying non-semantic principal components: Pearson's rho in the top layer improves from .32 to .76. After isolating semantics, we show the utility of VAST for understanding LM semantics via improvements over related work on four word similarity tasks, with a score of .50 on SimLex-999, better than the previous best of .45 for GPT-2. Finally, we show that 8 of 10 WEAT bias tests, which compare differences in word embedding associations between groups of words, exhibit more stereotype-congruent biases after isolating semantics, indicating that non-semantic structures in LMs also mask biases.
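To make the evaluation concrete, below is a minimal Python sketch of a VAST-style valence score under stated assumptions: each word-level CWE is scored with an SC-WEAT-style cosine association against pleasant and unpleasant attribute embeddings, and the resulting associations are correlated with human valence norms via Pearson's rho. The attribute sets and norms are placeholders supplied by the caller; the paper's exact lexica and association measure are not reproduced here.

import numpy as np
from scipy.stats import pearsonr

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def valence_association(w, pleasant, unpleasant):
    # SC-WEAT-style effect size: difference of mean cosine similarity to
    # pleasant vs. unpleasant attribute embeddings, normalized by the
    # standard deviation over all attribute similarities.
    sims_p = [cosine(w, a) for a in pleasant]
    sims_u = [cosine(w, b) for b in unpleasant]
    return (np.mean(sims_p) - np.mean(sims_u)) / np.std(sims_p + sims_u, ddof=1)

def vast_score(word_embs, human_valence, pleasant, unpleasant):
    # Pearson's rho between model valence associations and human judgments.
    assoc = [valence_association(w, pleasant, unpleasant) for w in word_embs]
    rho, _ = pearsonr(assoc, human_valence)
    return rho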
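The recovery step the abstract describes (nullifying non-semantic principal components) resembles the standard remove-the-mean-and-top-components post-processing of Mu and Viswanath (2018). The sketch below assumes that recipe; the number of components k to nullify is an experimenter's choice, and the paper's exact selection is not reproduced here.

from sklearn.decomposition import PCA

def nullify_top_components(embs, k=1):
    # Remove the mean and project out the top-k principal components, which
    # can be dominated by a few high-magnitude neurons in GPT-2's top layer.
    centered = embs - embs.mean(axis=0)
    pca = PCA(n_components=k).fit(centered)
    top = pca.components_  # shape (k, d)
    return centered - centered @ top.T @ top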
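Finally, the WEAT comparison at the end of the abstract measures differential association between two target word groups and two attribute groups. A compact sketch of the standard effect size from Caliskan et al. (2017), reusing cosine from the first sketch:

def weat_effect_size(X, Y, A, B):
    # WEAT effect size d: how much more strongly targets X associate with
    # attributes A over B, relative to targets Y.
    def s(w):
        return (np.mean([cosine(w, a) for a in A])
                - np.mean([cosine(w, b) for b in B]))
    sx = [s(x) for x in X]
    sy = [s(y) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)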