Paper Title

Balancing the composition of word embeddings across heterogenous data sets

Authors

Brandl, Stephanie, Lassner, David, Alber, Maximilian

Abstract

Word embeddings capture semantic relationships based on contextual information and are the basis for a wide variety of natural language processing applications. Notably, these relationships are learned solely from the data, and consequently the data composition affects the semantics of the embeddings, which arguably can lead to biased word vectors. Given qualitatively different data subsets, we aim to align the influence of single subsets on the resulting word vectors while retaining their quality. To this end, we propose a criterion to measure the shift towards a single data subset and develop approaches to meet both objectives. We find that a weighted average of the two subset embeddings balances the influence of those subsets, at the cost of a decrease in word-similarity performance. We further propose a promising optimization approach to balance both the influences and the quality of word embeddings.
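The weighted-average idea from the abstract can be sketched minimally: given two embedding matrices trained on different data subsets and aligned to a shared vocabulary, blend them row-wise with a mixing weight. This is an illustrative sketch, not the authors' implementation; the function name, the uniform weight `alpha`, and the assumption that the two matrices are already aligned in the same vector space are all hypothetical simplifications.

```python
import numpy as np

def weighted_average_embeddings(E_a: np.ndarray,
                                E_b: np.ndarray,
                                alpha: float = 0.5) -> np.ndarray:
    """Blend two aligned embedding matrices as alpha * E_a + (1 - alpha) * E_b.

    E_a, E_b: (vocab_size, dim) matrices for the same vocabulary, trained on
    two qualitatively different data subsets (hypothetical setup).
    alpha: assumed mixing weight controlling each subset's influence.
    """
    assert E_a.shape == E_b.shape, "matrices must cover the same vocabulary"
    return alpha * E_a + (1.0 - alpha) * E_b

# Toy example: a 3-word vocabulary with 4-dimensional vectors.
rng = np.random.default_rng(0)
E_a = rng.normal(size=(3, 4))
E_b = rng.normal(size=(3, 4))
E_balanced = weighted_average_embeddings(E_a, E_b, alpha=0.5)
```

In practice, embeddings trained separately are not automatically in the same space, so an alignment step (e.g. an orthogonal Procrustes mapping) would typically precede such averaging.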
