Paper Title
Morphological Skip-Gram: Using morphological knowledge to improve word representation
Paper Authors
Paper Abstract
Natural language processing models have attracted much interest in the deep learning community. This branch of study comprises applications such as machine translation, sentiment analysis, named entity recognition, question answering, and others. Word embeddings are continuous word representations; they are an essential module for these applications and are generally used as the input word representation for deep learning models. Word2Vec and GloVe are two popular methods for learning word embeddings. They achieve good word representations; however, they learn from limited information, because they ignore the morphological information of words and consider only one representation vector for each word. This approach implies that Word2Vec and GloVe are unaware of the inner structure of words. To mitigate this problem, the FastText model represents each word as a bag of character n-grams. Each n-gram has a continuous vector representation, and the final word representation is the sum of its character n-gram vectors. Nevertheless, using all character n-grams of a word is a poor approach, since some n-grams have no semantic relation to the word and increase the amount of potentially useless information; it also lengthens the training phase. In this work, we propose a new method for training word embeddings whose goal is to replace FastText's bag of character n-grams with a bag of word morphemes obtained through morphological analysis of the word. Thus, words with similar contexts and morphemes are represented by vectors close to each other. To evaluate our new approach, we performed intrinsic evaluations over 15 different tasks, and the results show competitive performance compared to FastText.
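The contrast described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the n-gram range of 3–6 follows FastText's defaults, the morpheme segmentation `["un", "happi", "ness"]` is an assumed output of a morphological analyzer, and random vectors stand in for the learned sub-unit embeddings.

```python
import numpy as np

DIM = 4  # embedding dimension, kept tiny for illustration
rng = np.random.default_rng(0)

def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def toy_table(units):
    """Toy embedding table: one random vector per sub-unit (stands in for learned vectors)."""
    return {u: rng.standard_normal(DIM) for u in units}

# FastText: the word vector is the sum of its character n-gram vectors.
grams = char_ngrams("unhappiness")
gram_vecs = toy_table(grams)
v_fasttext = sum(gram_vecs[g] for g in grams)

# Proposed idea: replace the bag of n-grams with a bag of morphemes.
morphemes = ["un", "happi", "ness"]  # assumed analyzer output, not computed here
morph_vecs = toy_table(morphemes)
v_morph = sum(morph_vecs[m] for m in morphemes)

print(len(grams), "character n-grams vs.", len(morphemes), "morphemes")
```

Note how much smaller the morpheme bag is (3 sub-units against 38 n-grams for "unhappiness"), which is the abstract's point about discarding potentially useless sub-units and shortening training.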