Paper Title

Domain-Specific Word Embeddings with Structure Prediction

Paper Authors

Brandl, Stephanie, Lassner, David, Baillot, Anne, Nakajima, Shinichi

Abstract

Complementary to finding good general word embeddings, an important question for representation learning is to find dynamic word embeddings, e.g., across time or domain. Current methods do not offer a way to use or predict information on structure between sub-corpora, time or domain and dynamic embeddings can only be compared after post-alignment. We propose novel word embedding methods that provide general word representations for the whole corpus, domain-specific representations for each sub-corpus, sub-corpus structure, and embedding alignment simultaneously. We present an empirical evaluation on New York Times articles and two English Wikipedia datasets with articles on science and philosophy. Our method, called Word2Vec with Structure Prediction (W2VPred), provides better performance than baselines in terms of the general analogy tests, domain-specific analogy tests, and multiple specific word embedding evaluations as well as structure prediction performance when no structure is given a priori. As a use case in the field of Digital Humanities we demonstrate how to raise novel research questions for high literature from the German Text Archive.
