Paper Title
A Novel Method of Extracting Topological Features from Word Embeddings
Paper Authors
Paper Abstract
In recent years, topological data analysis has been applied to a wide range of problems involving high-dimensional, noisy data. Although text representations are often high-dimensional and noisy, only a few works have applied topological data analysis to natural language processing. In this paper, we introduce a novel algorithm that extracts topological features from the word embedding representation of a text, which can then be used for text classification. Operating on word embeddings, topological data analysis can interpret the high-dimensional embedding space and discover relations among different embedding dimensions. We use persistent homology, the most commonly used tool from topological data analysis, in our experiments. Evaluating our topological algorithm on long textual documents, we show that the topological features we define may outperform conventional text mining features.