Enhancement of Short Text Clustering by Iterative Classification

Paper Title

Enhancement of Short Text Clustering by Iterative Classification

Authors

Rakib, Md Rashadul Hasan; Zeh, Norbert; Jankowska, Magdalena; Milios, Evangelos

Abstract

Short text clustering is a challenging task due to the lack of signal contained in such short texts. In this work, we propose iterative classification as a method to boost the clustering quality (e.g., accuracy) of short texts. Given a clustering of short texts obtained using an arbitrary clustering algorithm, iterative classification applies outlier removal to obtain outlier-free clusters. Then it trains a classification algorithm using the non-outliers based on their cluster distributions. Using the trained classification model, iterative classification reclassifies the outliers to obtain a new set of clusters. By repeating this several times, we obtain a much improved clustering of texts. Our experimental results show that the proposed clustering enhancement method not only improves the clustering quality of different clustering methods (e.g., k-means, k-means--, and hierarchical clustering) but also outperforms the state-of-the-art short text clustering methods on several short text datasets by a statistically significant margin.
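
The loop described in the abstract (remove outliers, train a classifier on the remaining texts, reclassify the outliers, repeat) can be illustrated with a minimal sketch. The abstract does not say which outlier detector or classifier is used, so the sketch below swaps in simple stand-ins that are assumptions, not the authors' method: the texts are assumed to be already converted to numeric vectors, outliers are the members farthest from their cluster centroid, and the "classifier" is a nearest-centroid rule fitted on the non-outliers. The function name `iterative_classification` and the parameters `iterations` and `keep_fraction` are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def iterative_classification(vectors, labels, iterations=5, keep_fraction=0.7):
    """Refine an initial clustering by repeatedly reclassifying outliers.

    vectors: one numeric feature vector per text (assumed precomputed).
    labels: initial cluster label per text, from any clustering algorithm.
    keep_fraction: hypothetical parameter, the share of each cluster that
        is kept as non-outliers in every round.
    """
    labels = list(labels)
    for _ in range(iterations):
        # Group item indices by their current cluster label.
        clusters = defaultdict(list)
        for index, label in enumerate(labels):
            clusters[label].append(index)

        models = {}
        outliers = []
        for label, members in clusters.items():
            # Outlier removal (stand-in): rank members by distance to the
            # cluster centroid and treat the farthest ones as outliers.
            center = centroid([vectors[i] for i in members])
            ranked = sorted(members, key=lambda i: distance(vectors[i], center))
            kept = max(1, math.ceil(keep_fraction * len(ranked)))
            # Training (stand-in): the model for this cluster is the
            # centroid of its non-outliers only.
            models[label] = centroid([vectors[i] for i in ranked[:kept]])
            outliers.extend(ranked[kept:])

        # Reclassify every outlier with the nearest-centroid model.
        changed = False
        for i in outliers:
            new_label = min(models, key=lambda c: distance(vectors[i], models[c]))
            if new_label != labels[i]:
                labels[i] = new_label
                changed = True
        if not changed:
            break  # Labels are stable, so further rounds would change nothing.
    return labels


# Toy data: two well-separated groups; the last point starts in the wrong cluster.
toy_vectors = [[0, 0], [0, 1], [1, 0], [1, 1],
               [10, 10], [10, 11], [11, 10], [11, 11]]
initial_labels = [0, 0, 0, 0, 1, 1, 1, 0]
refined_labels = iterative_classification(toy_vectors, initial_labels)
print(refined_labels)
```

On this toy input the mislabeled point is flagged as an outlier of cluster 0 in the first round and reassigned to cluster 1. A real pipeline would replace both stand-ins with a proper outlier detector and a trained text classifier.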
