Paper Title

Towards Automatic Clustering Analysis using Traces of Information Gain: The InfoGuide Method

Authors

Paulo Rocha, Diego Pinheiro, Martin Cadeiras, Carmelo Bastos-Filho

Abstract

Clustering analysis has become a ubiquitous information retrieval tool in a wide range of domains, but a more automatic framework is still lacking. Although internal metrics are key to the successful retrieval of clusters, their effectiveness on real-world datasets is not fully understood, mainly because they rest on unrealistic assumptions about the underlying datasets. We hypothesized that capturing traces of information gain between increasingly complex clustering retrievals (InfoGuide) enables an automatic clustering analysis with improved clustering retrievals. We validated the InfoGuide hypothesis by capturing the traces of information gain using the Kolmogorov-Smirnov statistic and comparing the clusters retrieved by InfoGuide against those retrieved by other commonly used internal metrics on artificially generated, benchmark, and real-world datasets. Our results suggest that InfoGuide can enable a more automatic clustering analysis and may be more suitable for retrieving clusters in real-world datasets displaying nontrivial statistical properties.
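
The abstract names the Kolmogorov-Smirnov statistic as the tool used to capture traces of information gain, but does not say which distributions are compared between successive clustering retrievals. As a reminder of what the statistic itself measures, here is a minimal Python sketch of the two-sample version. The function name `ks_statistic` and the sample values are illustrative assumptions; this is not the InfoGuide procedure.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of
    samples `a` and `b`. Illustrative helper only: how InfoGuide applies
    the statistic to clustering retrievals is not given in the abstract.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions that only change at sample
    # points, so the largest gap is attained at one of those points.
    for v in xs + ys:
        fa = bisect_right(xs, v) / len(xs)  # fraction of a that is <= v
        fb = bisect_right(ys, v) / len(ys)  # fraction of b that is <= v
        d = max(d, abs(fa - fb))
    return d


# Hypothetical samples: partially overlapping value ranges.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

The statistic is 0 for identical samples and 1 for samples whose value ranges do not overlap, which is why it can serve as a bounded measure of how much one distribution differs from another.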
