Topological Data Analysis on Simple English Wikipedia Articles

Paper Title

Topological Data Analysis on Simple English Wikipedia Articles

Authors

Wright, Matthew; Zheng, Xiaojun

Abstract

Single-parameter persistent homology, a key tool in topological data analysis, has been widely applied to data problems along with statistical techniques that quantify the significance of the results. In contrast, statistical techniques for two-parameter persistence, while highly desirable for real-world applications, have scarcely been considered. We present three statistical approaches for comparing geometric data using two-parameter persistent homology; these approaches rely on the Hilbert function, matching distance, and barcodes obtained from two-parameter persistence modules computed from the point-cloud data. Our statistical methods are broadly applicable for analysis of geometric data indexed by a real-valued parameter. We apply these approaches to analyze high-dimensional point-cloud data obtained from Simple English Wikipedia articles. In particular, we show how our methods can be utilized to distinguish certain subsets of the Wikipedia data and to compare with random data. These results yield insights into the construction of null distributions and stability of our methods with respect to noisy data.
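As a rough illustration of what a Hilbert-function-based comparison might look like, the sketch below runs a generic permutation test on two groups of Hilbert functions. It is not taken from the paper: the test statistic (mean L1 distance between groups), the function names, and the representation of each Hilbert function as a nested list of dimensions sampled on a shared 2D parameter grid are all assumptions made for this example. Computing the Hilbert functions themselves from point-cloud data is assumed to happen elsewhere.

```python
import random


def l1_distance(f, g):
    """L1 distance between two Hilbert functions sampled on the same 2D grid."""
    return sum(abs(a - b) for row_f, row_g in zip(f, g) for a, b in zip(row_f, row_g))


def mean_cross_distance(group_a, group_b):
    """Average L1 distance over all pairs with one member from each group."""
    total = sum(l1_distance(f, g) for f in group_a for g in group_b)
    return total / (len(group_a) * len(group_b))


def permutation_test(group_a, group_b, n_permutations=1000, seed=0):
    """Return (observed statistic, permutation p-value).

    Hypothetical helper: the null hypothesis is that group labels are
    exchangeable, so labels are reshuffled and the statistic recomputed.
    """
    rng = random.Random(seed)
    observed = mean_cross_distance(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean_cross_distance(pooled[:n_a], pooled[n_a:]) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)
```

A small p-value would suggest that the two groups of Hilbert functions differ more than random relabeling explains; how the null distribution should actually be constructed is one of the questions the abstract says the paper addresses.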

