Title
Statistical power for cluster analysis
Authors
Abstract
Cluster algorithms are increasingly popular in biomedical research due to their compelling ability to identify discrete subgroups in data, and their increasing accessibility in mainstream software. While guidelines exist for algorithm selection and outcome evaluation, there are no firmly established ways of computing a priori statistical power for cluster analysis. Here, we estimated power and accuracy for common analysis pipelines through simulation. We varied subgroup size, number, separation (effect size), and covariance structure. We then subjected the generated datasets to dimensionality reduction (none, multidimensional scaling, or UMAP) and to cluster algorithms (k-means; agglomerative hierarchical clustering with Ward or average linkage and Euclidean or cosine distance; and HDBSCAN). Finally, we compared the statistical power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling approaches (which include latent profile and latent class analysis). We found that outcomes were driven by large effect sizes, or by the accumulation of many smaller effects across features, and were unaffected by differences in covariance structure. Sufficient statistical power was achieved with relatively small samples (N=20 per subgroup), provided that cluster separation was large (Δ=4). Fuzzy clustering provided a more parsimonious and powerful alternative for identifying separable multivariate normal distributions, particularly those with slightly lower centroid separation (Δ=3). Overall, we recommend that researchers 1) only apply cluster analysis when large subgroup separation is expected, 2) aim for sample sizes of N=20 to N=30 per expected subgroup, 3) use multidimensional scaling to improve cluster separation, and 4) use fuzzy clustering or finite mixture modelling, which are more powerful and more parsimonious than discrete clustering when multivariate normal distributions partially overlap.
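The simulation-based power estimation described above can be sketched in a few lines. This is a minimal illustrative version, not the paper's exact protocol: it generates two spherical multivariate normal subgroups whose centroids are Δ standard deviations apart, clusters them with k-means, and counts a simulation as a "hit" when the true subgroups are recovered. The function name, the use of the adjusted Rand index, and the 0.9 recovery threshold are assumptions chosen for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def simulate_power(n_per_group=20, n_features=2, delta=4.0,
                   n_sims=200, ari_threshold=0.9, seed=0):
    """Estimate 'power' as the proportion of simulated datasets in which
    k-means recovers the true subgroups (ARI >= ari_threshold).
    The ARI threshold is an illustrative assumption, not from the paper."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        # Two spherical multivariate normal subgroups, centroids separated
        # by delta standard deviations along the first feature.
        a = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
        b = rng.normal(0.0, 1.0, size=(n_per_group, n_features))
        b[:, 0] += delta
        X = np.vstack([a, b])
        truth = np.repeat([0, 1], n_per_group)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=int(rng.integers(1 << 31))).fit_predict(X)
        if adjusted_rand_score(truth, labels) >= ari_threshold:
            hits += 1
    return hits / n_sims

# With the abstract's "sufficient power" scenario (N=20 per subgroup, Δ=4),
# recovery rates are high; with heavily overlapping groups (e.g. Δ=1) they
# drop sharply, mirroring the recommendation to cluster only when large
# subgroup separation is expected.
power = simulate_power(n_per_group=20, delta=4.0)
```

Extending this sketch with more subgroups, non-spherical covariance structures, dimensionality reduction steps, or c-means/mixture-model alternatives follows the same pattern: vary the generating parameters, rerun the pipeline, and tabulate recovery rates.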