Paper title
Multiple multi-sample testing under arbitrary covariance dependency
Paper authors
Paper abstract
Modern high-throughput biomedical devices routinely produce data on a large scale, and the analysis of high-dimensional datasets has become commonplace in biomedical studies. However, given thousands or tens of thousands of measured variables in these datasets, extracting meaningful features poses a challenge. In this article, we propose a procedure to evaluate the strength of the associations between a nominal (categorical) response variable and multiple features simultaneously. Specifically, we propose a framework of large-scale multiple testing under arbitrary correlation dependency among test statistics. First, marginal multinomial regressions are performed for each feature individually. Second, we use an approach of multiple marginal models for each baseline-category pair to establish asymptotic joint normality of the stacked vector of the marginal multinomial regression coefficients. Third, we estimate the (limiting) covariance matrix between the estimated coefficients from all marginal models. Finally, our approach approximates the realized false discovery proportion of a thresholding procedure for the marginal p-values, for each baseline-category pair. The proposed approach offers a sensible trade-off between the expected numbers of true and false rejections. Furthermore, we demonstrate a practical application of the method on hyperspectral imaging data. This dataset is obtained by a matrix-assisted laser desorption/ionization (MALDI) instrument. MALDI demonstrates tremendous potential for clinical diagnosis, particularly for cancer research. In our application, the nominal response categories represent cancer subtypes.
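The quantity targeted in the final step, the realized false discovery proportion (FDP) of a threshold applied to marginal p-values, can be illustrated with a small simulation. The sketch below is not the estimator proposed in the paper: it assumes a hypothetical one-factor (equicorrelated) model for the test statistics and uses the known ground truth, which is unavailable in practice, to compute the FDP directly. All function names and parameter values are illustrative assumptions.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a test statistic that is N(0, 1) under the null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def realized_fdp(p_values, is_null, t):
    """Realized FDP when every p-value <= t is rejected.

    is_null[i] is True when feature i truly has no association
    (known only in a simulation, never in a real analysis).
    """
    rejected = [p <= t for p in p_values]
    n_rejected = sum(rejected)
    n_false = sum(1 for rej, null in zip(rejected, is_null) if rej and null)
    return n_false / max(n_rejected, 1)


def simulate_z(n_features, n_signal, effect, rho, rng):
    """Draw correlated z-statistics from an assumed one-factor model.

    Each statistic has unit variance and pairwise correlation rho;
    the first n_signal features have mean `effect`, the rest mean 0.
    """
    shared = rng.gauss(0.0, 1.0)  # common factor inducing the dependence
    z = []
    for i in range(n_features):
        mean = effect if i < n_signal else 0.0
        noise = rng.gauss(0.0, 1.0)
        z.append(mean + math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise)
    return z


if __name__ == "__main__":
    rng = random.Random(0)
    n_features, n_signal = 1000, 50
    is_null = [i >= n_signal for i in range(n_features)]
    for rep in range(5):
        z = simulate_z(n_features, n_signal, effect=3.0, rho=0.5, rng=rng)
        p = [two_sided_p(v) for v in z]
        print(rep, round(realized_fdp(p, is_null, t=0.01), 3))
```

Under strong correlation the realized FDP in such a simulation tends to vary considerably from one replication to the next, which is the motivation for approximating the realized proportion rather than only controlling its expectation.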