Replica analysis of overfitting in generalized linear models

Paper title

Replica analysis of overfitting in generalized linear models

Authors

Coolen, ACC, Sheikh, M, Mozeika, A, Aguirre-Lopez, F, Antenucci, F

Abstract

Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) or maximum a posteriori probability (MAP) are unreliable if $p=O(N)$, due to overfitting. For many disciplines with increasingly high-dimensional data, this limitation has become a serious bottleneck. We recently showed that in Cox regression for time-to-event data the overfitting errors are not just noise but take mostly the form of a bias, and that with the replica method from statistical physics one can model and predict this bias and the noise statistics. Here we extend our approach to arbitrary generalized linear regression models (GLM), with possibly correlated covariates. We analyse overfitting in ML/MAP inference without having to specify data types or regression models, relying only on the GLM form, and derive generic order parameter equations for the case of $L2$ priors. Second, we derive the probabilistic relationship between true and inferred regression coefficients in GLMs, and show that, for the relevant hyperparameter scaling and correlated covariates, the $L2$ regularization causes a predictable direction change of the coefficient vector. Our results, illustrated by application to linear, logistic, and Cox regression, enable one to correct ML and MAP inferences in GLMs systematically for overfitting bias, and thus extend their applicability into the hitherto forbidden regime $p=O(N)$.
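
The claim that overfitting in the regime $p=O(N)$ appears as a systematic bias rather than pure noise can be illustrated numerically in the simplest GLM, linear regression. The sketch below is not the replica calculation of the paper; it is a minimal simulation built on the standard result that the ML noise-variance estimate has expectation $\sigma^2(1-p/N)$. The function name `ml_linear_fit`, the Gaussian covariates, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def ml_linear_fit(N, p, sigma=1.0, seed=0):
    """Simulate y = X @ beta + noise and fit it by maximum likelihood.

    Hypothetical helper for illustration only (not from the paper).
    Returns the true coefficients, the ML coefficients, and the ML
    estimate of the noise variance (residual sum of squares over N).
    """
    rng = np.random.default_rng(seed)
    # Uncorrelated Gaussian covariates, scaled so X @ beta stays O(1).
    X = rng.standard_normal((N, p)) / np.sqrt(p)
    beta_true = rng.standard_normal(p)
    y = X @ beta_true + sigma * rng.standard_normal(N)
    # For Gaussian noise, the ML estimate of beta is the least-squares solution.
    beta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ml
    sigma2_ml = float(resid @ resid) / N
    return beta_true, beta_ml, sigma2_ml


if __name__ == "__main__":
    N, sigma = 400, 1.0
    for p in (20, 100, 200):
        _, _, s2 = ml_linear_fit(N, p, sigma)
        expected = sigma**2 * (1 - p / N)
        print(f"p/N={p / N:.2f}  sigma2_ml={s2:.3f}  sigma^2(1-p/N)={expected:.3f}")
```

As the ratio $p/N$ grows, the ML variance estimate shrinks systematically below the true value, which is the kind of predictable bias that, under this assumption-laden toy setting, can be corrected for once it is known.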
