Title
Why resampling outperforms reweighting for correcting sampling bias with stochastic gradients
Authors
Abstract
A data set sampled from a certain population is biased if subgroups of the population are sampled at proportions that differ significantly from their underlying proportions. Training machine learning models on biased data sets requires correction techniques to compensate for the bias. We consider two commonly used techniques, resampling and reweighting, which rebalance the proportions of the subgroups to maintain the desired objective function. Though the two are statistically equivalent, resampling has been observed to outperform reweighting when combined with stochastic gradient algorithms. By analyzing illustrative examples, we explain the reason behind this phenomenon using tools from dynamical stability and stochastic asymptotics. We also present experiments on regression, classification, and off-policy prediction to demonstrate that this is a general phenomenon. We argue that the objective function design and the optimization algorithm must be considered together when addressing sampling bias.
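As a rough illustration of the two corrections the abstract contrasts (a toy sketch, not code from the paper; the subgroup proportions, weights, and batch size below are invented for illustration), consider estimating a population mean from a sample in which one subgroup is oversampled. Reweighting keeps uniform minibatches and multiplies each per-sample gradient by an importance weight; resampling draws minibatches proportionally to those weights and uses plain gradients. The two stochastic gradients have the same expectation but different variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the population is 50/50 two subgroups, but the
# biased sample contains 90% of group A and 10% of group B.
n = 1000
x_a = rng.normal(0.0, 1.0, size=int(0.9 * n))   # group A, oversampled
x_b = rng.normal(5.0, 1.0, size=int(0.1 * n))   # group B, undersampled
x = np.concatenate([x_a, x_b])

# Per-sample weights that rebalance the subgroups back to 50/50.
w = np.concatenate([np.full(x_a.size, 0.5 / 0.9),
                    np.full(x_b.size, 0.5 / 0.1)])

theta = 0.0   # parameter of the toy objective f(theta) = E[(theta - x)^2] / 2
batch = 32

def grad_reweight():
    """Uniform minibatch, importance-weighted gradient."""
    idx = rng.integers(0, n, size=batch)
    return np.mean(w[idx] * (theta - x[idx]))

def grad_resample():
    """Minibatch drawn proportionally to the weights, plain gradient."""
    idx = rng.choice(n, size=batch, p=w / w.sum())
    return np.mean(theta - x[idx])

# Both estimators target the same expected gradient (statistical equivalence)...
g_rw = np.mean([grad_reweight() for _ in range(20000)])
g_rs = np.mean([grad_resample() for _ in range(20000)])

# ...but their per-step variances differ; in this example the reweighted
# gradient is far noisier, which is the kind of gap the paper's
# dynamical-stability analysis explains.
v_rw = np.var([grad_reweight() for _ in range(20000)])
v_rs = np.var([grad_resample() for _ in range(20000)])
print(g_rw, g_rs, v_rw, v_rs)
```

Here both estimates land near the balanced-population gradient of about -2.5, while the variance of the reweighted gradient is several times larger, since the rare subgroup enters with a large weight only occasionally.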