Generalization Bounds in the Presence of Outliers: a Median-of-Means Study

Title

Generalization Bounds in the Presence of Outliers: a Median-of-Means Study

Authors

Pierre Laforgue, Guillaume Staerman, Stephan Clémençon

Abstract

In contrast to the empirical mean, the Median-of-Means (MoM) is an estimator of the mean $θ$ of a square-integrable random variable $Z$ around which accurate nonasymptotic confidence bounds can be built, even when $Z$ does not exhibit sub-Gaussian tail behavior. Thanks to the high confidence it achieves on heavy-tailed data, MoM has found various applications in machine learning, where it is used to design training procedures that are not sensitive to atypical observations. More recently, a new line of work has sought to characterize and leverage MoM's ability to deal with corrupted data. In this context, the present work proposes a general study of MoM's concentration properties under the contamination regime, which provides a clear understanding of the impact of the outlier proportion and of the number of blocks chosen. The analysis is extended to (multisample) $U$-statistics, i.e. averages over tuples of observations, which raise additional challenges due to the dependence induced. Finally, we show that the latter bounds can be used in a straightforward fashion to derive generalization guarantees for pairwise learning in a contaminated setting, and propose an algorithm to compute provably reliable decision functions.
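
For readers unfamiliar with the estimator, the basic Median-of-Means construction mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the algorithm proposed in the paper: the function name `median_of_means`, the parameters `n_blocks` and `seed`, and the choice of a random shuffle to form the blocks are all assumptions made for this example. The sample is split into disjoint blocks, each block is averaged, and the median of the block means is returned.

```python
import random
import statistics


def median_of_means(sample, n_blocks, seed=0):
    """Median-of-Means estimate of the mean of `sample` (illustrative sketch).

    The sample is shuffled, split into `n_blocks` disjoint blocks of
    nearly equal size, and the median of the block means is returned.
    """
    if not 1 <= n_blocks <= len(sample):
        raise ValueError("n_blocks must be between 1 and len(sample)")
    shuffled = list(sample)
    random.Random(seed).shuffle(shuffled)
    # Block k takes every n_blocks-th point, starting at index k.
    blocks = [shuffled[k::n_blocks] for k in range(n_blocks)]
    return statistics.median(statistics.fmean(block) for block in blocks)


if __name__ == "__main__":
    rng = random.Random(1)
    clean = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # true mean is 0
    outliers = [1000.0] * 10                            # hypothetical corrupted points
    data = clean + outliers
    print("empirical mean: ", statistics.fmean(data))
    print("median of means:", median_of_means(data, n_blocks=31))
```

In this toy setup, 10 outliers can contaminate at most 10 of the 31 blocks, so the median is necessarily taken over uncontaminated block means, while the empirical mean is pulled far from 0. This mirrors the trade-off between the outlier proportion and the number of blocks that the abstract refers to.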
