Evaluating Synthetic Tabular Data Generated To Augment Small Sample Datasets

Paper Title

Evaluating Synthetic Tabular Data Generated To Augment Small Sample Datasets

Author

Marin, Javier

Abstract

This work proposes a method to evaluate synthetic tabular data generated to augment small sample datasets. While data augmentation techniques can increase sample counts for machine learning applications, traditional validation approaches fail when applied to extremely limited sample sizes. Our experiments across four datasets reveal significant inconsistencies between global metrics and topological measures, with statistical tests producing unreliable significance values due to insufficient sample sizes. We demonstrate that common metrics like propensity scoring and MMD often suggest similarity where fundamental topological differences exist. Our proposed metric, based on a normalized bottleneck distance, provides complementary insights but suffers from high variability across experimental runs and occasionally yields values exceeding theoretical bounds, showing inherent instability in topological approaches for very small datasets. These findings highlight the critical need for multi-faceted evaluation methodologies when validating synthetic data generated from limited samples, as no single metric reliably captures both distributional and structural similarity.
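
For readers unfamiliar with MMD (maximum mean discrepancy), one of the global metrics named in the abstract, the following is a minimal sketch of a biased estimate of squared MMD between a real and a synthetic sample. The RBF kernel, the `gamma` bandwidth, the function names, and the toy Gaussian samples are all assumptions made for illustration; the abstract does not state which kernel or settings the paper uses.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel between two equal-length numeric rows."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(real, synth, gamma=0.5):
    """Biased (V-statistic) estimate of squared MMD between two samples.

    `gamma` is a hypothetical bandwidth choice, not a value from the paper.
    """
    def mean_kernel(a, b):
        total = sum(rbf_kernel(x, y, gamma) for x in a for y in b)
        return total / (len(a) * len(b))

    return (
        mean_kernel(real, real)
        + mean_kernel(synth, synth)
        - 2.0 * mean_kernel(real, synth)
    )


# Toy data: 20 rows with 2 numeric features each (illustrative only).
rng = random.Random(0)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
close = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
shifted = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(20)]

print("MMD^2 real vs. same-distribution sample:", mmd_squared(real, close))
print("MMD^2 real vs. shifted sample:", mmd_squared(real, shifted))
```

A value near zero indicates only that the kernel mean embeddings of the two samples are close. As the abstract notes, such a global score can suggest similarity even when the samples differ topologically, which is why it is paired with a structural measure.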
