Paper Title
We Need to Talk About Random Splits
Paper Authors
Paper Abstract
Gorman and Bedrick (2019) argued for using random splits rather than standard splits in NLP experiments. We argue that random splits, like standard splits, lead to overly optimistic performance estimates. We can also split data in biased or adversarial ways, e.g., training on short sentences and evaluating on long ones. Biased sampling has been used in domain adaptation to simulate real-world drift; this is known as the covariate shift assumption. In NLP, however, even worst-case splits, maximizing bias, often under-estimate the error observed on new samples of in-domain data, i.e., the data that models should minimally generalize to at test time. This invalidates the covariate shift assumption. Instead of using multiple random splits, future benchmarks should ideally include multiple, independent test sets; if that is infeasible, we argue that multiple biased splits lead to more realistic performance estimates than multiple random splits.
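The biased split mentioned in the abstract (training on short sentences, evaluating on long ones) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the representation of sentences as token lists are assumptions for the example.

```python
# Illustrative sketch of a length-based "biased split": train on the
# shortest sentences, evaluate on the longest ones.
def biased_split_by_length(sentences, train_fraction=0.8):
    """Sort sentences by length, then cut: short ones become the
    training set, long ones the test set."""
    ordered = sorted(sentences, key=len)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]  # (train, test)

# Tiny usage example with tokenized sentences of increasing length.
data = [["a"], ["a", "b", "c"], ["a", "b"], ["a", "b", "c", "d"]]
train, test = biased_split_by_length(data, train_fraction=0.5)
# train holds the two shortest sentences, test the two longest.
```

A random split would instead shuffle `data` before cutting, so train and test follow the same length distribution; the biased split deliberately breaks that assumption to simulate drift.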