Paper title
Selecting Data Augmentation for Simulating Interventions
Paper authors
Paper abstract
Machine learning models trained with purely observational data and the principle of empirical risk minimization \citep{vapnik_principles_1992} can fail to generalize to unseen domains. In this paper, we focus on the case where this failure arises from spurious correlations between the observed domains and the actual task labels. We find that many domain generalization methods do not explicitly take this spurious correlation into account. Instead, especially in more application-oriented research areas such as medical imaging or robotics, heuristic data augmentation techniques are used to learn domain-invariant features. To bridge the gap between theory and practice, we develop a causal perspective on the problem of domain generalization. We argue that causal concepts can explain the success of data augmentation by describing how it can weaken the spurious correlation between the observed domains and the task labels. We demonstrate that data augmentation can serve as a tool for simulating interventional data. We use these theoretical insights to derive a simple algorithm that selects data augmentation techniques that lead to better domain generalization.