Conformal Off-Policy Prediction in Contextual Bandits

Paper title

Conformal Off-Policy Prediction in Contextual Bandits

Authors

Muhammad Faaiz Taufiq, Jean-Francois Ton, Rob Cornish, Yee Whye Teh, Arnaud Doucet

Abstract

Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose conformal off-policy prediction (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.
