Interactively Learning Preference Constraints in Linear Bandits

Paper title

Interactively Learning Preference Constraints in Linear Bandits

Authors

David Lindner, Sebastian Tschiatschek, Katja Hofmann, Andreas Krause

Abstract

We study sequential decision-making with known rewards and unknown constraints, motivated by situations where the constraints represent expensive-to-evaluate human preferences, such as safe and comfortable driving behavior. We formalize the challenge of interactively learning about these constraints as a novel linear bandit problem, which we call constrained linear best-arm identification. To solve this problem, we propose the Adaptive Constraint Learning (ACOL) algorithm. We provide an instance-dependent lower bound for constrained linear best-arm identification and show that ACOL's sample complexity matches the lower bound in the worst case. In the average case, ACOL's sample complexity bound is still significantly tighter than the bounds of simpler approaches. In synthetic experiments, ACOL performs on par with an oracle solution and outperforms a range of baselines. As an application, we consider learning constraints to represent human preferences in a driving simulation. ACOL is significantly more sample efficient than alternatives for this application. Further, we find that learning preferences as constraints is more robust to changes in the driving scenario than encoding the preferences directly in the reward function.
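
To make the problem setting concrete, here is a minimal sketch of a naive uniform-sampling baseline for constrained linear best-arm identification. It is not ACOL, and it is not taken from the paper. The function name `naive_constrained_best_arm`, the feasibility convention (an arm `x` is feasible when `phi @ x <= 0`), the Gaussian noise model, and the toy arms are all assumptions made for illustration.

```python
import numpy as np


def naive_constrained_best_arm(arms, theta, query_constraint, rounds_per_arm=50):
    """Uniform-sampling baseline (NOT ACOL, hypothetical helper).

    arms: (K, d) array of arm feature vectors.
    theta: (d,) known reward parameter.
    query_constraint: callable returning a noisy constraint value for an arm.
    Returns (index of best estimated-feasible arm or None, estimate of phi).
    """
    arms = np.asarray(arms, dtype=float)
    # Query every arm the same number of times (no adaptivity).
    X = np.repeat(arms, rounds_per_arm, axis=0)
    y = np.array([query_constraint(x) for x in X])
    # Least-squares estimate of the unknown constraint parameter phi.
    phi_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Assumed convention: an arm is feasible when its constraint value is <= 0.
    feasible = arms @ phi_hat <= 0.0
    if not feasible.any():
        return None, phi_hat
    # Maximize the known reward over the arms estimated to be feasible.
    rewards = np.where(feasible, arms @ theta, -np.inf)
    return int(np.argmax(rewards)), phi_hat


# Toy instance (assumed): the highest-reward arm violates the hidden constraint.
rng = np.random.default_rng(0)
arms = np.array([[2.0, 0.0], [0.0, 1.0], [0.5, 1.0]])
theta = np.array([1.0, 1.0])      # known reward parameter
phi_true = np.array([1.0, -1.0])  # hidden constraint parameter


def noisy_oracle(x):
    # Stand-in for an expensive human evaluation of the constraint.
    return float(x @ phi_true + 0.1 * rng.normal())


best, phi_hat = naive_constrained_best_arm(arms, theta, noisy_oracle)
print(best, phi_hat)
```

In this toy instance the first arm has the highest reward but violates the constraint, so the baseline should select the third arm. An adaptive method such as ACOL aims to reach the same decision with fewer constraint queries than this uniform scheme.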
