Dynamic Batch Learning in High-Dimensional Sparse Linear Contextual Bandits

Title


Dynamic Batch Learning in High-Dimensional Sparse Linear Contextual Bandits

Authors

Ren, Zhimei; Zhou, Zhengyuan

Abstract


We study the problem of dynamic batch learning in high-dimensional sparse linear contextual bandits, where a decision maker, under a given maximum-number-of-batches constraint and only able to observe rewards at the end of each batch, can dynamically decide how many individuals to include in the next batch (at the end of the current batch) and what personalized action-selection scheme to adopt within each batch. Such batch constraints are ubiquitous in a variety of practical contexts, including personalized product offerings in marketing and medical treatment selection in clinical trials. We characterize the fundamental learning limit in this problem via a regret lower bound and provide a matching upper bound (up to log factors), thus prescribing an optimal scheme for this problem. To the best of our knowledge, our work provides the first inroad into a theoretical understanding of dynamic batch learning in high-dimensional sparse linear contextual bandits. Notably, even a special case of our result (when no batch constraint is present) shows that the simple exploration-free algorithm using the LASSO estimator already achieves the minimax optimal regret bound for standard online learning in high-dimensional linear contextual bandits (for the no-margin case), a result that appears to be unknown in the emerging literature on high-dimensional contextual bandits.
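The special case mentioned above, a greedy (exploration-free) policy that refits a LASSO estimate only when a batch ends, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: the Gaussian context model, the batch grid, the regularization level lam proportional to sigma * sqrt(2 log d / n), and the helper names `lasso_cd` and `batched_greedy_lasso` are all hypothetical choices made for illustration, and the LASSO is solved with plain cyclic coordinate descent. Regret is tracked as the usual cumulative gap between the best arm's mean reward and the chosen arm's mean reward.

```python
import math
import random


def soft_threshold(v, lam):
    """Soft-thresholding operator S(v, lam) = sign(v) * max(|v| - lam, 0)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Hypothetical helper: cyclic coordinate descent for
    (1 / 2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    resid = list(y)  # invariant: resid = y - X beta
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes coordinate j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def batched_greedy_lasso(theta_true, grid, n_arms, noise_sd, lam_scale=1.0, seed=0):
    """Hypothetical simulation of a greedy LASSO policy under a batch grid.

    grid: increasing batch end times t_1 < ... < t_M = T (an assumed input).
    Rewards collected in a batch are revealed only when that batch ends.
    Returns (final LASSO estimate, cumulative regret).
    """
    rng = random.Random(seed)
    d = len(theta_true)
    theta_hat = [0.0] * d
    X_hist, y_hist = [], []
    regret, t = 0.0, 0
    for t_end in grid:
        batch_X, batch_y = [], []
        while t < t_end:
            # Assumed context model: i.i.d. standard Gaussian features per arm.
            contexts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_arms)]
            # Greedy choice under the estimate frozen for this batch; no forced exploration.
            k = max(range(n_arms), key=lambda a: dot(contexts[a], theta_hat))
            means = [dot(c, theta_true) for c in contexts]
            regret += max(means) - means[k]
            batch_X.append(contexts[k])
            batch_y.append(means[k] + rng.gauss(0.0, noise_sd))
            t += 1
        # End of batch: every reward from this batch becomes observable at once.
        X_hist.extend(batch_X)
        y_hist.extend(batch_y)
        n = len(y_hist)
        if n == 0:
            continue
        # Assumed tuning: lam ~ sigma * sqrt(2 log d / n).
        lam = lam_scale * noise_sd * math.sqrt(2.0 * math.log(d) / n)
        theta_hat = lasso_cd(X_hist, y_hist, lam)
    return theta_hat, regret
```

For example, `batched_greedy_lasso([1.0, -1.0] + [0.0] * 18, [10, 50, 200, 400], n_arms=3, noise_sd=0.1, seed=1)` returns the final estimate and the cumulative regret over T = 400 rounds in four batches. In the first batch every arm scores zero under the all-zero estimate, so the choice is effectively random; afterwards the randomness of the contexts is what keeps the design informative. Choosing the grid itself (how many individuals go into each batch) is the design question the paper studies, and this sketch does not attempt to reproduce its optimal schedule.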
