了解连续决策过程的随机动力学：多臂匪徒的路径综合分析

论文标题

了解连续决策过程的随机动力学：多臂匪徒的路径综合分析

Understanding the stochastic dynamics of sequential decision-making processes: A path-integral analysis of multi-armed bandits

论文作者

Li, Bo, Yeung, Chi Ho

论文摘要

多臂强盗（MAB）模型是在不确定环境中研究决策的最古典模型之一。在此型号中，玩家选择了$ k $的一个可能的匪徒手臂之一，在每个时间步骤中播放，相应的臂向玩家返回随机奖励，可能来自特定的未知分布。玩家的目标是在此过程中收集尽可能多的奖励。尽管它很简单，但MAB模型还是为研究探索与开发与设计有效算法之间的折衷的绝佳操场，以在不确定性下进行顺序决策。尽管已经建立了许多在MAB模型的随机动力学的有限时间行为，但由于决策和收集的奖励之间的相互关系，分析的随机动力学的有限时间行为似乎更具挑战性。在本文中，我们采用统计物理学中的技术来分析MAB模型，该模型促进了在有限的短时间内累积后悔分布的表征，对mAb算法的中心含量以及模型的复杂动力学行为。我们的分析结果与模拟很好地吻合，表明出现了有趣的多模式分布，由于最初的最初不幸输出的最初不幸输出而导致了极端武器的过量剥削，因此产生了巨大的遗憾。

The multi-armed bandit (MAB) model is one of the most classical models to study decision-making in an uncertain environment. In this model, a player chooses one of $K$ possible arms of a bandit machine to play at each time step, where the corresponding arm returns a random reward to the player, potentially from a specific unknown distribution. The target of the player is to collect as many rewards as possible during the process. Despite its simplicity, the MAB model offers an excellent playground for studying the trade-off between exploration versus exploitation and designing effective algorithms for sequential decision-making under uncertainty. Although many asymptotically optimal algorithms have been established, the finite-time behaviors of the stochastic dynamics of the MAB model appear much more challenging to analyze, due to the intertwine between the decision-making and the rewards being collected. In this paper, we employ techniques in statistical physics to analyze the MAB model, which facilitates the characterization of the distribution of cumulative regrets at a finite short time, the central quantity of interest in an MAB algorithm, as well as the intricate dynamical behaviors of the model. Our analytical results, in good agreement with simulations, point to the emergence of an interesting multimodal regret distribution, with large regrets resulting from excess exploitation of sub-optimal arms due to an initial unlucky output from the optimal one.

下载PDF全文

下载文献需遵守相关版权规定

论文标题