Paper Title
Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost
Paper Authors
Paper Abstract
We study the problem of reinforcement learning (RL) with low (policy) switching cost, a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S$ and $A$ denote the numbers of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of $O(HSA)$. Furthermore, we prove a pair of information-theoretic lower bounds which show that (1) any no-regret algorithm must have a switching cost of $\Omega(HSA)$; and (2) any $\widetilde{O}(\sqrt{T})$ regret algorithm must incur a switching cost of $\Omega(HSA\log\log T)$. Both our algorithms are thus optimal in their switching costs.