Paper Title
Model-Free Algorithm and Regret Analysis for MDPs with Long-Term Constraints
Paper Authors
Paper Abstract
In the optimization of dynamical systems, the variables typically have constraints. Such problems can be modeled as a constrained Markov Decision Process (CMDP). This paper considers a model-free approach to the problem, where the transition probabilities are not known. In the presence of long-term (or average) constraints, the agent has to choose a policy that maximizes the long-term average reward while satisfying the average constraints in each episode. The key challenge with long-term constraints is that the optimal policy is not deterministic in general, and thus standard Q-learning approaches cannot be used directly. This paper uses concepts from constrained optimization and Q-learning to propose an algorithm for CMDPs with long-term constraints. For any $\gamma \in (0, \frac{1}{2})$, the proposed algorithm is shown to achieve an $O(T^{1/2+\gamma})$ regret bound for the obtained reward and an $O(T^{1-\gamma/2})$ regret bound for the constraint violation, where $T$ is the total number of steps. We note that these are the first results on regret analysis for MDPs with long-term constraints where the transition probabilities are not known a priori.
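To make the combination of Q-learning and constrained optimization concrete, here is a minimal illustrative sketch of a primal-dual (Lagrangian) Q-learning loop on a toy tabular CMDP. This is not the paper's algorithm and carries no regret guarantee; it only shows the general idea of folding a cost constraint into the reward via a multiplier and updating that multiplier with dual ascent. All names (BUDGET, the toy dynamics, step sizes) are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's algorithm): primal-dual Q-learning on a
# toy CMDP. The agent maximizes reward while keeping the average cost below a
# budget by learning Q-values for the Lagrangian reward r - lam * c and
# updating the multiplier lam with a projected dual ascent step.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 2
BUDGET = 0.3            # target upper bound on the long-term average cost
ALPHA, ETA = 0.1, 0.01  # Q-learning and dual step sizes
EPS, GAMMA = 0.1, 0.95  # exploration rate and discount factor

# Random toy dynamics: transition kernel P[s, a], reward R and cost C tables.
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))
R = rng.uniform(0, 1, size=(N_STATES, N_ACTIONS))
C = rng.uniform(0, 1, size=(N_STATES, N_ACTIONS))

Q = np.zeros((N_STATES, N_ACTIONS))
lam = 0.0               # Lagrange multiplier for the cost constraint
s = 0

for t in range(1, 50_001):
    # epsilon-greedy action selection on the Lagrangian Q-values
    a = rng.integers(N_ACTIONS) if rng.random() < EPS else int(np.argmax(Q[s]))
    s_next = rng.choice(N_STATES, p=P[s, a])
    r, c = R[s, a], C[s, a]

    # primal step: standard Q-learning update on the Lagrangian reward r - lam * c
    target = (r - lam * c) + GAMMA * np.max(Q[s_next])
    Q[s, a] += ALPHA * (target - Q[s, a])

    # dual step: increase lam when the observed cost exceeds the budget,
    # projected back onto lam >= 0
    lam = max(0.0, lam + ETA * (c - BUDGET))

    s = s_next

print(f"final multiplier lam = {lam:.3f}")
```

A randomized (non-deterministic) optimal policy, as required in general under long-term constraints, would additionally need a mechanism beyond greedy action selection, e.g. mixing over actions, which this sketch omits for brevity.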