Paper Title


Learning in Markov Decision Processes under Constraints

Paper Authors

Rahul Singh, Abhishek Gupta, Ness B. Shroff

Paper Abstract


We consider reinforcement learning (RL) in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step $t$, it earns a reward, and also incurs a cost vector consisting of $M$ costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time-steps, while simultaneously ensuring that the average values of the $M$ cost expenditures are bounded by agent-specified thresholds $c^{ub}_i, i=1,2,\ldots,M$. In order to measure the performance of a reinforcement learning algorithm that satisfies the average cost constraints, we define an $(M+1)$-dimensional regret vector that is composed of its reward regret and $M$ cost regrets. The reward regret measures the sub-optimality in the cumulative reward, while the $i$-th component of the cost regret vector is the difference between its $i$-th cumulative cost expense and the expected cost expenditure $Tc^{ub}_i$. We prove that the expected value of the regret vector of UCRL-CMDP, the proposed algorithm, is upper-bounded as $\tilde{O}\left(T^{2/3}\right)$, where $T$ is the time horizon. We further show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of the reward and the remaining costs. To the best of our knowledge, ours is the only work that considers non-episodic RL under average cost constraints and derives algorithms that can \emph{tune the regret vector} according to the agent's requirements on its cost regrets.
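For concreteness, the regret notions described in the abstract can be written out explicitly (a sketch using assumed notation not defined in the abstract: $r^\star$ for the optimal average reward of the constrained MDP, and $r_t$, $c_{i,t}$ for the reward and $i$-th cost incurred at step $t$): the reward regret after $T$ steps is $\Delta_0(T) = T\,r^\star - \sum_{t=1}^{T} r_t$, and the $i$-th cost regret is $\Delta_i(T) = \sum_{t=1}^{T} c_{i,t} - T\,c^{ub}_i$ for $i=1,\ldots,M$. The stated result then bounds $\mathbb{E}\left[\Delta_i(T)\right] = \tilde{O}\left(T^{2/3}\right)$ for every component $i = 0, 1, \ldots, M$.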
