Paper title
A Policy Gradient Algorithm for the Risk-Sensitive Exponential Cost MDP
Paper authors
Paper abstract
We study the risk-sensitive exponential cost MDP formulation and develop a trajectory-based gradient algorithm to find a stationary point of the cost associated with a set of parameterized policies. We derive a formula that can be used to compute the policy gradient from (state, action, cost) information collected from sample paths of the MDP for each fixed parameterized policy. Unlike the traditional average-cost problem, standard stochastic approximation theory cannot be used to exploit this formula. To address this issue, we introduce a truncated and smooth version of the risk-sensitive cost and show that this new cost criterion can be used to approximate the risk-sensitive cost and its gradient uniformly under some mild assumptions. We then develop a trajectory-based gradient algorithm to minimize the smooth truncated estimate of the risk-sensitive cost and derive conditions under which a sequence of truncations can be used to solve the original, untruncated cost problem.
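The abstract does not spell out the cost criterion, so the sketch below is only an illustration under assumptions. It takes the definition common in the risk-sensitive control literature, the long-run growth rate (1/T) log E[exp(alpha * sum of per-step costs)], and evaluates it for one fixed policy on a small finite chain. The function name `risk_sensitive_cost`, the risk parameter `alpha`, and the toy transition matrix are hypothetical and do not come from the paper. This is not the paper's trajectory-based gradient algorithm; it only shows the quantity such an algorithm would try to minimize.

```python
import math


def risk_sensitive_cost(P, cost, alpha, iters=2000):
    """Growth rate of E[exp(alpha * accumulated cost)] under a fixed policy.

    P[i][j] is the probability of moving from state i to state j under the
    policy, and cost[i] is the per-step cost paid in state i (assumed setup).
    The expectation over T steps equals (M^T 1)_i for the "twisted" matrix
    M[i][j] = exp(alpha * cost[i]) * P[i][j], so the growth rate is the log of
    M's dominant eigenvalue, found here by power iteration. This assumes the
    chain is irreducible and aperiodic.
    """
    n = len(P)
    M = [[math.exp(alpha * cost[i]) * P[i][j] for j in range(n)]
         for i in range(n)]
    v = [1.0] * n
    log_growth = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(w)              # converges to the dominant eigenvalue
        v = [x / norm for x in w]  # renormalize to avoid overflow
        log_growth = math.log(norm)
    return log_growth


# Toy two-state chain; its stationary distribution is (5/6, 1/6).
P = [[0.9, 0.1],
     [0.5, 0.5]]
cost = [1.0, 3.0]
average_cost = (5.0 / 6.0) * cost[0] + (1.0 / 6.0) * cost[1]

# For a tiny alpha the criterion, divided by alpha, approaches the average cost.
near_neutral = risk_sensitive_cost(P, cost, 1e-4) / 1e-4
# For alpha = 1 the exponential penalizes cost variability (Jensen's inequality).
risk_averse = risk_sensitive_cost(P, cost, 1.0)
```

With a constant cost c the value is exactly alpha * c, and as alpha shrinks the criterion divided by alpha recovers the ordinary average cost, which is the sense in which the exponential criterion generalizes the average-cost problem mentioned in the abstract.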