Set-based value operators for non-stationary Markovian environments

Paper title

Set-based value operators for non-stationary Markovian environments

Authors

Sarah H. Q. Li, Assalé Adjé, Pierre-Loïc Garoche, Behçet Açıkmeşe

Abstract

This paper analyzes finite-state Markov decision processes (MDPs) with uncertain parameters in compact sets and re-examines results from the robust MDP literature via set-based fixed point theory. To this end, we generalize the Bellman and policy evaluation operators to contracting operators on the value function space and call them value operators. We lift these value operators to act on sets of value functions and call the results set-based value operators. We prove that the set-based value operators are contractions in the space of compact value function sets. Leveraging insights from set theory, we generalize the rectangularity condition of the classic robust MDP literature to a containment condition for all value operators; this condition is weaker and applies to a larger class of parameter-uncertain MDPs and contracting operators in dynamic programming. We prove that the rectangularity condition and the containment condition are both sufficient to ensure that the fixed point set of the set-based value operator contains its own extremal elements. For convex and compact sets of uncertain MDP parameters, we show equivalence between the classic robust value function and the supremum of the fixed point set of the set-based Bellman operator. Under MDP parameters that change dynamically within compact sets, value iteration may fail to converge to a single value function; we prove a set convergence result for this case. Finally, we derive novel guarantees for probabilistic path-planning problems in planetary exploration and stratospheric station-keeping.
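The set convergence idea in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's algorithm. It assumes a hypothetical 2-state, 2-action MDP with reward maximization, a discount factor of 0.9, and a finite set of two transition kernels standing in for the compact parameter set. All names (`KERNELS`, `bellman`, `lower_op`, `upper_op`) are made up for this illustration. It runs value iteration while the kernel changes at random on every step, and checks that the late iterates lie between the fixed points of the elementwise-minimum and elementwise-maximum Bellman operators, which follows from monotonicity of the Bellman operator.

```python
import random

GAMMA = 0.9  # discount factor (assumed for this sketch)

# Hypothetical MDP: REWARD[s][a] is fixed, the transition kernel is uncertain.
# KERNELS[k][s][a] is a probability distribution over next states.
REWARD = [[1.0, 0.0], [0.0, 2.0]]
KERNELS = [
    [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]],
    [[[0.6, 0.4], [0.4, 0.6]], [[0.8, 0.2], [0.3, 0.7]]],
]


def bellman(kernel, v):
    """Bellman operator (reward maximization) for one fixed kernel."""
    return [
        max(
            REWARD[s][a] + GAMMA * sum(p * vn for p, vn in zip(kernel[s][a], v))
            for a in range(len(REWARD[s]))
        )
        for s in range(len(v))
    ]


def lower_op(v):
    """Elementwise minimum of the Bellman operator over all kernels."""
    return [min(col) for col in zip(*(bellman(k, v) for k in KERNELS))]


def upper_op(v):
    """Elementwise maximum of the Bellman operator over all kernels."""
    return [max(col) for col in zip(*(bellman(k, v) for k in KERNELS))]


def fixed_point(op, v, tol=1e-12, max_iters=10_000):
    """Iterate a contraction until successive iterates differ by less than tol."""
    for _ in range(max_iters):
        nxt = op(v)
        if max(abs(x - y) for x, y in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v


def nonstationary_value_iteration(v, steps, seed=0):
    """Value iteration where a kernel is drawn at random on every step."""
    rng = random.Random(seed)
    trajectory = []
    for _ in range(steps):
        v = bellman(rng.choice(KERNELS), v)
        trajectory.append(v)
    return trajectory


v_lo = fixed_point(lower_op, [0.0, 0.0])
v_hi = fixed_point(upper_op, [0.0, 0.0])
trajectory = nonstationary_value_iteration([0.0, 0.0], steps=300)

# After a burn-in, every iterate should sit inside [v_lo, v_hi] elementwise.
inside = all(
    lo - 1e-6 <= x <= hi + 1e-6
    for v in trajectory[200:]
    for lo, x, hi in zip(v_lo, v, v_hi)
)
print("lower bound:", v_lo)
print("upper bound:", v_hi)
print("late iterates stay inside the bounds:", inside)
```

The iterates need not settle on one value function, because the kernel keeps changing, but they remain trapped in the box spanned by the two fixed points. This box is only an elementwise outer bound chosen for the sketch; the paper's fixed point set of the set-based operator is a more refined object.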
