Paper Title
Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings
Paper Authors
Paper Abstract
We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs) whose state and observation spaces are large or even continuous. In particular, we consider Hilbert space embeddings of POMDPs, in which the features of the latent states and of the observations admit a conditional Hilbert space embedding of the observation emission process, and the latent state transition is deterministic. Under a function approximation setup where the optimal latent state-action $Q$-function is linear in the state feature and the optimal $Q$-function has a gap in actions, we provide a \emph{computationally and statistically efficient} algorithm for finding the \emph{exact optimal} policy. We show that our algorithm's computational and statistical complexities scale polynomially with the horizon and the intrinsic dimension of the feature on the observation space. Furthermore, we show that both the deterministic latent transition and the gap assumptions are necessary to avoid statistical complexity that is exponential in the horizon or the dimension. Since our guarantee has no explicit dependence on the size of the state and observation spaces, our algorithm provably scales to large-scale POMDPs.
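To make the structural assumptions in the abstract concrete, the display below is a minimal sketch in our own notation (the symbols $\phi$, $\psi$, $M_h$, $T_h$, $\theta^{\star}_{h,a}$, and $\Delta$ are illustrative and not taken from the paper): for each step $h$, latent state $s$, and action $a$,
\[
\mathbb{E}\left[\psi(o_h)\mid s_h=s\right] \;=\; M_h\,\phi(s) \qquad \text{(conditional Hilbert space embedding of the emission)},
\]
\[
s_{h+1} \;=\; T_h(s_h, a_h) \qquad \text{(deterministic latent transition)},
\]
\[
Q^{\star}_h(s,a) \;=\; \big\langle \theta^{\star}_{h,a},\, \phi(s)\big\rangle,
\qquad
Q^{\star}_h\big(s,\pi^{\star}_h(s)\big) \;-\; \max_{a \neq \pi^{\star}_h(s)} Q^{\star}_h(s,a) \;\ge\; \Delta
\qquad \text{(linear optimal $Q$ with an action gap)}.
\]
Under these conditions, the stated guarantees depend on the horizon and the intrinsic dimension of the observation feature $\psi$, rather than on the cardinality of the state or observation space.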