Paper Title
Using Forwards-Backwards Models to Approximate MDP Homomorphisms
Paper Authors
Paper Abstract
Reinforcement learning agents must painstakingly learn through trial and error what sets of state-action pairs are value equivalent -- requiring an often prohibitively large amount of environment experience. MDP homomorphisms have been proposed that reduce the MDP of an environment to an abstract MDP, enabling better sample efficiency. Consequently, impressive improvements have been achieved when a suitable homomorphism can be constructed a priori -- usually by exploiting a practitioner's knowledge of environment symmetries. We propose a novel approach to constructing homomorphisms in discrete action spaces, which uses a learnt model of environment dynamics to infer which state-action pairs lead to the same state -- which can reduce the size of the state-action space by a factor as large as the cardinality of the original action space. In MinAtar, we report an almost 4x improvement over a value-based off-policy baseline in the low sample limit, when averaging over all games and optimizers.
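The central idea described in the abstract, using a learnt dynamics model to detect state-action pairs that lead to the same next state and collapsing them into a single abstract action, can be illustrated with a small sketch. The code below is a simplified, hypothetical illustration rather than the paper's actual forwards-backwards model: it assumes a deterministic learned forward model and a distance threshold, and the names (`group_equivalent_actions`, `toy_forward_model`, `tol`) are invented for this example.

```python
# Illustrative sketch only (not the paper's implementation): group discrete
# actions whose predicted next states coincide under a learned forward model,
# so equivalent (state, action) pairs can share one abstract action.
import numpy as np

def group_equivalent_actions(state, actions, forward_model, tol=1e-3):
    """Return lists of actions whose predicted next states match within tol.

    forward_model(state, action) -> predicted next-state vector (np.ndarray).
    """
    groups = []  # each entry: (representative predicted next state, [actions])
    for action in actions:
        pred = forward_model(state, action)
        for rep_pred, members in groups:
            if np.linalg.norm(pred - rep_pred) < tol:
                members.append(action)  # predicted next states coincide
                break
        else:
            groups.append((pred, [action]))  # start a new equivalence group
    return [members for _, members in groups]

# Toy deterministic "learned" model: a 1-D grid where stepping off the left
# edge leaves the agent in place, so "left" and "no-op" are equivalent there.
def toy_forward_model(state, action):
    next_pos = int(np.clip(state[0] + action, 0, 4))
    return np.array([next_pos], dtype=float)

if __name__ == "__main__":
    boundary_state = np.array([0.0])
    print(group_equivalent_actions(boundary_state, actions=[-1, 0, 1],
                                   forward_model=toy_forward_model))
    # -> [[-1, 0], [1]]: two of the three actions collapse to one abstract action.
```

In this toy case the state-action space at the boundary state shrinks from three pairs to two abstract ones, mirroring the abstract's claim that the reduction can approach a factor of the original action-space cardinality when many actions are dynamically equivalent.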