Paper Title

Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning

Paper Authors

Chenjia Bai, Peng Liu, Kaiyu Liu, Lingxiao Wang, Yingnan Zhao, Lei Han

Paper Abstract

Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from environments are sparse or even totally disregarded. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on conditional variational inference to model the multimodality and stochasticity. We consider the environmental state-action transition as a conditional generative process, generating the next-state prediction conditioned on the current state, action, and a latent variable, which provides a better understanding of the dynamics and leads to better exploration performance. We derive an upper bound on the negative log-likelihood of the environmental transition and use this upper bound as the intrinsic reward for exploration, which allows the agent to learn skills by self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulation task. Our method outperforms several state-of-the-art environment-model-based exploration approaches.
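
The upper bound referenced in the abstract can be written in the standard conditional-ELBO form. This is a reconstruction from the abstract's description rather than a formula copied from the paper, and the notation (state $s_t$, action $a_t$, latent $z$, networks $\theta$, $\phi$) is assumed:

```latex
% Negative conditional ELBO: an upper bound on the negative log-likelihood
% of a transition, usable as an intrinsic reward for exploration.
-\log p_\theta(s_{t+1} \mid s_t, a_t)
  \le \mathbb{E}_{q_\phi(z \mid s_t, a_t, s_{t+1})}
      \left[ -\log p_\theta(s_{t+1} \mid s_t, a_t, z) \right]
    + D_{\mathrm{KL}}\!\left( q_\phi(z \mid s_t, a_t, s_{t+1})
      \,\middle\|\, p_\theta(z \mid s_t, a_t) \right)
```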
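
Below is a minimal PyTorch-style sketch of such a variational dynamic model with flat state vectors. It illustrates the technique under the bound above, not the authors' implementation; all module names, layer sizes, and the unit-variance Gaussian decoder are hypothetical assumptions:

```python
# A minimal sketch (not the authors' code) of a conditional-VAE dynamic model
# whose negative ELBO upper-bounds -log p(s'|s,a) and serves as an intrinsic
# reward, as the abstract describes. Architecture and sizes are illustrative.
import torch
import torch.nn as nn

class VariationalDynamics(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        # Posterior q(z | s, a, s'): conditioned on the observed next state.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim * 2 + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        # Prior p(z | s, a): conditioned on the current state and action only.
        self.prior = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim))
        # Decoder p(s' | s, a, z): generates the next-state prediction.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a, s_next):
        mu_q, logvar_q = self.encoder(torch.cat([s, a, s_next], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(torch.cat([s, a], -1)).chunk(2, -1)
        # Reparameterized sample from the posterior.
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        s_pred = self.decoder(torch.cat([s, a, z], -1))
        # Reconstruction term: -log p(s'|s,a,z) under a unit-variance Gaussian
        # decoder (an assumption), up to an additive constant.
        recon = 0.5 * (s_pred - s_next).pow(2).sum(-1)
        # KL(q(z|s,a,s') || p(z|s,a)) between two diagonal Gaussians.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q - logvar_p).exp()
                    + (mu_q - mu_p).pow(2) / logvar_p.exp()
                    - 1).sum(-1)
        # Negative ELBO: the upper bound on -log p(s'|s,a).
        return recon + kl
```

In use, the returned negative ELBO of each observed transition (detached from the computation graph) would serve as the intrinsic reward, so transitions the model predicts poorly are rewarded more, while the same quantity is minimized to train the model.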
