Paper Title
Hindsight Expectation Maximization for Goal-conditioned Reinforcement Learning
Paper Authors
Paper Abstract
We propose a graphical model framework for goal-conditioned RL, with an EM algorithm that operates on the lower bound of the RL objective. The E-step provides a natural interpretation of how 'learning in hindsight' techniques, such as HER, handle extremely sparse goal-conditioned rewards. The M-step reduces policy optimization to supervised learning updates, which greatly stabilizes end-to-end training on high-dimensional inputs such as images. We show that the combined algorithm, hEM, significantly outperforms model-free baselines on a wide range of goal-conditioned benchmarks with sparse rewards.
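The two steps described in the abstract can be illustrated with a minimal sketch: an E-step that relabels each trajectory with the goal it actually achieved (as in HER), followed by an M-step that performs a supervised, maximum-likelihood update of the goal-conditioned policy. Everything below (the toy transition rule, the tabular policy, the final-state relabeling rule, and the learning rate) is an illustrative assumption, not the paper's actual implementation.

```python
import numpy as np

# Hedged sketch of a hindsight E-step / supervised M-step loop.
# The environment, goal-sampling rule, and tabular policy are assumptions
# made for illustration only.

rng = np.random.default_rng(0)
n_states, n_actions = 8, 4

# Tabular goal-conditioned policy: logits[goal, state, action].
logits = np.zeros((n_states, n_states, n_actions))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def rollout(goal, horizon=10):
    """Collect one trajectory under the current policy (toy dynamics)."""
    s, states, actions = 0, [], []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=softmax(logits[goal, s]))
        states.append(s)
        actions.append(a)
        s = (s + a) % n_states            # assumed toy transition rule
    return states, actions, s

for it in range(200):
    goal = rng.integers(n_states)
    states, actions, achieved = rollout(goal)

    # E-step (sketch): relabel the trajectory with the goal it actually
    # achieved, so the otherwise-sparse reward signal becomes informative.
    hindsight_goal = achieved

    # M-step (sketch): supervised update -- raise the log-likelihood of the
    # actions taken, conditioned on the hindsight goal (gradient ascent).
    for s, a in zip(states, actions):
        probs = softmax(logits[hindsight_goal, s])
        grad = -probs
        grad[a] += 1.0                    # gradient of log softmax at action a
        logits[hindsight_goal, s] += 0.1 * grad
```

In this sketch the M-step is a plain log-likelihood update on relabeled data, which is what lets the policy be trained with supervised-learning machinery rather than a high-variance policy-gradient estimator; the paper's actual objective and parameterization differ in detail.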