Paper Title
Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient
Paper Authors
Paper Abstract
The exploration-exploitation dilemma has long been a crucial issue in reinforcement learning. In this paper, we propose a new approach to automatically balance the two. Our method is built upon the Soft Actor-Critic (SAC) algorithm, which uses an "entropy temperature" to balance the original task reward against the policy entropy, and hence controls the trade-off between exploitation and exploration. It has been shown empirically that SAC is very sensitive to this hyperparameter, and the follow-up work (SAC-v2), which uses constrained optimization for automatic adjustment, has some limitations. The core of our method, namely Meta-SAC, is to use a metagradient along with a novel meta objective to automatically tune the entropy temperature in SAC. We show that Meta-SAC achieves promising performance on several of the MuJoCo benchmarking tasks, and outperforms SAC-v2 by over 10% on one of the most challenging tasks, humanoid-v2.
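To make the abstract's central idea concrete, below is a minimal PyTorch sketch of the general metagradient mechanism it alludes to: an inner SAC-style actor update that depends on the entropy temperature, followed by an outer step that differentiates a meta objective through that update with respect to the temperature. This is an illustration of the technique, not the paper's actual implementation; in particular, GaussianPolicy, q_fn, inner_lr, and the placeholder meta objective (Q-value of the updated policy on held-out states) are assumptions made here for the sake of a runnable example, and the paper's novel meta objective differs.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

class GaussianPolicy(nn.Module):
    """Toy Gaussian policy over a 1-D action, for illustration only."""
    def __init__(self, obs_dim=4, act_dim=1):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        dist = torch.distributions.Normal(self.mean(obs), self.log_std.exp())
        action = dist.rsample()                       # reparameterized sample
        return action, dist.log_prob(action).sum(-1)

def q_fn(obs, action):
    # Stand-in for a learned Q-function; a fixed quadratic critic for the demo.
    return -(action ** 2).sum(-1) + obs.sum(-1)

policy = GaussianPolicy()
log_alpha = torch.zeros(1, requires_grad=True)        # entropy temperature, tuned by metagradient
alpha_opt = torch.optim.Adam([log_alpha], lr=1e-3)

obs = torch.randn(32, 4)                              # fake batch of observations

# ---- Inner step: one SAC-style actor update, which depends on alpha ----
action, log_prob = policy(obs)
actor_loss = (log_alpha.exp() * log_prob - q_fn(obs, action)).mean()
params = dict(policy.named_parameters())
grads = torch.autograd.grad(actor_loss, list(params.values()), create_graph=True)
inner_lr = 3e-4                                       # illustrative inner learning rate
updated_params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}

# ---- Outer step: differentiate a meta objective through the inner update ----
# Placeholder meta objective: maximize the Q-value of the updated policy's
# actions on a held-out batch (the paper's actual meta objective is different).
meta_obs = torch.randn(32, 4)
meta_action, _ = functional_call(policy, updated_params, (meta_obs,))
meta_loss = -q_fn(meta_obs, meta_action).mean()

alpha_opt.zero_grad()
meta_loss.backward()                                  # gradient flows back into log_alpha
alpha_opt.step()                                      # only the temperature is updated here
```

Because the inner actor gradient is scaled by exp(log_alpha) and is built with create_graph=True, the updated policy parameters remain differentiable functions of the temperature, so the outer backward pass yields the metagradient of the meta objective with respect to log_alpha.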