Paper Title
Meta-SAC: Auto-tune the Entropy Temperature of Soft Actor-Critic via Metagradient
Paper Authors
Paper Abstract
The exploration-exploitation dilemma has long been a crucial issue in reinforcement learning. In this paper, we propose a new approach to automatically balance the two. Our method is built upon the Soft Actor-Critic (SAC) algorithm, which uses an "entropy temperature" to balance the original task reward against the policy entropy, and hence controls the trade-off between exploitation and exploration. It has been shown empirically that SAC is very sensitive to this hyperparameter, and the follow-up work (SAC-v2), which uses constrained optimization for automatic adjustment, has some limitations. The core of our method, namely Meta-SAC, is to use a metagradient along with a novel meta objective to automatically tune the entropy temperature in SAC. We show that Meta-SAC achieves promising performance on several of the MuJoCo benchmarking tasks, and outperforms SAC-v2 by over 10% on one of the most challenging tasks, humanoid-v2.
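To make the abstract's central idea concrete, below is a minimal PyTorch sketch of the general metagradient mechanism it alludes to: an inner SAC-style actor update that depends on the entropy temperature, followed by an outer step that differentiates a meta objective through that update with respect to the temperature. This is an illustration of the technique, not the paper's actual implementation; in particular, GaussianPolicy, q_fn, inner_lr, and the placeholder meta objective (Q-value of the updated policy on held-out states) are assumptions made here for the sake of a runnable example, and the paper's novel meta objective differs.

```python
import torch
import torch.nn as nn
from torch.func import functional_call

class GaussianPolicy(nn.Module):
    """Toy Gaussian policy over a 1-D action, for illustration only."""
    def __init__(self, obs_dim=4, act_dim=1):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        dist = torch.distributions.Normal(self.mean(obs), self.log_std.exp())
        action = dist.rsample()                       # reparameterized sample
        return action, dist.log_prob(action).sum(-1)

def q_fn(obs, action):
    # Stand-in for a learned Q-function; a fixed quadratic critic for the demo.
    return -(action ** 2).sum(-1) + obs.sum(-1)

policy = GaussianPolicy()
log_alpha = torch.zeros(1, requires_grad=True)        # entropy temperature, tuned by metagradient
alpha_opt = torch.optim.Adam([log_alpha], lr=1e-3)

obs = torch.randn(32, 4)                              # fake batch of observations

# ---- Inner step: one SAC-style actor update, which depends on alpha ----
action, log_prob = policy(obs)
actor_loss = (log_alpha.exp() * log_prob - q_fn(obs, action)).mean()
params = dict(policy.named_parameters())
grads = torch.autograd.grad(actor_loss, list(params.values()), create_graph=True)
inner_lr = 3e-4                                       # illustrative inner learning rate
updated_params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}

# ---- Outer step: differentiate a meta objective through the inner update ----
# Placeholder meta objective: maximize the Q-value of the updated policy's
# actions on a held-out batch (the paper's actual meta objective is different).
meta_obs = torch.randn(32, 4)
meta_action, _ = functional_call(policy, updated_params, (meta_obs,))
meta_loss = -q_fn(meta_obs, meta_action).mean()

alpha_opt.zero_grad()
meta_loss.backward()                                  # gradient flows back into log_alpha
alpha_opt.step()                                      # only the temperature is updated here
```

Because the inner actor gradient is scaled by exp(log_alpha) and is built with create_graph=True, the updated policy parameters remain differentiable functions of the temperature, so the outer backward pass yields the metagradient of the meta objective with respect to log_alpha.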