Paper Title

A Self-Tuning Actor-Critic Algorithm

Paper Authors

Tom Zahavy, Zhongwen Xu, Vivek Veeriah, Matteo Hessel, Junhyuk Oh, Hado van Hasselt, David Silver, Satinder Singh

Paper Abstract

Reinforcement learning algorithms are highly sensitive to the choice of hyperparameters, typically requiring significant manual effort to identify hyperparameters that perform well on a new domain. In this paper, we take a step towards addressing this issue by using metagradients to automatically adapt hyperparameters online by meta-gradient descent (Xu et al., 2018). We apply our algorithm, Self-Tuning Actor-Critic (STAC), to self-tune all the differentiable hyperparameters of an actor-critic loss function, to discover auxiliary tasks, and to improve off-policy learning using a novel leaky V-trace operator. STAC is simple to use, sample efficient, and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improves as we adapt more hyperparameters. When applied to the Arcade Learning Environment (Bellemare et al., 2012), STAC improved the median human normalized score in 200M steps from 243% to 364%. When applied to the DM Control suite (Tassa et al., 2018), STAC improved the mean score in 30M steps from 217 to 389 when learning with features, from 108 to 202 when learning from pixels, and from 195 to 295 in the Real-World Reinforcement Learning Challenge (Dulac-Arnold et al., 2020).
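For readers who want a concrete picture of the self-tuning idea, below is a minimal, hypothetical sketch (not the authors' implementation) of meta-gradient hyperparameter adaptation in the spirit of Xu et al. (2018): a differentiable hyperparameter is updated by differentiating an outer ("meta") objective through one inner gradient step on the agent parameters. STAC applies this to the differentiable hyperparameters of an actor-critic loss; here a toy regression loss and a single auxiliary-loss weight stand in purely for illustration, and all names (inner_loss, outer_loss, eta) are invented for this sketch.

```python
# Minimal sketch of meta-gradient hyperparameter tuning (Xu et al., 2018 style).
# This is an illustrative toy, not the STAC implementation from the paper.
import jax
import jax.numpy as jnp

def inner_loss(theta, eta, batch):
    """Toy inner loss: eta (a meta-learned hyperparameter) weights an auxiliary term."""
    x, y = batch
    pred = x @ theta
    main = jnp.mean((pred - y) ** 2)          # stand-in for the policy/value loss
    aux = jnp.mean(pred ** 2)                 # stand-in for an auxiliary-task loss
    return main + jax.nn.sigmoid(eta) * aux   # sigmoid keeps the weight in (0, 1)

def outer_loss(eta, theta, batch, lr=0.1):
    """Meta loss: main objective evaluated after one differentiable inner update."""
    grads = jax.grad(inner_loss)(theta, eta, batch)
    theta_new = theta - lr * grads            # inner step, differentiable w.r.t. eta
    x, y = batch
    return jnp.mean((x @ theta_new - y) ** 2)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 4))
y = x @ jnp.array([1.0, -2.0, 0.5, 0.0])
theta = jnp.zeros(4)
eta = jnp.array(0.0)

for _ in range(100):
    # Meta-gradient step on the hyperparameter, then an ordinary step on theta.
    eta = eta - 0.5 * jax.grad(outer_loss)(eta, theta, (x, y))
    theta = theta - 0.1 * jax.grad(inner_loss)(theta, eta, (x, y))

print("learned auxiliary-loss weight:", float(jax.nn.sigmoid(eta)))
```

The sigmoid squashing mirrors the common trick of keeping meta-learned coefficients in a bounded range; the actual loss terms, auxiliary heads, and the leaky V-trace off-policy correction used by STAC are described in the paper itself.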
