Paper Title

Stabilizing Off-Policy Deep Reinforcement Learning from Pixels

Paper Authors

Cetin, Edoardo; Ball, Philip J.; Roberts, Steve; Celiktutan, Oya

Paper Abstract

Off-policy reinforcement learning (RL) from pixel observations is notoriously unstable. As a result, many successful algorithms must combine different domain-specific practices and auxiliary losses to learn meaningful behaviors in complex environments. In this work, we provide novel analysis demonstrating that these instabilities arise from performing temporal-difference learning with a convolutional encoder and low-magnitude rewards. We show that this new visual deadly triad causes unstable training and premature convergence to degenerate solutions, a phenomenon we name catastrophic self-overfitting. Based on our analysis, we propose A-LIX, a method providing adaptive regularization to the encoder's gradients that explicitly prevents the occurrence of catastrophic self-overfitting using a dual objective. By applying A-LIX, we significantly outperform the prior state-of-the-art on the DeepMind Control and Atari 100k benchmarks without any data augmentation or auxiliary losses.
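Since the abstract describes A-LIX only at a high level, the following is a minimal PyTorch sketch of the underlying regularization idea: smoothing a convolutional encoder's feature maps, and hence the gradients backpropagated through them, via local spatial mixing. The function name and the fixed `max_shift` magnitude are illustrative assumptions, not the authors' implementation; in A-LIX the strength of this regularization is tuned adaptively through a dual objective rather than held constant.

```python
import torch
import torch.nn.functional as F


def local_signal_mixing(features, max_shift=1.0):
    """Hypothetical sketch: perturb each spatial location of a conv feature
    map by a small random offset and bilinearly resample. The forward pass is
    only mildly changed, but the backpropagated gradients are spread over
    neighboring spatial positions, smoothing the encoder's gradient field.

    features: tensor of shape (B, C, H, W)
    max_shift: maximum perturbation in feature-map pixels (fixed here for
        illustration; A-LIX adapts this magnitude with a dual objective).
    """
    B, C, H, W = features.shape

    # Base sampling grid in the normalized [-1, 1] coordinates grid_sample expects.
    ys = torch.linspace(-1.0, 1.0, H, device=features.device, dtype=features.dtype)
    xs = torch.linspace(-1.0, 1.0, W, device=features.device, dtype=features.dtype)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
    base_grid = torch.stack((grid_x, grid_y), dim=-1)          # (H, W, 2), (x, y) order
    base_grid = base_grid.unsqueeze(0).expand(B, H, W, 2)

    # Independent random offsets per location, rescaled from pixels to normalized units.
    noise = torch.rand(B, H, W, 2, device=features.device, dtype=features.dtype) * 2 - 1
    scale = torch.tensor([W, H], device=features.device, dtype=features.dtype)
    noise = noise * max_shift * 2.0 / scale

    # Bilinear resampling mixes each feature with its spatial neighbors.
    return F.grid_sample(features, base_grid + noise, mode="bilinear",
                         padding_mode="border", align_corners=True)
```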
