Paper Title
Bootstrapped Transformer for Offline Reinforcement Learning
Paper Authors
Paper Abstract
Offline reinforcement learning (RL) aims at learning policies from previously collected static trajectory data without interacting with the real environment. Recent works provide a novel perspective by viewing offline RL as a generic sequence generation problem, adopting sequence models such as the Transformer architecture to model distributions over trajectories, and repurposing beam search as a planning algorithm. However, the training datasets utilized in general offline RL tasks are quite limited and often suffer from insufficient distribution coverage, which can be harmful to training sequence generation models yet has not drawn enough attention in previous works. In this paper, we propose a novel algorithm named Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost the sequence model training. We conduct extensive experiments on two offline RL benchmarks and demonstrate that our model can largely remedy the existing offline RL training limitations and beat other strong baseline methods. We also analyze the generated pseudo data, whose characteristics may shed some light on offline RL training. The code is available at https://seqml.github.io/bootorl.
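The core loop described in the abstract (train a sequence model on the offline trajectories, use it to self-generate pseudo trajectories, and fold the confident generations back into the training set) can be sketched in miniature. The snippet below is a hedged illustration, not the paper's implementation: it substitutes a toy bigram token model for the Transformer, and the `keep_threshold` confidence filter and the `bootstrap` helper are illustrative names of our own, chosen only to make the bootstrapping idea concrete.

```python
import random
from collections import defaultdict


class BigramSeqModel:
    """Toy autoregressive sequence model (a stand-in for the Transformer)."""

    def __init__(self):
        # counts[a][b] = how often token b followed token a in the data
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, trajectories):
        self.counts.clear()
        for traj in trajectories:
            for a, b in zip(traj, traj[1:]):
                self.counts[a][b] += 1

    def next_dist(self, token):
        nxt = self.counts[token]
        total = sum(nxt.values())
        return {t: c / total for t, c in nxt.items()} if total else {}

    def generate(self, start, length, rng):
        # Autoregressive sampling: the model "dreams" a pseudo trajectory.
        traj = [start]
        for _ in range(length - 1):
            dist = self.next_dist(traj[-1])
            if not dist:
                break
            tokens, probs = zip(*dist.items())
            traj.append(rng.choices(tokens, probs)[0])
        return traj

    def avg_step_prob(self, traj):
        # Mean per-step probability, used as a crude confidence score.
        probs = [self.next_dist(a).get(b, 0.0) for a, b in zip(traj, traj[1:])]
        return sum(probs) / len(probs) if probs else 0.0


def bootstrap(dataset, rounds=2, n_generate=8, keep_threshold=0.5, seed=0):
    """Alternate between fitting the model and augmenting the data with
    self-generated trajectories the model itself rates as likely."""
    rng = random.Random(seed)
    model = BigramSeqModel()
    data = list(dataset)
    for _ in range(rounds):
        model.fit(data)
        pseudo = [model.generate(rng.choice(data)[0], len(data[0]), rng)
                  for _ in range(n_generate)]
        # Confidence filter: keep only generations above the threshold,
        # so low-quality pseudo data does not pollute training.
        data += [t for t in pseudo if model.avg_step_prob(t) >= keep_threshold]
    model.fit(data)
    return model, data
```

In the paper this role is played by the learned Transformer itself, with beam search repurposed for planning; the sketch only shows why bootstrapping helps when distribution coverage is thin: the augmented dataset contains extra trajectories drawn from the model's own learned distribution.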