Paper Title

DEFT: Diverse Ensembles for Fast Transfer in Reinforcement Learning

Paper Authors

Simeon Adebola, Satvik Sharma, Kaushik Shivakumar

Paper Abstract

Deep ensembles have been shown to extend the positive effect seen in typical ensemble learning to neural networks and to reinforcement learning (RL). However, there is still much to be done to improve the efficiency of such ensemble models. In this work, we present Diverse Ensembles for Fast Transfer in RL (DEFT), a new ensemble-based method for reinforcement learning in highly multimodal environments and improved transfer to unseen environments. The algorithm is broken down into two main phases: training of ensemble members, and synthesis (or fine-tuning) of the ensemble members into a policy that works in a new environment. The first phase of the algorithm involves training regular policy gradient or actor-critic agents in parallel but adding a term to the loss that encourages these policies to differ from each other. This causes the individual unimodal agents to explore the space of optimal policies and capture more of the multimodality of the environment than a single actor could. The second phase of DEFT involves synthesizing the component policies into a new policy that works well in a modified environment in one of two ways. To evaluate the performance of DEFT, we start with a base version of the Proximal Policy Optimization (PPO) algorithm and extend it with the modifications for DEFT. Our results show that the pretraining phase is effective in producing diverse policies in multimodal environments. DEFT often converges to a high reward significantly faster than alternatives, such as random initialization without DEFT and fine-tuning of ensemble members. While there is certainly more work to be done to analyze DEFT theoretically and extend it to be even more robust, we believe it provides a strong framework for capturing multimodality in environments while still using RL methods with simple policy representations.
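
The phase-1 mechanism described above (ordinary policy-gradient or PPO updates plus a loss term that encourages ensemble members to differ from one another) can be made concrete with a short sketch. The following is a minimal illustration under stated assumptions, not the authors' implementation: the pairwise KL divergence used as the diversity measure, the PolicyNet and phase1_update names, the ppo_loss_fn placeholder, and the beta weight are all assumptions, since the abstract does not specify the exact form of the diversity loss or of the phase-2 synthesis step.

```python
# Minimal sketch of DEFT's phase 1 (diversity-regularized ensemble training).
# Assumptions, not the paper's implementation: diversity is measured as the mean
# pairwise KL divergence between members' action distributions on a shared batch
# of states, and the PPO surrogate loss is left as a placeholder (ppo_loss_fn).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Small categorical policy; stands in for the actor of one PPO agent."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs):
        return F.log_softmax(self.net(obs), dim=-1)  # log action probabilities


def phase1_update(policies, optimizers, obs_batch, ppo_loss_fn, beta=0.1):
    """One phase-1 step: each member's usual RL loss minus a weighted diversity bonus.

    ppo_loss_fn(policy) is a stand-in for that member's clipped PPO surrogate
    loss on its own rollouts; it is not implemented here.
    """
    n = len(policies)
    for k, (policy, opt) in enumerate(zip(policies, optimizers)):
        log_p_k = policy(obs_batch)
        div = 0.0
        for j, other in enumerate(policies):
            if j == k:
                continue
            with torch.no_grad():          # other members act as fixed targets
                log_p_j = other(obs_batch)
            # KL(pi_k || pi_j): larger values mean member k differs more from member j.
            div = div + F.kl_div(log_p_j, log_p_k, log_target=True,
                                 reduction="batchmean")
        loss = ppo_loss_fn(policy) - beta * div / max(n - 1, 1)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Phase 2 (not shown): pick or combine the trained members and fine-tune the
# resulting policy in the modified environment with standard PPO updates.


if __name__ == "__main__":
    obs = torch.randn(32, 8)                          # dummy batch of observations
    policies = [PolicyNet(8, 3) for _ in range(4)]    # 4 ensemble members
    opts = [torch.optim.Adam(p.parameters(), lr=3e-4) for p in policies]
    # Stand-in for the PPO surrogate so the demo runs: a negative-entropy term.
    dummy_loss = lambda pi: (pi(obs).exp() * pi(obs)).sum(dim=-1).mean()
    phase1_update(policies, opts, obs, dummy_loss, beta=0.1)
```

In this sketch each member treats the other members' action distributions as fixed targets to diverge from, so per-member updates stay independent and can run in parallel, matching the parallel training described in the abstract; phase 2 would then fine-tune (or otherwise synthesize) the resulting members in the modified environment.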
