Paper Title
Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer
Paper Authors
Paper Abstract
In many real-world applications, reinforcement learning (RL) agents might have to solve multiple tasks, each one typically modeled via a reward function. If reward functions are expressed linearly, and the agent has previously learned a set of policies for different tasks, successor features (SFs) can be exploited to combine such policies and identify reasonable solutions for new problems. However, the identified solutions are not guaranteed to be optimal. We introduce a novel algorithm that addresses this limitation. It allows RL agents to combine existing policies and directly identify optimal policies for arbitrary new problems, without requiring any further interactions with the environment. We first show (under mild assumptions) that the transfer learning problem tackled by SFs is equivalent to the problem of learning to optimize multiple objectives in RL. We then introduce an SF-based extension of the Optimistic Linear Support algorithm to learn a set of policies whose SFs form a convex coverage set. We prove that policies in this set can be combined via generalized policy improvement to construct optimal behaviors for any new linearly-expressible tasks, without requiring any additional training samples. We empirically show that our method outperforms state-of-the-art competing algorithms both in discrete and continuous domains under value function approximation.
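The abstract's core mechanism, combining existing policies via generalized policy improvement (GPI) over successor features, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes a tabular setting, a hypothetical array `psi` holding each known policy's successor features, and a linear reward `r(s, a) = phi(s, a) . w` for a new task with weight vector `w`.

```python
import numpy as np

def gpi_action(psi, w, s):
    """Pick the GPI action at state s: argmax_a max_i psi_i(s, a) . w.

    psi : array of shape (n_policies, n_states, n_actions, d), where
          psi[i, s, a] is the successor-feature vector of policy i.
    w   : reward-weight vector (shape (d,)) of the new task.
    """
    # Q-values of every known policy i for every action a on the new task,
    # obtained without further environment interaction (shape: n_policies x n_actions).
    q = psi[:, s, :, :] @ w
    # GPI: act greedily with respect to the best known policy's value.
    return int(q.max(axis=0).argmax())

# Toy example with made-up numbers: 2 known policies, 3 states,
# 2 actions, 2-dimensional reward features.
rng = np.random.default_rng(0)
psi = rng.standard_normal((2, 3, 2, 2))
w = np.array([1.0, -0.5])  # weights defining a hypothetical new task
a = gpi_action(psi, w, s=0)
```

The paper's contribution is the guarantee that if the stored policies' SFs form a convex coverage set (computed via its Optimistic Linear Support extension), this combination step is optimal for any such `w`, rather than merely a reasonable lower bound.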