Paper Title
Task-agnostic Exploration in Reinforcement Learning
Paper Authors
Paper Abstract
Efficient exploration is one of the main challenges in reinforcement learning (RL). Most existing sample-efficient algorithms assume the existence of a single reward function during exploration. In many practical scenarios, however, there is not a single underlying reward function to guide the exploration, for instance, when an agent needs to learn many skills simultaneously, or multiple conflicting objectives need to be balanced. To address these challenges, we propose the \textit{task-agnostic RL} framework: In the exploration phase, the agent first collects trajectories by exploring the MDP without the guidance of a reward function. After exploration, it aims at finding near-optimal policies for $N$ tasks, given the collected trajectories augmented with \textit{sampled rewards} for each task. We present an efficient task-agnostic RL algorithm, \textsc{UCBZero}, that finds $\epsilon$-optimal policies for $N$ arbitrary tasks after at most $\tilde O(\log(N)H^5SA/\epsilon^2)$ exploration episodes. We also provide an $\Omega(\log(N)H^2SA/\epsilon^2)$ lower bound, showing that the $\log$ dependency on $N$ is unavoidable. Furthermore, we provide an $N$-independent sample complexity bound of \textsc{UCBZero} in the statistically easier setting when the ground truth reward functions are known.
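To make the two-phase protocol concrete, below is a minimal Python sketch of the task-agnostic interface described in the abstract: a reward-free exploration phase that collects trajectories from a tabular MDP, followed by per-task planning on those trajectories augmented with sampled rewards. This is only an illustration under simplifying assumptions; the count-based exploration bonus, the random sampled rewards, and all names (make_mdp, explore, plan_for_task) are stand-ins of our own, not the paper's \textsc{UCBZero} algorithm or its analysis.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_mdp(S, A):
    """Random tabular MDP: a transition kernel only, with no reward attached."""
    # P[s, a] is a probability distribution over next states.
    return rng.dirichlet(np.ones(S), size=(S, A))


def explore(P, S, A, H, num_episodes):
    """Phase 1: reward-free exploration.

    A simple count-based bonus drives the agent toward rarely visited
    (h, s, a) triples; this is only a stand-in for UCBZero's bonus."""
    counts = np.zeros((H, S, A))
    trajectories = []
    for _ in range(num_episodes):
        s, traj = 0, []
        for h in range(H):
            bonus = 1.0 / np.sqrt(np.maximum(counts[h, s], 1.0))
            a = int(np.argmax(bonus + 1e-9 * rng.random(A)))  # random tie-break
            s_next = int(rng.choice(S, p=P[s, a]))
            traj.append((h, s, a, s_next))
            counts[h, s, a] += 1
            s = s_next
        trajectories.append(traj)
    return trajectories


def plan_for_task(trajectories, reward, S, A, H):
    """Phase 2: augment the collected trajectories with one task's rewards,
    fit an empirical transition model, and plan by backward value iteration."""
    visits = np.zeros((S, A))
    next_counts = np.zeros((S, A, S))
    for traj in trajectories:
        for _, s, a, s_next in traj:
            visits[s, a] += 1
            next_counts[s, a, s_next] += 1
    P_hat = next_counts / np.maximum(visits, 1)[:, :, None]
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward[h] + P_hat @ V  # Q[s, a] = r_h(s, a) + E[V(s')]
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi


# One exploration phase is reused for all N downstream tasks.
S, A, H, N_tasks = 10, 4, 5, 3
P = make_mdp(S, A)
trajs = explore(P, S, A, H, num_episodes=2000)
for task in range(N_tasks):
    sampled_reward = rng.random((H, S, A))  # stands in for the sampled rewards
    policy = plan_for_task(trajs, sampled_reward, S, A, H)
    print(f"task {task}: greedy action at (h=0, s=0) is {policy[0, 0]}")
```

The key point of the interface is that the single exploration dataset is shared across all $N$ tasks, which is why the abstract's sample complexity bound only pays a $\log(N)$ factor rather than collecting fresh data per task.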