Title

Does Zero-Shot Reinforcement Learning Exist?

Authors

Ahmed Touati, Jérémy Rapin, Yann Ollivier

Abstract

A zero-shot RL agent is an agent that can solve any RL task in a given environment, instantly with no additional planning or learning, after an initial reward-free learning phase. This marks a shift from the reward-centric RL paradigm towards "controllable" agents that can follow arbitrary instructions in an environment. Current RL agents can solve families of related tasks at best, or require planning anew for each task. Strategies for approximate zero-shot RL have been suggested using successor features (SFs) [BBQ+18] or forward-backward (FB) representations [TO21], but testing has been limited. After clarifying the relationships between these schemes, we introduce improved losses and new SF models, and test the viability of zero-shot RL schemes systematically on tasks from the Unsupervised RL benchmark [LYL+21]. To disentangle universal representation learning from exploration, we work in an offline setting and repeat the tests on several existing replay buffers. SFs appear to suffer from the choice of the elementary state features. SFs with Laplacian eigenfunctions do well, while SFs based on auto-encoders, inverse curiosity, transition models, low-rank transition matrices, contrastive learning, or diversity (APS) perform inconsistently. In contrast, FB representations jointly learn the elementary and successor features from a single, principled criterion. They perform best and consistently across the board, reaching 85% of supervised RL performance with a good replay buffer, in a zero-shot manner.
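For context, the two zero-shot mechanisms mentioned in the abstract can be summarized with the standard formulas from the SF and FB literature. The sketch below is our reading of the cited papers, not text from the abstract itself; the notation ($\varphi$ for elementary state features, $\psi$ for successor features, $F$/$B$ for the forward-backward factorization, $\rho$ for the data distribution) follows the usual conventions and is assumed here.

Successor features (SFs): given reward-free state features $\varphi\colon S \to \mathbb{R}^d$, the successor features of a policy $\pi_z$ are
$$\psi^{\pi_z}(s,a) \;=\; \mathbb{E}\Big[\textstyle\sum_{t\ge 0} \gamma^t \varphi(s_{t+1}) \,\Big|\, s_0=s,\ a_0=a,\ \pi_z\Big].$$
When a new reward $r$ is revealed, it is regressed linearly onto the features, $r(s) \approx \varphi(s)^\top z_r$, and the agent acts with $\pi_{z_r}(s) = \arg\max_a \psi^{\pi_{z_r}}(s,a)^\top z_r$, with no further planning or learning.

Forward-backward (FB) representations: learn $F\colon S\times A\times Z \to \mathbb{R}^d$ and $B\colon S \to \mathbb{R}^d$ such that, for the policies $\pi_z(s) = \arg\max_a F(s,a,z)^\top z$, the successor measure factorizes as
$$M^{\pi_z}(s,a,\mathrm{d}s') \;\approx\; F(s,a,z)^\top B(s')\, \rho(\mathrm{d}s').$$
At test time, set $z_r = \mathbb{E}_{s\sim\rho}\big[r(s)\,B(s)\big]$ and use $Q^{\pi_{z_r}}(s,a) \approx F(s,a,z_r)^\top z_r$; both $F$ and $B$ are learned jointly from this single criterion during the reward-free phase, which is the contrast with SFs drawn in the abstract.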
