Paper Title

Hyperparameter Selection for Offline Reinforcement Learning

Paper Authors

Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, Nando de Freitas

Paper Abstract

Offline reinforcement learning (RL purely from logged data) is an important avenue for deploying RL techniques in real-world scenarios. However, existing hyperparameter selection methods for offline RL break the offline assumption by evaluating policies corresponding to each hyperparameter setting in the environment. This online execution is often infeasible and hence undermines the main aim of offline RL. Therefore, in this work, we focus on offline hyperparameter selection, i.e. methods for choosing the best policy from a set of many policies trained using different hyperparameters, given only logged data. Through large-scale empirical evaluation we show that: 1) offline RL algorithms are not robust to hyperparameter choices, 2) factors such as the offline RL algorithm and method for estimating Q values can have a big impact on hyperparameter selection, and 3) when we control those factors carefully, we can reliably rank policies across hyperparameter choices, and therefore choose policies which are close to the best policy in the set. Overall, our results present an optimistic view that offline hyperparameter selection is within reach, even in challenging tasks with pixel observations, high dimensional action spaces, and long horizon.
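
The selection recipe described in the abstract (train a set of policies with different hyperparameters, score each one on the logged data with an off-policy value estimate such as a learned Q function, then pick the top-ranked policy) can be sketched as below. This is a minimal illustration assuming each candidate policy already has a critic fitted to the logged data, for instance via Fitted Q Evaluation; the function and variable names (`candidates`, `policy_fn`, `q_fn`, `initial_states`) are hypothetical, not an API from the paper.

```python
# Minimal sketch of offline hyperparameter selection via off-policy value
# estimates, assuming each candidate policy comes with a critic fitted to
# the logged data. All names here are illustrative assumptions, not taken
# from the paper.
import numpy as np

def estimate_policy_value(policy_fn, q_fn, initial_states):
    """Average Q(s, pi(s)) over logged initial states, used as a scalar policy score."""
    scores = [q_fn(s, policy_fn(s)) for s in initial_states]
    return float(np.mean(scores))

def select_policy(candidates, initial_states):
    """Score a list of (policy_fn, q_fn) pairs trained with different
    hyperparameters and return the index of the best one plus all scores."""
    scores = [estimate_policy_value(p, q, initial_states) for p, q in candidates]
    return int(np.argmax(scores)), scores
```

The score is only used to rank candidates against each other, not as an unbiased estimate of return; per point 2) of the abstract, how the Q values are estimated and which offline RL algorithm produced the policies strongly affect whether this ranking matches the true ordering of policy performance.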
