Paper Title
What are the Statistical Limits of Offline RL with Linear Function Approximation?
Paper Authors
Paper Abstract
Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of (causal) sequential decision making strategies. The hope is that offline reinforcement learning coupled with function approximation methods (to deal with the curse of dimensionality) can provide a means to help alleviate the excessive sample complexity burden in modern sequential decision making problems. However, the extent to which this broader approach can be effective is not well understood, and the literature largely consists of sufficient conditions. This work focuses on the basic question of what necessary representational and distributional conditions permit provable sample-efficient offline reinforcement learning. Perhaps surprisingly, our main result shows that even if: 1) we have realizability, in that the true value function of \emph{every} policy is linear in a given set of features, and 2) our off-policy data has good coverage over all features (under a strong spectral condition), then any algorithm still (information-theoretically) requires a number of offline samples that is exponential in the problem horizon in order to non-trivially estimate the value of \emph{any} given policy. Our results highlight that sample-efficient offline policy evaluation is simply not possible unless significantly stronger conditions hold; such conditions include either having low distribution shift (where the offline data distribution is close to the distribution of the policy to be evaluated) or significantly stronger representational conditions (beyond realizability).
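The two assumptions in the main result can be read, informally, as a linear-realizability condition and a spectral coverage condition on the offline data. A minimal sketch of how such conditions are commonly formalized is given below; the feature map \(\phi\), dimension \(d\), offline distribution \(\mu\), horizon \(H\), and the constant in the coverage bound are illustrative notation assumed here, not taken verbatim from the abstract.

% Sketch of the two assumed conditions (notation is illustrative, not from the abstract)
\[
\text{(Realizability)}\qquad \forall \pi,\ \exists\, \theta^{\pi}\in\mathbb{R}^{d}:\quad
Q^{\pi}(s,a)=\langle \theta^{\pi},\,\phi(s,a)\rangle \ \ \text{for all } (s,a),
\]
\[
\text{(Coverage, spectral condition)}\qquad
\sigma_{\min}\!\left(\mathbb{E}_{(s,a)\sim\mu}\!\left[\phi(s,a)\,\phi(s,a)^{\top}\right]\right)\ \ge\ \frac{c}{d}
\quad\text{for some constant } c>0.
\]

Under this reading, the lower bound says that even when both conditions hold, any procedure estimating \(V^{\pi}\) from samples drawn under \(\mu\) needs a number of offline samples that grows exponentially in the horizon \(H\) before it can improve on a trivial accuracy guarantee.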