Paper Title
Distributional Robustness and Regularization in Reinforcement Learning
Paper Authors
Paper Abstract
Distributionally Robust Optimization (DRO) has made it possible to prove the equivalence between robustness and regularization in classification and regression, thus providing an analytical reason why regularization generalizes well in statistical learning. Although DRO's extension to sequential decision-making overcomes $\textit{external uncertainty}$ through the robust Markov Decision Process (MDP) setting, the resulting formulation is hard to solve, especially on large domains. On the other hand, existing regularization methods in reinforcement learning only address $\textit{internal uncertainty}$ due to stochasticity. Our study aims to facilitate robust reinforcement learning by establishing a dual relation between robust MDPs and regularization. We introduce Wasserstein distributionally robust MDPs and prove that they enjoy out-of-sample performance guarantees. Then, we introduce a new regularizer for empirical value functions and show that it lower bounds the Wasserstein distributionally robust value function. We extend the result to linear value function approximation for large state spaces. Our approach provides an alternative formulation of robustness with guaranteed finite-sample performance. Moreover, it suggests using regularization as a practical tool for dealing with $\textit{external uncertainty}$ in reinforcement learning methods.
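To make the duality concrete, the following is a rough schematic of the kind of relation the abstract describes; the specific ambiguity set, radius $\epsilon$, and policy-dependent penalty $\kappa_{\pi}$ are illustrative assumptions, not the paper's exact statement.

\[
v_{\mathrm{rob}}(s) \;=\; \max_{\pi} \; \min_{P \in \mathcal{B}_{\epsilon}(\widehat{P})} \; \mathbb{E}^{\pi}_{P}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s\Big],
\qquad
\mathcal{B}_{\epsilon}(\widehat{P}) \;=\; \big\{\, P : W(P, \widehat{P}) \le \epsilon \,\big\},
\]
\[
v_{\mathrm{rob}}(s) \;\ge\; \max_{\pi} \Big( \mathbb{E}^{\pi}_{\widehat{P}}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r(s_t, a_t) \,\Big|\, s_0 = s\Big] \;-\; \epsilon\, \kappa_{\pi} \Big),
\]

where $\widehat{P}$ is the empirical transition kernel estimated from data, $W$ is a Wasserstein distance between transition kernels, $\gamma$ is the discount factor, and $\kappa_{\pi}$ plays the role of the regularizer. Under this reading, maximizing the regularized empirical value function yields a certified lower bound on the worst-case (out-of-sample) performance over the Wasserstein ambiguity ball.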