Paper Title
Robust Reinforcement Learning with Wasserstein Constraint
Paper Authors
Paper Abstract
Robust Reinforcement Learning aims to find the optimal policy with some degree of robustness to the environmental dynamics. Existing learning algorithms usually achieve robustness by disturbing the current state or simulating environmental parameters in a heuristic way, which lacks a quantified notion of robustness to the system dynamics (i.e., the transition probabilities). To overcome this issue, we leverage the Wasserstein distance to measure the disturbance to the reference transition kernel. With the Wasserstein distance, we are able to connect transition kernel disturbance to state disturbance, i.e., to reduce an infinite-dimensional optimization problem to a finite-dimensional risk-aware problem. Through the derived risk-aware optimal Bellman equation, we show the existence of optimal robust policies, provide a sensitivity analysis for the perturbations, and then design a novel robust learning algorithm, the Wasserstein Robust Advantage Actor-Critic (WRAAC) algorithm. The effectiveness of the proposed algorithm is verified in the Cart-Pole environment.
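For orientation only, the following is a minimal sketch, in standard robust-MDP notation rather than the paper's exact formulation, of the kind of Wasserstein-constrained objective and robust Bellman equation the abstract refers to; here $\bar{P}$ denotes the reference transition kernel, $W_p$ a Wasserstein distance of order $p$, and $\epsilon$ the disturbance budget, all of which are assumed symbols for illustration:

\[
\max_{\pi} \; \min_{\substack{P \,:\, W_p\!\left(P(\cdot \mid s,a),\, \bar{P}(\cdot \mid s,a)\right) \le \epsilon \;\; \forall (s,a)}} \; \mathbb{E}^{\pi}_{P}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right],
\]

\[
V(s) \;=\; \max_{a} \; \min_{P \,:\, W_p\!\left(P,\, \bar{P}(\cdot \mid s,a)\right) \le \epsilon} \left\{ r(s,a) + \gamma \, \mathbb{E}_{s' \sim P}\!\left[ V(s') \right] \right\}.
\]

Under this kind of formulation, the inner minimization ranges over transition kernels (an infinite-dimensional object); the abstract's reduction to a "finite-dimensional risk-aware problem" corresponds, roughly, to rewriting that inner minimization as a perturbation of the sampled next state within a cost budget, which is what makes a practical actor-critic scheme such as WRAAC feasible.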