Paper Title
WD3: Taming the Estimation Bias in Deep Reinforcement Learning
Paper Authors
Paper Abstract
The overestimation phenomenon caused by function approximation is a well-known issue in value-based reinforcement learning algorithms such as deep Q-networks and DDPG, and it can lead to suboptimal policies. To address this issue, TD3 takes the minimum value between a pair of critics. In this paper, we show that the TD3 algorithm introduces underestimation bias under mild assumptions. To obtain a more precise estimate of the value function, we unify these two opposites and propose a novel algorithm, Weighted Delayed Deep Deterministic Policy Gradient (WD3), which can eliminate the estimation bias and further improve performance by weighting a pair of critics. To demonstrate the effectiveness of WD3, we compare the learning process of the value function among DDPG, TD3, and WD3. The results verify that our algorithm does eliminate the estimation error of value functions. Furthermore, we evaluate our algorithm on continuous control tasks. We observe that in each test task, WD3 consistently outperforms, or at the very least matches, the state-of-the-art algorithms. Our code is available at https://sites.google.com/view/ictai20-wd3/.
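The contrast between taking the minimum of two critics and weighting them can be illustrated with a small simulation. The sketch below is not the authors' implementation: the abstract does not state the weighting rule, so the blend of the critics' minimum and their mean, the weight `beta`, the function names, and the independent zero-mean Gaussian noise model are all assumptions made for illustration. Under this toy noise model the minimum of two unbiased critics is biased downward (analytically by σ/√π for two independent Gaussians with standard deviation σ), and mixing in the mean pulls the target back toward the true value.

```python
import random
from statistics import fmean


def td3_target(reward, gamma, q1_next, q2_next):
    """Clipped double-Q target: bootstrap from the smaller of the two critics."""
    return reward + gamma * min(q1_next, q2_next)


def weighted_target(reward, gamma, q1_next, q2_next, beta):
    """Hypothetical weighted target (an assumption, not taken from the abstract).

    Blends the critics' minimum with their mean. beta = 1 recovers the
    TD3-style target; beta = 0 uses the plain average of the two critics.
    """
    q_min = min(q1_next, q2_next)
    q_avg = 0.5 * (q1_next + q2_next)
    return reward + gamma * (beta * q_min + (1.0 - beta) * q_avg)


def simulate_bias(beta, true_q=1.0, noise_std=1.0, n_samples=100_000, seed=0):
    """Return (min-of-two bias, weighted bias) relative to the true value.

    Each critic is modelled as the true value plus independent zero-mean
    Gaussian noise, which is an illustrative assumption only.
    """
    rng = random.Random(seed)
    reward, gamma = 0.0, 1.0  # isolate the bootstrap term of the target
    td3_vals, weighted_vals = [], []
    for _ in range(n_samples):
        q1 = true_q + rng.gauss(0.0, noise_std)
        q2 = true_q + rng.gauss(0.0, noise_std)
        td3_vals.append(td3_target(reward, gamma, q1, q2))
        weighted_vals.append(weighted_target(reward, gamma, q1, q2, beta))
    return fmean(td3_vals) - true_q, fmean(weighted_vals) - true_q


if __name__ == "__main__":
    td3_bias, weighted_bias = simulate_bias(beta=0.5)
    print(f"min-of-two bias: {td3_bias:+.3f}")
    print(f"weighted bias:   {weighted_bias:+.3f}")
```

In this toy setting the critics are unbiased, so the weighted target is still slightly low. When the individual critics overestimate, as the abstract describes for function approximation, a suitably chosen weight could in principle offset the two biases against each other; how that weight is chosen in WD3 is not stated in this abstract.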