Paper Title

Recursive Least Squares Advantage Actor-Critic Algorithms

Authors

Yuan Wang, Chunyuan Zhang, Tianzong Yu, Meng Ma

Abstract

As an important algorithm in deep reinforcement learning, advantage actor-critic (A2C) has been widely successful in both discrete and continuous control tasks with raw pixel inputs, but its sample efficiency still needs improvement. In traditional reinforcement learning, actor-critic algorithms generally use the recursive least squares (RLS) technique to update the parameters of linear function approximators and accelerate their convergence. However, A2C algorithms seldom use this technique to train deep neural networks (DNNs) and improve their sample efficiency. In this paper, we propose two novel RLS-based A2C algorithms and investigate their performance. Both proposed algorithms, called RLSSA2C and RLSNA2C, use the RLS method to train the critic network and the hidden layers of the actor network. The main difference between them lies in the policy learning step. RLSSA2C uses an ordinary first-order gradient descent algorithm and the standard policy gradient to learn the policy parameters. RLSNA2C uses the Kronecker-factored approximation, the RLS method, and the natural policy gradient to learn the compatible parameters and the policy parameters. In addition, we analyze the complexity and convergence of both algorithms, and present three tricks for further improving their convergence speed. Finally, we demonstrate the effectiveness of both algorithms on 40 games in the Atari 2600 environment and 11 tasks in the MuJoCo environment. The experimental results show that both of our algorithms have better sample efficiency than the vanilla A2C on most games or tasks, and higher computational efficiency than two other state-of-the-art algorithms.
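For context, the standard tools the abstract refers to can be summarized as follows; this is a generic textbook sketch, not notation taken from the paper itself. The RLS recursion updates a linear approximator with a gain vector and a covariance-like matrix P_t, and the natural policy gradient preconditions the standard policy gradient with the inverse Fisher information matrix:

```latex
% Generic RLS recursion for a linear model \hat{y}_t = \theta_{t-1}^{\top} x_t,
% with forgetting factor \lambda and gain matrix P_t (textbook form, not the paper's own notation):
k_t = \frac{P_{t-1} x_t}{\lambda + x_t^{\top} P_{t-1} x_t}, \qquad
e_t = y_t - \theta_{t-1}^{\top} x_t, \qquad
\theta_t = \theta_{t-1} + k_t e_t, \qquad
P_t = \tfrac{1}{\lambda}\bigl(P_{t-1} - k_t x_t^{\top} P_{t-1}\bigr).

% Standard policy gradient (used by RLSSA2C) vs. natural policy gradient (used by RLSNA2C),
% where A^{\pi}(s,a) is the advantage and F(\theta) is the Fisher information matrix:
\nabla_{\theta} J(\theta) = \mathbb{E}\bigl[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, A^{\pi}(s,a)\bigr], \qquad
\tilde{\nabla}_{\theta} J(\theta) = F(\theta)^{-1} \nabla_{\theta} J(\theta).
```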
