Paper Title
Towards Understanding Asynchronous Advantage Actor-critic: Convergence and Linear Speedup
Paper Authors
Paper Abstract
Asynchronous and parallel implementation of standard reinforcement learning (RL) algorithms is a key enabler of the tremendous success of modern RL. Among the many asynchronous RL algorithms, arguably the most popular and effective one is the asynchronous advantage actor-critic (A3C) algorithm. Although A3C is becoming the workhorse of RL, its theoretical properties are still not well understood, including its non-asymptotic analysis and the performance gain of parallelism (a.k.a. linear speedup). This paper revisits the A3C algorithm and establishes its non-asymptotic convergence guarantees. Under both i.i.d. and Markovian sampling, we establish a local convergence guarantee for A3C in the general policy approximation case and a global convergence guarantee under softmax policy parameterization. Under i.i.d. sampling, A3C obtains a sample complexity of $\mathcal{O}(\epsilon^{-2.5}/N)$ per worker to achieve $\epsilon$ accuracy, where $N$ is the number of workers. Compared to the best-known sample complexity of $\mathcal{O}(\epsilon^{-2.5})$ for two-timescale AC, A3C achieves \emph{linear speedup}, which justifies the advantage of parallelism and asynchrony in AC algorithms theoretically for the first time. Numerical tests on a synthetic environment, OpenAI Gym environments, and Atari games are provided to verify our theoretical analysis.
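As a back-of-the-envelope illustration of the linear speedup claim (our own sketch based on the rates quoted in the abstract, not a derivation from the paper), the per-worker sample complexity can be related to the single-worker rate as follows:

\[
T_{\text{worker}} = \mathcal{O}\!\left(\frac{\epsilon^{-2.5}}{N}\right)
\quad\Longrightarrow\quad
N \cdot T_{\text{worker}} = \mathcal{O}\!\left(\epsilon^{-2.5}\right),
\]

so the total number of samples across all $N$ workers matches the best-known rate of single-worker two-timescale AC, while the samples (and hence wall-clock time) required by each individual worker shrink in proportion to $1/N$; this $1/N$ scaling is what is meant by linear speedup.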