Paper Title

Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms

Paper Authors

Tengyu Xu, Zhe Wang, Yingbin Liang

Paper Abstract

The actor-critic (AC) algorithm is a popular method for finding an optimal policy in reinforcement learning. In the infinite-horizon scenario, finite-sample convergence rates for the AC and natural actor-critic (NAC) algorithms have been established recently, but only under independent and identically distributed (i.i.d.) sampling with a single-sample update at each iteration. In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with mini-batch data at each iteration, and with the actor using general policy class approximation. We show that the overall sample complexity for mini-batch AC to attain an $\epsilon$-accurate stationary point improves the best known sample complexity of AC by an order of $\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$, and the overall sample complexity for mini-batch NAC to attain an $\epsilon$-accurate globally optimal point improves the existing sample complexity of NAC by an order of $\mathcal{O}(\epsilon^{-1}/\log(1/\epsilon))$. Moreover, the sample complexities of AC and NAC characterized in this work outperform those of policy gradient (PG) and natural policy gradient (NPG) by factors of $\mathcal{O}((1-\gamma)^{-3})$ and $\mathcal{O}((1-\gamma)^{-4}\epsilon^{-1}/\log(1/\epsilon))$, respectively. This is the first theoretical study establishing that AC and NAC attain orderwise performance improvements over PG and NPG in the infinite-horizon setting due to the incorporation of the critic.
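To make the mini-batch actor-critic scheme described in the abstract concrete, below is a minimal sketch on a toy two-state MDP, assuming a tabular softmax actor and a tabular TD(0) critic. The transition kernel `P`, reward table `R`, batch size `B`, and step sizes are illustrative assumptions for demonstration, not the paper's construction. A single trajectory is carried across iterations, reflecting Markovian (rather than i.i.d.) sampling:

```python
import numpy as np

# Illustrative sketch only: tabular mini-batch actor-critic under Markovian
# sampling on a toy 2-state, 2-action MDP. All quantities below (P, R, B,
# step sizes) are assumptions for demonstration.

rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9

# Transition kernel P[s, a] = distribution over next states; rewards R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

theta = np.zeros((nS, nA))      # actor: softmax policy parameters
V = np.zeros(nS)                # critic: tabular value estimates
alpha, beta, B = 0.05, 0.1, 32  # actor/critic step sizes, mini-batch size

def policy(s):
    """Softmax policy over actions in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0  # Markovian sampling: one trajectory carried across all iterations
for t in range(2000):
    # Collect one mini-batch of B transitions along the ongoing trajectory.
    batch = []
    for _ in range(B):
        a = rng.choice(nA, p=policy(s))
        s_next = rng.choice(nS, p=P[s, a])
        batch.append((s, a, R[s, a], s_next))
        s = s_next

    # Critic: averaged TD(0) update over the mini-batch.
    dV = np.zeros(nS)
    for (si, ai, ri, sn) in batch:
        dV[si] += ri + gamma * V[sn] - V[si]
    V += beta * dV / B

    # Actor: policy-gradient step using the critic's TD error as the
    # advantage estimate; grad log pi for softmax is (e_a - pi).
    dtheta = np.zeros_like(theta)
    for (si, ai, ri, sn) in batch:
        delta = ri + gamma * V[sn] - V[si]
        grad_logpi = -policy(si)
        grad_logpi[ai] += 1.0
        dtheta[si] += delta * grad_logpi
    theta += alpha * dtheta / B

print("learned greedy actions:", [int(np.argmax(policy(st))) for st in range(nS)])
```

The averaging over a mini-batch is the key design choice at play: it reduces the variance of both the critic's TD update and the actor's gradient estimate at each iteration, which is the mechanism behind the improved sample complexity bounds the abstract reports.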
