Title

Importance Sampling Placement in Off-Policy Temporal-Difference Methods

Authors

Eric Graves, Sina Ghiassian

Abstract

A central challenge to applying many off-policy reinforcement learning algorithms to real-world problems is the variance introduced by importance sampling. In off-policy learning, the agent learns about a different policy than the one being executed. To account for the difference, importance sampling ratios are often used, but they can increase the variance of the algorithms and reduce the rate of learning. Several variations of importance sampling have been proposed to reduce variance, with per-decision importance sampling being the most popular. However, the update rules for most off-policy algorithms in the literature depart from per-decision importance sampling in a subtle way; they correct the entire TD error instead of just the TD target. In this work, we show how this slight change can be interpreted as a control variate for the TD target, reducing variance and improving performance. Experiments over a wide range of algorithms show that this subtle modification results in improved performance.
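To make the distinction concrete, the following is a minimal illustrative sketch for off-policy TD(0) with value estimate $\hat{v}(\cdot, w)$, step size $\alpha$, discount $\gamma$, and per-step importance sampling ratio $\rho_t = \pi(A_t \mid S_t)/b(A_t \mid S_t)$; the notation is conventional and is not reproduced from the paper's own equations.

Per-decision importance sampling applied only to the TD target:
\[
w \leftarrow w + \alpha \left( \rho_t \bigl( R_{t+1} + \gamma \hat{v}(S_{t+1}, w) \bigr) - \hat{v}(S_t, w) \right) \nabla \hat{v}(S_t, w)
\]

Importance sampling applied to the entire TD error, as in most update rules in the literature:
\[
w \leftarrow w + \alpha\, \rho_t \bigl( R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \bigr) \nabla \hat{v}(S_t, w)
\]

The two updates differ by the term $\alpha (1 - \rho_t)\, \hat{v}(S_t, w)\, \nabla \hat{v}(S_t, w)$. Because $\mathbb{E}_b[\rho_t \mid S_t] = 1$, this term has zero conditional expectation, so it leaves the expected update unchanged while it can reduce variance; this is the control-variate interpretation of the full-TD-error correction referred to in the abstract.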
