Paper Title
Active Measure Reinforcement Learning for Observation Cost Minimization
Paper Authors
Paper Abstract
Standard reinforcement learning (RL) algorithms assume that the observation of the next state arrives instantaneously and at no cost. In a wide variety of sequential decision-making tasks ranging from medical treatment to scientific discovery, however, multiple classes of state observations are possible, each with an associated cost. We propose the active measure RL framework (Amrl) as an initial solution to this problem, in which the agent learns to maximize the costed return, which we define as the discounted sum of rewards minus the sum of observation costs. Our empirical evaluation demonstrates that Amrl-Q agents are able to learn a policy and a state estimator in parallel during online training. During training, the agent naturally shifts its reliance from costly measurements of the environment to its state estimator in order to increase its reward, and it does so without harm to the learned policy. Our results show that the Amrl-Q agent learns at a rate similar to standard Q-learning and Dyna-Q. Critically, by utilizing an active strategy, Amrl-Q achieves a higher costed return.
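A note on the costed return: the formula below simply transcribes the abstract's verbal definition (the discounted sum of rewards minus the sum of observation costs); the symbols G_c for the costed return and c_t for the per-step observation cost (zero at steps where the agent relies on its estimator rather than measuring) are our own notation, not necessarily the paper's.

    G_c = \sum_{t=0}^{T} \gamma^{t} r_t \; - \; \sum_{t=0}^{T} c_t

As a rough illustration of the mechanism the abstract describes, a Q-learner that at each step chooses either to pay for a real observation or to fall back on a state estimator learned in parallel, here is a minimal tabular sketch in Python. The class name AmrlQSketch, the augmented (action, measure) choice, the count-based transition model used as the state estimator, and all hyperparameters are assumptions made for illustration, not the authors' implementation.

import numpy as np

class AmrlQSketch:
    """Minimal sketch of an active-measure Q-learner (assumed design, not the paper's code)."""

    def __init__(self, n_states, n_actions, cost=0.1,
                 alpha=0.1, gamma=0.95, epsilon=0.1):
        self.cost = cost                      # price of one real observation
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        # Q-values over (state, base action, measure flag in {0, 1}).
        self.Q = np.zeros((n_states, n_actions, 2))
        # State estimator: tabular transition counts N[s, a, s'] with a Laplace prior.
        self.N = np.ones((n_states, n_actions, n_states))

    def act(self, s, rng):
        # Epsilon-greedy over the augmented action space: pick an action AND decide whether to measure.
        if rng.random() < self.epsilon:
            return int(rng.integers(self.Q.shape[1])), int(rng.integers(2))
        a, m = np.unravel_index(np.argmax(self.Q[s]), self.Q[s].shape)
        return int(a), int(m)

    def predict_next(self, s, a):
        # State estimator's guess: most likely next state under the count model.
        return int(np.argmax(self.N[s, a]))

    def update(self, s, a, m, r, observed_next=None):
        # The environment transitions either way; observed_next is available
        # (and paid for) only when the agent chose to measure (m == 1).
        r_costed = r - self.cost * m
        if m == 1:
            s_next = observed_next
            self.N[s, a, observed_next] += 1   # refine the state estimator
        else:
            s_next = self.predict_next(s, a)   # rely on the estimator instead
        td_target = r_costed + self.gamma * self.Q[s_next].max()
        self.Q[s, a, m] += self.alpha * (td_target - self.Q[s, a, m])
        return s_next                          # the agent's working estimate of its state

# One training step might look like (env is any discrete-state environment,
# rng = np.random.default_rng()):
#   a, m = agent.act(s, rng)
#   r, s_true = env.step(a)
#   s = agent.update(s, a, m, r, s_true if m == 1 else None)

Under a scheme of this kind, measuring is worth its cost early in training, while the estimator is still poor, and becomes less attractive as the estimator improves, which is consistent with the shift away from costly measurements described in the abstract.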