Paper Title
Online Learning of Non-Markovian Reward Models
Paper Authors
Paper Abstract
There are situations in which an agent should receive rewards only after having accomplished a series of previous tasks, that is, rewards are non-Markovian. One natural and quite general way to represent history-dependent rewards is via a Mealy machine, a finite state automaton that produces output sequences from input sequences. In our formal setting, we consider a Markov decision process (MDP) that models the dynamics of the environment in which the agent evolves and a Mealy machine synchronized with this MDP to formalize the non-Markovian reward function. While the MDP is known by the agent, the reward function is unknown to the agent and must be learned. Our approach to overcome this challenge is to use Angluin's $L^*$ active learning algorithm to learn a Mealy machine representing the underlying non-Markovian reward machine (MRM). Formal methods are used to determine the optimal strategy for answering so-called membership queries posed by $L^*$. Moreover, we prove that the expected reward achieved will eventually be at least as much as a given, reasonable value provided by a domain expert. We evaluate our framework on three problems. The results show that using $L^*$ to learn an MRM in a non-Markovian reward decision process is effective.
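To make the role of the Mealy machine concrete, the following is a minimal, hypothetical Python sketch (not code from the paper) of a reward machine that maps an input sequence of observation labels emitted by the MDP to an output sequence of rewards. All names here (MealyRewardMachine, rewards_for, the labels "key", "door", "empty") are illustrative assumptions.

```python
# Minimal sketch of a Mealy-machine-style reward machine: a finite set of
# states, a transition function, and an output (reward) function.
class MealyRewardMachine:
    def __init__(self, initial_state, transitions, outputs):
        # transitions: dict mapping (state, observation) -> next state
        # outputs:     dict mapping (state, observation) -> reward
        self.initial_state = initial_state
        self.transitions = transitions
        self.outputs = outputs

    def rewards_for(self, observation_sequence):
        """Feed an input sequence of observation labels and return the
        corresponding output sequence of rewards."""
        state = self.initial_state
        rewards = []
        for obs in observation_sequence:
            rewards.append(self.outputs[(state, obs)])
            state = self.transitions[(state, obs)]
        return rewards


# Toy example of a history-dependent reward: a reward of 1 is produced only
# when "door" is observed after "key" has already been observed, so the
# reward cannot be expressed as a function of the current observation alone.
mrm = MealyRewardMachine(
    initial_state="no_key",
    transitions={
        ("no_key", "key"): "has_key", ("no_key", "door"): "no_key",
        ("no_key", "empty"): "no_key",
        ("has_key", "key"): "has_key", ("has_key", "door"): "done",
        ("has_key", "empty"): "has_key",
        ("done", "key"): "done", ("done", "door"): "done",
        ("done", "empty"): "done",
    },
    outputs={
        ("no_key", "key"): 0, ("no_key", "door"): 0, ("no_key", "empty"): 0,
        ("has_key", "key"): 0, ("has_key", "door"): 1,  # reward only here
        ("has_key", "empty"): 0,
        ("done", "key"): 0, ("done", "door"): 0, ("done", "empty"): 0,
    },
)

print(mrm.rewards_for(["door", "key", "door"]))  # prints [0, 0, 1]
```

In the paper's setting this machine is unknown to the agent; Angluin's $L^*$ algorithm would recover such a machine from membership queries, i.e., from observed input/output (label/reward) sequences.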