Paper Title
Agent-State Construction with Auxiliary Inputs
Paper Authors
Paper Abstract
In many, if not every, realistic sequential decision-making task, the decision-making agent is not able to model the full complexity of the world. The environment is often much larger and more complex than the agent, a setting also known as partial observability. In such settings, the agent must leverage more than just the current sensory inputs; it must construct an agent state that summarizes previous interactions with the world. Currently, a popular approach for tackling this problem is to learn the agent-state function via a recurrent network that takes the agent's sensory stream as input. Many impressive reinforcement learning applications have instead relied on environment-specific functions that augment the agent's inputs for history summarization. These augmentations are done in multiple ways, from simple approaches like concatenating observations to more complex ones such as uncertainty estimates. Although ubiquitous in the field, these additional inputs, which we term auxiliary inputs, are rarely emphasized, and it is not clear what their role or impact is. In this work, we explore this idea further and relate these auxiliary inputs to prior classic approaches to state construction. We present a series of examples illustrating the different ways of using auxiliary inputs for reinforcement learning. We show that these auxiliary inputs can be used to discriminate between observations that would otherwise be aliased, leading to more expressive features that smoothly interpolate between different states. Finally, we show that this approach is complementary to state-of-the-art methods such as recurrent neural networks and truncated back-propagation through time, and acts as a heuristic that facilitates longer temporal credit assignment, leading to better performance.
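To make the notion of auxiliary inputs concrete, the sketch below (Python/NumPy; illustrative code, not from the paper) shows two simple ways to augment the current observation before it reaches the agent's network: concatenating the k most recent observations, and an exponentially decaying trace of past observations as one classic form of history summary. The class names, the window size k, and the decay parameter are assumptions made for illustration.

import numpy as np

# Illustrative sketch of auxiliary inputs; assumes flat observation vectors.

class FrameStack:
    # Concatenates the k most recent observations into a single agent input.
    def __init__(self, obs_dim, k):
        self.buffer = np.zeros((k, obs_dim))

    def update(self, obs):
        self.buffer = np.roll(self.buffer, shift=1, axis=0)  # age old frames
        self.buffer[0] = obs                                 # newest frame first
        return self.buffer.ravel()  # [o_t, o_{t-1}, ..., o_{t-k+1}]

class ExponentialTrace:
    # Summarizes history as an exponentially decaying trace of observations,
    # returned alongside the current observation as an extra input.
    def __init__(self, obs_dim, decay=0.9):  # decay value is an assumption
        self.decay = decay
        self.trace = np.zeros(obs_dim)

    def update(self, obs):
        self.trace = self.decay * self.trace + (1.0 - self.decay) * np.asarray(obs)
        return np.concatenate([obs, self.trace])  # observation + auxiliary input

For example, feeding ExponentialTrace(obs_dim=4).update(obs) to a value network at every step gives the learner a smooth, decaying summary of recent observations, which is one way such inputs can help discriminate between observations that would otherwise be aliased.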