Paper title
Active Imitation Learning from Multiple Non-Deterministic Teachers: Formulation, Challenges, and Algorithms
Paper authors
Paper abstract
We formulate the problem of learning to imitate multiple, non-deterministic teachers with minimal interaction cost. Rather than learning a specific policy as in standard imitation learning, the goal in this problem is to learn a distribution over a policy space. We first present a general framework that efficiently models and estimates such a distribution by learning continuous representations of the teacher policies. Next, we develop Active Performance-Based Imitation Learning (APIL), an active learning algorithm for reducing the learner-teacher interaction cost in this framework. By making query decisions based on predictions of future progress, our algorithm avoids the pitfalls of traditional uncertainty-based approaches in the face of teacher behavioral uncertainty. Results on both toy and photo-realistic navigation tasks show that APIL significantly reduces the number of interactions with teachers without compromising performance. Moreover, it is robust to various degrees of teacher behavioral uncertainty.
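To make the contrast between uncertainty-based and progress-based querying concrete, here is a minimal sketch. It is not the paper's APIL implementation: the function names, the thresholds, and the `predicted_progress` value are all hypothetical and chosen only for illustration. It considers a state where two teachers prefer different but equally good actions, so a learner imitating both ends up with a high-entropy action distribution even though it needs no help.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def uncertainty_query(action_probs, threshold=0.5):
    """Uncertainty-based rule: query a teacher when entropy is high.

    The threshold is a hypothetical value picked for this example.
    """
    return entropy(action_probs) > threshold


def progress_query(predicted_progress, min_progress=0.0):
    """Progress-based rule (illustrative only): query a teacher when the
    learner is predicted to make no progress on its own.

    `predicted_progress` stands in for the output of some learned
    progress predictor, which is assumed here, not specified by the source.
    """
    return predicted_progress <= min_progress


# Two teachers each favor a different, equally good action, so the
# imitating learner spreads its probability mass evenly between them.
action_probs = [0.5, 0.5]

# Hypothetical predictor output: either action moves the learner forward.
predicted_progress = 1.0

needs_help_uncertainty = uncertainty_query(action_probs)
needs_help_progress = progress_query(predicted_progress)

print("uncertainty-based rule queries:", needs_help_uncertainty)
print("progress-based rule queries:", needs_help_progress)
```

In this toy setting the entropy rule asks for help because the teachers disagree, while the progress rule does not, since the learner is expected to advance either way. This is the kind of unnecessary query the abstract attributes to uncertainty-based approaches under teacher behavioral uncertainty.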