Paper Title


Novel Policy Seeking with Constrained Optimization

Authors

Hao Sun, Zhenghao Peng, Bo Dai, Jian Guo, Dahua Lin, Bolei Zhou

Abstract


In problem-solving, we humans can come up with multiple novel solutions to the same problem. However, reinforcement learning algorithms can only produce a set of monotonous policies that maximize the cumulative reward but lack diversity and novelty. In this work, we address the problem of generating novel policies in reinforcement learning tasks. Instead of following the multi-objective framework used in existing methods, we propose to rethink the problem under a novel perspective of constrained optimization. We first introduce a new metric to evaluate the difference between policies and then design two practical novel policy generation methods following the new perspective. The two proposed methods, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from the feasible direction method and the interior point method commonly known in the constrained optimization literature. Experimental comparisons on the MuJoCo control suite show our methods can achieve substantial improvement over previous novelty-seeking methods in terms of both the novelty of policies and their performances in the primal task.
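The constrained-optimization view the abstract describes can be illustrated with a toy sketch: define a novelty metric between a new policy and a reference policy, and treat "novelty above a threshold" as the feasibility constraint under which the task reward is maximized. Everything below (`policy_novelty`, `is_feasible`, the linear toy policies, the mean-squared-action metric, and the threshold value) is an illustrative assumption, not the paper's actual metric or algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(100, 4))  # toy batch of 4-dim states

# Two toy deterministic linear policies (weights are arbitrary examples).
ref_policy = lambda s: s @ np.array([1.0, 0.5, -0.3, 0.2])
new_policy = lambda s: s @ np.array([0.8, 0.6, -0.1, 0.4])

def policy_novelty(pi_a, pi_b, states):
    # Mean squared difference between the actions two policies take on
    # the same states -- a stand-in for the paper's difference metric.
    return float(np.mean((pi_a(states) - pi_b(states)) ** 2))

def is_feasible(novelty, threshold=0.01):
    # Constrained-optimization view: only policies whose novelty w.r.t.
    # the reference exceeds the threshold lie in the feasible region;
    # the primal task reward is then maximized inside that region,
    # rather than mixing novelty into a multi-objective reward.
    return novelty >= threshold

nov = policy_novelty(new_policy, ref_policy, states)
print(nov, is_feasible(nov))
```

A feasible-direction method (as in CTNB) would only follow gradient directions that keep this constraint satisfied, while an interior-point style method (as in IPD) would keep the optimization trajectory strictly inside the feasible region throughout training.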
