Paper title
Universal Policies for Software-Defined MDPs
Authors
Abstract
We introduce a new programming paradigm called oracle-guided decision programming in which a program specifies a Markov Decision Process (MDP) and the language provides a universal policy. We prototype a new programming language, Dodona, that manifests this paradigm using a primitive 'choose' representing nondeterministic choice. The Dodona interpreter returns either a value or a choicepoint that includes a lossless encoding of all information necessary in principle to make an optimal decision. Meta-interpreters query Dodona's (neural) oracle on these choicepoints to get policy and value estimates, which they can use to perform heuristic search on the underlying MDP. We demonstrate Dodona's potential for zero-shot heuristic guidance by meta-learning over hundreds of synthetic tasks that simulate basic operations over lists, trees, Church datastructures, polynomials, first-order terms and higher-order terms.
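To make the control flow in the abstract concrete, here is a minimal Python sketch of the idea, not Dodona's actual syntax or API. It assumes a program is modeled as a generator whose `yield n` plays the role of 'choose' among `n` options, that a choicepoint is identified by the trace of decisions taken so far, and that the oracle is a stub returning a uniform policy and a constant value. All names (`run`, `ChoicePoint`, `uniform_oracle`, `best_first_search`, `find_factors`) are hypothetical.

```python
import heapq
import math
from dataclasses import dataclass
from typing import Any, Callable, Generator, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Value:
    result: Any  # final result of the program; None is treated as failure here


@dataclass(frozen=True)
class ChoicePoint:
    trace: Tuple[int, ...]  # decisions taken so far (stand-in for a lossless encoding)
    n_options: int          # number of alternatives at this choice


Program = Callable[[], Generator[int, int, Any]]
Oracle = Callable[[ChoicePoint], Tuple[List[float], float]]


def run(program: Program, trace: Tuple[int, ...]) -> Union[Value, ChoicePoint]:
    """Replay `trace` through the program; return a Value or the next ChoicePoint."""
    gen = program()
    try:
        n = next(gen)                 # first 'choose'
        for decision in trace:
            n = gen.send(decision)    # answer it and advance to the next 'choose'
    except StopIteration as stop:
        return Value(stop.value)
    return ChoicePoint(tuple(trace), n)


def uniform_oracle(cp: ChoicePoint) -> Tuple[List[float], float]:
    """Stub for a learned oracle: uniform policy and a constant value estimate."""
    return [1.0 / cp.n_options] * cp.n_options, 0.5


def best_first_search(program: Program, oracle: Oracle,
                      max_expansions: int = 1000) -> Optional[Any]:
    """Meta-interpreter: expand traces in order of cumulative -log policy probability."""
    frontier: List[Tuple[float, Tuple[int, ...]]] = [(0.0, ())]
    for _ in range(max_expansions):
        if not frontier:
            return None
        cost, trace = heapq.heappop(frontier)
        outcome = run(program, trace)
        if isinstance(outcome, Value):
            if outcome.result is not None:
                return outcome.result
            continue  # failed branch, keep searching
        policy, _value = oracle(outcome)  # the value estimate is unused in this sketch
        for i, p in enumerate(policy):
            if p > 0:
                heapq.heappush(frontier, (cost - math.log(p), trace + (i,)))
    return None


def find_factors():
    """Toy program: choose x and y in 0..4 and succeed only if x * y == 12."""
    x = yield 5
    y = yield 5
    return (x, y) if x * y == 12 else None


if __name__ == "__main__":
    print(best_first_search(find_factors, uniform_oracle))
```

Replaying the trace from scratch on every expansion keeps the sketch short at the cost of repeated work. A better-informed oracle would concentrate probability on promising options and reduce the number of expansions the search needs.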