Paper Title
Will My Robot Achieve My Goals? Predicting the Probability that an MDP Policy Reaches a User-Specified Behavior Target
Paper Authors
Paper Abstract
As an autonomous system performs a task, it should maintain a calibrated estimate of the probability that it will achieve the user's goal. If that probability falls below some desired level, it should alert the user so that appropriate interventions can be made. This paper considers settings where the user's goal is specified as a target interval for a real-valued performance summary, such as the cumulative reward, measured at a fixed horizon $H$. At each time $t \in \{0, \ldots, H-1\}$, our method produces a calibrated estimate of the probability that the final cumulative reward will fall within a user-specified target interval $[y^-, y^+]$. Using this estimate, the autonomous system can raise an alarm if the probability drops below a specified threshold. We compute the probability estimates by inverting conformal prediction. Our starting point is the Conformalized Quantile Regression (CQR) method of Romano et al., which applies split-conformal prediction to the results of quantile regression. CQR is not invertible, but by using the conditional cumulative distribution function (CDF) as the non-conformity measure, we show how to obtain an invertible modification that we call Probability-space Conformalized Quantile Regression (PCQR). Like CQR, PCQR produces well-calibrated conditional prediction intervals with finite-sample marginal guarantees. By inverting PCQR, we obtain guarantees for the probability that the cumulative reward of an autonomous system will fall below a threshold sampled from the marginal distribution of the response variable (i.e., a calibrated CDF estimate) that we employ to predict coverage probabilities for user-specified target intervals. Experiments on two domains confirm that these probabilities are well-calibrated.
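The core idea of using the conditional CDF as a non-conformity score and inverting it to get a calibrated interval probability can be illustrated with a minimal sketch. This is not the paper's PCQR algorithm (which conformalizes quantile-regression outputs in the MDP prediction setting); it is a simplified split-conformal recalibration under an assumed conditional-CDF estimator, and the names `f_hat`, `calibrated_cdf`, and `interval_probability` are illustrative assumptions.

```python
import numpy as np
from math import erf, sqrt

def calibrated_cdf(f_hat, calib_x, calib_y, x, y):
    """Calibrated estimate of P(Y <= y | x) via probability-space scores.

    The non-conformity score of each calibration pair (x_i, y_i) is the raw
    conditional-CDF value f_hat(y_i, x_i); the calibrated CDF at a query
    point is the smoothed empirical fraction of calibration scores at or
    below the raw score f_hat(y, x).
    """
    scores = np.array([f_hat(yi, xi) for xi, yi in zip(calib_x, calib_y)])
    raw = f_hat(y, x)
    # +1 in numerator and denominator: the usual finite-sample adjustment
    # used in split-conformal prediction.
    return (np.sum(scores <= raw) + 1) / (len(scores) + 1)

def interval_probability(f_hat, calib_x, calib_y, x, y_lo, y_hi):
    """Calibrated estimate of P(y_lo <= Y <= y_hi | x) by differencing the CDF."""
    hi = calibrated_cdf(f_hat, calib_x, calib_y, x, y_hi)
    lo = calibrated_cdf(f_hat, calib_x, calib_y, x, y_lo)
    return max(hi - lo, 0.0)

# Toy check (assumed setup): Y | x ~ Normal(x, 1), and f_hat is the true
# conditional CDF, so the calibrated probability of [-1, 1] at x = 0
# should land near the Gaussian one-sigma mass of about 0.68.
f_hat = lambda y, x: 0.5 * (1.0 + erf((y - x) / sqrt(2.0)))
rng = np.random.default_rng(0)
calib_x = rng.normal(size=500)
calib_y = calib_x + rng.normal(size=500)
p = interval_probability(f_hat, calib_x, calib_y, x=0.0, y_lo=-1.0, y_hi=1.0)
```

Because the toy `f_hat` is already well-specified, the calibration step is close to the identity here; the recalibration matters when the fitted CDF is misspecified, which is the situation the abstract's guarantees address.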