Paper Title
Bandit Linear Control
Paper Authors
Abstract
We consider the problem of controlling a known linear dynamical system under stochastic noise, adversarially chosen costs, and bandit feedback. Unlike the full feedback setting where the entire cost function is revealed after each decision, here only the cost incurred by the learner is observed. We present a new and efficient algorithm that, for strongly convex and smooth costs, obtains regret that grows with the square root of the time horizon $T$. We also give extensions of this result to general convex, possibly non-smooth costs, and to non-stochastic system noise. A key component of our algorithm is a new technique for addressing bandit optimization of loss functions with memory.
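To make the bandit-feedback protocol concrete, here is a minimal sketch of one round of interaction. It is not the paper's algorithm and does not implement its technique for losses with memory. For brevity it assumes a scalar system $x_{t+1} = a x_t + b u_t + w_t$ in place of general matrices, a fixed linear policy $u_t = -k x_t$, Gaussian system noise, and cost functions fixed in advance by the adversary. The names `run_bandit_protocol`, `a`, `b`, `k`, and `noise_std` are hypothetical and chosen only for illustration. The point the sketch shows is that the learner receives a single number $c_t(x_t, u_t)$ each round, never the function $c_t$ itself.

```python
import random


def run_bandit_protocol(a, b, k, costs, noise_std=0.1, seed=0):
    """Hypothetical sketch: scalar known dynamics x_{t+1} = a*x_t + b*u_t + w_t
    under the fixed linear policy u_t = -k*x_t.

    costs: list of callables c_t(x, u), assumed chosen by the adversary in advance.
    Returns only the scalar costs the learner observes (bandit feedback).
    """
    rng = random.Random(seed)
    x = 0.0
    observed = []
    for c_t in costs:
        u = -k * x                      # learner's action from its current policy
        observed.append(c_t(x, u))      # bandit feedback: one number, not the function c_t
        w = rng.gauss(0.0, noise_std)   # stochastic system noise (assumed Gaussian)
        x = a * x + b * u + w           # known linear dynamics
    return observed
```

In the full-feedback setting the learner would instead receive each `c_t` and could evaluate it at other states and actions. Under bandit feedback only the values in `observed` are available, so any gradient information has to be estimated from these scalars. Because each cost also depends on past actions through the state, those estimates are harder to form, which is the issue the abstract's technique for losses with memory addresses.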