Paper Title

Federated Neural Bandits

Paper Authors

Zhongxiang Dai, Yao Shu, Arun Verma, Flint Xiaofeng Fan, Bryan Kian Hsiang Low, Patrick Jaillet

Paper Abstract

Recent works on neural contextual bandits have achieved compelling performances due to their ability to leverage the strong representation power of neural networks (NNs) for reward prediction. Many applications of contextual bandits involve multiple agents who collaborate without sharing raw observations, thus giving rise to the setting of federated contextual bandits. Existing works on federated contextual bandits rely on linear or kernelized bandits, which may fall short when modeling complex real-world reward functions. So, this paper introduces the federated neural-upper confidence bound (FN-UCB) algorithm. To better exploit the federated setting, FN-UCB adopts a weighted combination of two UCBs: $\text{UCB}^{a}$ allows every agent to additionally use the observations from the other agents to accelerate exploration (without sharing raw observations), while $\text{UCB}^{b}$ uses an NN with aggregated parameters for reward prediction in a similar way to federated averaging for supervised learning. Notably, the weight between the two UCBs required by our theoretical analysis is amenable to an interesting interpretation, which emphasizes $\text{UCB}^{a}$ initially for accelerated exploration and relies more on $\text{UCB}^{b}$ later after enough observations have been collected to train the NNs for accurate reward prediction (i.e., reliable exploitation). We prove sub-linear upper bounds on both the cumulative regret and the number of communication rounds of FN-UCB, and empirically demonstrate its competitive performance.
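
To make the weighted combination of the two UCBs concrete, below is a minimal, illustrative Python sketch of a per-round arm-selection step. The function and variable names (select_arm, ucb_a_fn, ucb_b_fn, alpha) and the stand-in score functions are hypothetical placeholders, not the paper's implementation; in FN-UCB the weight between the two UCBs is prescribed by the theoretical analysis (shifting emphasis from $\text{UCB}^{a}$ to $\text{UCB}^{b}$ as more observations are collected), whereas here it is simply passed in as a parameter.

```python
import numpy as np


def select_arm(contexts, ucb_a_fn, ucb_b_fn, alpha):
    """Choose the arm maximizing a weighted combination of two UCB scores.

    contexts : (num_arms, d) array of candidate arm contexts this round.
    ucb_a_fn : returns a UCB^a score per arm, built from statistics
               aggregated across agents (shared exploration, no raw data).
    ucb_b_fn : returns a UCB^b score per arm, based on the reward prediction
               of an NN with federated-averaged parameters.
    alpha    : weight in [0, 1]; larger values emphasize UCB^a.
    """
    scores = alpha * ucb_a_fn(contexts) + (1.0 - alpha) * ucb_b_fn(contexts)
    return int(np.argmax(scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    contexts = rng.normal(size=(10, 5))      # 10 candidate arms, d = 5
    theta = rng.normal(size=5)               # stand-in linear model for UCB^a
    ucb_a = lambda X: X @ theta + 0.5 * np.linalg.norm(X, axis=1)
    ucb_b = lambda X: np.tanh(X @ theta)     # stand-in NN prediction for UCB^b
    # Early rounds: a weight near 1 favors the shared-exploration term UCB^a;
    # later rounds: the weight decays so the NN-based UCB^b dominates.
    print(select_arm(contexts, ucb_a, ucb_b, alpha=0.9))
```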
