Paper Title
To Store or Not? Online Data Selection for Federated Learning with Limited Storage
Paper Authors
Paper Abstract
Machine learning models have been deployed in mobile networks to process massive data from different layers, enabling automated network management and on-device intelligence. To overcome the high communication cost and severe privacy concerns of centralized machine learning, federated learning (FL) has been proposed to achieve distributed machine learning among networked devices. While computation and communication limitations have been widely studied, the impact of on-device storage on FL performance remains unexplored. Without an effective data selection policy to filter the massive streaming data on devices, classical FL can suffer from much longer model training time ($4\times$) and a significant reduction in inference accuracy ($7\%$), as observed in our experiments. In this work, we take the first step toward online data selection for FL with limited on-device storage. We first define a new data valuation metric for data evaluation and selection in FL, with theoretical guarantees for simultaneously speeding up model convergence and enhancing final model accuracy. We further design {\ttfamily ODE}, a framework of \textbf{O}nline \textbf{D}ata s\textbf{E}lection for FL, to coordinate networked devices to store valuable data samples. Experimental results on one industrial dataset and three public datasets show the remarkable advantages of {\ttfamily ODE} over state-of-the-art approaches. In particular, on the industrial dataset, {\ttfamily ODE} achieves as much as a $2.5\times$ speedup in training time and a $6\%$ increase in inference accuracy, and is robust to various factors in practical environments.
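To make the setting concrete, below is a minimal, hypothetical sketch of online data selection with a fixed-capacity on-device buffer. It is not the paper's {\ttfamily ODE} algorithm or its data valuation metric (which the abstract does not specify); the gradient-norm score, the `OnDeviceBuffer` class, and all parameters are illustrative assumptions only.

```python
# Sketch only: per-sample gradient norm of a linear model's squared loss is a
# hypothetical stand-in for the paper's (unspecified here) valuation metric.
import heapq
import random

def valuation(x, y, w):
    """Hypothetical score: gradient norm of squared loss for a linear model."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    residual = pred - y
    grad = [2.0 * residual * xi for xi in x]
    return sum(g * g for g in grad) ** 0.5

class OnDeviceBuffer:
    """Fixed-capacity store that keeps the highest-valued samples seen so far."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.heap = []      # min-heap of (score, counter, sample)
        self.counter = 0    # tie-breaker so raw samples are never compared

    def offer(self, sample, score):
        """Decide online whether to store an incoming streaming sample."""
        self.counter += 1
        item = (score, self.counter, sample)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)     # buffer not full: always store
        elif score > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)  # evict the lowest-valued sample

# Toy usage: stream 1000 synthetic samples through a 50-slot buffer.
random.seed(0)
w = [0.5, -1.0, 2.0]                            # hypothetical current model
buffer = OnDeviceBuffer(capacity=50)
for _ in range(1000):
    x = [random.gauss(0, 1) for _ in range(3)]
    y = random.gauss(0, 1)
    buffer.offer((x, y), valuation(x, y, w))
print(len(buffer.heap), "samples retained")
```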