Paper Title

Accelerating Deep Reinforcement Learning With the Aid of Partial Model: Energy-Efficient Predictive Video Streaming

Paper Authors

Dong Liu, Jianyu Zhao, Chenyang Yang, Lajos Hanzo

Paper Abstract

Predictive power allocation is conceived for energy-efficient video streaming over mobile networks using deep reinforcement learning. The goal is to minimize the accumulated energy consumption of each base station over a complete video streaming session, subject to the constraint of avoiding video playback interruptions. To handle the continuous state and action spaces, we resort to the deep deterministic policy gradient (DDPG) algorithm for solving the formulated problem. In contrast to previous predictive power allocation policies, which first predict future information from historical data and then optimize the power allocation based on the predicted information, the proposed policy operates in an online, end-to-end manner. By judiciously designing the action and state to depend only on the slowly varying average channel gains, we reduce the signaling overhead between the edge server and the base stations and make it easier to learn a good policy. To avoid playback interruptions throughout the learning process and further improve the convergence speed, we exploit the partially known model of the system dynamics by integrating the concepts of a safety layer, post-decision state, and virtual experiences into the basic DDPG algorithm. Our simulation results show that the proposed policies converge to the optimal policy derived under perfect large-scale channel prediction, and outperform the first-predict-then-optimize policy in the presence of prediction errors. By harnessing the partially known model, the convergence speed can be dramatically improved.
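The abstract names a DDPG agent augmented with a safety layer that keeps actions feasible during learning. As a rough illustration only, below is a minimal sketch of a DDPG update with a stand-in safety projection, written in PyTorch. The state/action dimensions, network sizes, and the clamp-based feasibility bound are hypothetical placeholders, not the paper's actual state/action design or its model-based constraint; the paper derives its safety layer analytically from the partially known system dynamics.

```python
# Minimal, illustrative DDPG update with a stand-in "safety layer".
# All names, dimensions, and the feasibility bound are hypothetical.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 1  # hypothetical: channel/buffer features -> transmit power

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, critic = mlp(STATE_DIM, ACTION_DIM), mlp(STATE_DIM + ACTION_DIM, 1)
actor_tgt = mlp(STATE_DIM, ACTION_DIM); actor_tgt.load_state_dict(actor.state_dict())
critic_tgt = mlp(STATE_DIM + ACTION_DIM, 1); critic_tgt.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
GAMMA, TAU = 0.99, 0.005

def safety_layer(state, action):
    """Stand-in safety projection: push the raw action into a state-dependent
    feasible interval so the (hypothetical) playback constraint holds even
    while the policy is still exploring."""
    min_power = state[:, :1].abs()  # hypothetical lower bound from the buffer state
    return torch.maximum(action, min_power).clamp(max=1.0)

def ddpg_update(batch):
    s, a, r, s2 = batch               # states, safe actions, rewards, next states
    with torch.no_grad():             # target networks give a stable TD target
        a2 = safety_layer(s2, actor_tgt(s2))
        y = r + GAMMA * critic_tgt(torch.cat([s2, a2], dim=1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Deterministic policy gradient through the (sub)differentiable projection.
    actor_loss = -critic(torch.cat([s, safety_layer(s, actor(s))], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    for net, tgt in ((actor, actor_tgt), (critic, critic_tgt)):
        for p, pt in zip(net.parameters(), tgt.parameters()):
            pt.data.mul_(1 - TAU).add_(TAU * p.data)  # Polyak averaging

# Usage with random data, just to show the shapes involved:
batch = (torch.randn(32, STATE_DIM), torch.rand(32, ACTION_DIM),
         torch.randn(32, 1), torch.randn(32, STATE_DIM))
ddpg_update(batch)
```

The same partially known model that defines the projection is what the paper leverages for post-decision states and virtual experiences, i.e., synthetic transitions generated from the known part of the dynamics to speed up convergence; those components are not reproduced in this sketch.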
