Cloud-Edge Training Architecture for Sim-to-Real Deep Reinforcement Learning

Paper Title

Cloud-Edge Training Architecture for Sim-to-Real Deep Reinforcement Learning

Authors

Hongpeng Cao, Mirco Theile, Federico G. Wyrwal, Marco Caccamo

Abstract

Deep reinforcement learning (DRL) is a promising approach to solve complex control tasks by learning policies through interactions with the environment. However, the training of DRL policies requires large amounts of training experiences, making it impractical to learn the policy directly on physical systems. Sim-to-real approaches leverage simulations to pretrain DRL policies and then deploy them in the real world. Unfortunately, the direct real-world deployment of pretrained policies usually suffers from performance deterioration due to the different dynamics, known as the reality gap. Recent sim-to-real methods, such as domain randomization and domain adaptation, focus on improving the robustness of the pretrained agents. Nevertheless, the simulation-trained policies often need to be tuned with real-world data to reach optimal performance, which is challenging due to the high cost of real-world samples. This work proposes a distributed cloud-edge architecture to train DRL agents in the real world in real-time. In the architecture, the inference and training are assigned to the edge and cloud, separating the real-time control loop from the computationally expensive training loop. To overcome the reality gap, our architecture exploits sim-to-real transfer strategies to continue the training of simulation-pretrained agents on a physical system. We demonstrate its applicability on a physical inverted-pendulum control system, analyzing critical parameters. The real-world experiments show that our architecture can adapt the pretrained DRL agents to unseen dynamics consistently and efficiently.
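
The split described in the abstract, with inference on the edge and training in the cloud, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy plant, the scalar `gain` standing in for policy weights, `placeholder_update`, and the use of in-process queues instead of a network link are all assumptions made for illustration. The only point it shows is that the control loop polls for new weights without blocking, so it never waits on the trainer.

```python
import queue
import threading


def placeholder_update(gain, batch):
    """Hypothetical stand-in for a DRL training step (not the paper's algorithm)."""
    loss = sum(abs(s_next) for _, _, s_next in batch) / len(batch)
    return gain + 0.01 * loss


def cloud_trainer(transitions, weight_updates, stats, batch_size=8, gain=0.5):
    """Cloud side: consume experience, train, publish new weights."""
    buffer = []
    while True:
        item = transitions.get()          # blocking is fine on the training side
        if item is None:                  # sentinel: the edge has finished
            break
        buffer.append(item)
        stats["transitions_received"] += 1
        if len(buffer) % batch_size == 0:
            gain = placeholder_update(gain, buffer[-batch_size:])
            weight_updates.put(gain)
            stats["weights_published"] += 1


def edge_control_loop(transitions, weight_updates, stats, steps, gain=0.5):
    """Edge side: act with the newest weights available and stream experience."""
    state = 1.0
    for _ in range(steps):
        try:
            # Non-blocking poll keeps the control loop independent of training.
            gain = weight_updates.get_nowait()
            stats["updates_applied"] += 1
        except queue.Empty:
            pass
        action = -gain * state
        next_state = 0.9 * state + 0.1 * action   # toy plant, an assumption
        transitions.put((state, action, next_state))
        state = next_state
    transitions.put(None)


def run_demo(steps=40):
    transitions, weight_updates = queue.Queue(), queue.Queue()
    stats = {"transitions_received": 0, "weights_published": 0, "updates_applied": 0}
    trainer = threading.Thread(
        target=cloud_trainer, args=(transitions, weight_updates, stats)
    )
    trainer.start()
    edge_control_loop(transitions, weight_updates, stats, steps)
    trainer.join()
    return stats


if __name__ == "__main__":
    print(run_demo())
```

How many weight updates the edge loop actually applies depends on thread timing. That is expected in this kind of decoupled design, where the controller simply uses the most recent weights it has received.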
