Title
A Hierarchical Deep Reinforcement Learning Framework for 6-DOF UCAV Air-to-Air Combat
Authors
Abstract
Unmanned combat air vehicle (UCAV) combat is a challenging scenario with a continuous action space. In this paper, we propose a general hierarchical framework to solve the within-visual-range (WVR) air-to-air combat problem under six-degree-of-freedom (6-DOF) dynamics. The core idea is to divide the whole decision process into two loops and solve them separately with reinforcement learning (RL). The outer loop assesses the current combat situation and decides the expected macro behavior of the aircraft according to a combat strategy. The inner loop then tracks the macro behavior with a flight controller by computing the actual input signals for the aircraft. We design Markov decision processes for both the outer-loop strategy and the inner-loop controller, and train them with the proximal policy optimization (PPO) algorithm. For the inner-loop controller, we design an effective reward function to accurately track various macro behaviors. For the outer-loop strategy, we further adopt a fictitious self-play mechanism that improves combat performance by constantly fighting against historical strategies. Experimental results show that the inner-loop controller achieves better tracking performance than a fine-tuned PID controller, and that the outer-loop strategy learns complex maneuvers, achieving an increasingly higher winning rate as the generations evolve.
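The two-loop decomposition described above can be sketched as follows. This is an illustrative outline only, not the paper's implementation: all class names, state fields, and the placeholder decision rules are hypothetical stand-ins for the PPO-trained policies.

```python
class OuterLoopStrategy:
    """Outer loop: maps the combat situation to an expected macro
    behavior (here, a target attitude/speed command). In the paper this
    is a PPO-trained strategy; a fixed command stands in for it here."""

    def act(self, combat_state: dict) -> dict:
        # Hypothetical placeholder for the learned combat strategy.
        return {"roll_cmd": 0.0, "pitch_cmd": 0.1, "speed_cmd": 250.0}


class InnerLoopController:
    """Inner loop: tracks the macro behavior by computing the actual
    input signals for the 6-DOF aircraft. In the paper this is also
    PPO-trained; simple proportional tracking stands in for it here."""

    def act(self, flight_state: dict, macro_behavior: dict) -> dict:
        # Hypothetical placeholder: drive each tracked quantity toward
        # its commanded value with a fixed gain.
        return {
            key: 0.1 * (cmd - flight_state.get(key, 0.0))
            for key, cmd in macro_behavior.items()
        }


def decision_step(strategy: OuterLoopStrategy,
                  controller: InnerLoopController,
                  combat_state: dict,
                  flight_state: dict) -> dict:
    """One hierarchical decision step: the outer loop chooses *what* to
    do, the inner loop computes *how* to do it."""
    macro = strategy.act(combat_state)
    return controller.act(flight_state, macro)
```

The point of the hierarchy is that each loop faces a simpler learning problem: the outer policy reasons over combat geometry without handling low-level dynamics, while the inner controller only needs to track commands, independent of the opponent.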