Literature Notes - Reinforcement Learning for UAV Attitude Control


This post contains notes I took while reading the paper. It is only a rough translation and summary, intended for personal reference, study, and sharing.

If the paper's authors find anything here inappropriate, please contact me and I will handle it promptly.

I am not the author of the paper. The citation is given below; the original text is more valuable.

Koch W, Mancuso R, West R, et al. Reinforcement learning for UAV attitude control[J]. ACM Transactions on Cyber-Physical Systems, 2019, 3(2): 1-21.

Abstract

Autopilot systems typically consist of an "inner loop" that provides stability and control, and an "outer loop" that handles mission-level tasks such as waypoint navigation. UAV autopilots are mostly built on PID control, which performs reasonably well in stable environments. However, less predictable and more complex environments call for more sophisticated controllers. Intelligent flight control is an active research area that uses reinforcement learning (RL) to address problems PID cannot solve, and RL has already made good progress in other fields such as robotics. Previous work, though, has focused on applying RL to mission-level control. In this paper, we explore using current RL training methods for inner-loop control, namely DDPG, TRPO, and PPO. To study this, we first develop an open-source, high-fidelity simulation environment for training a quadrotor attitude controller with RL. We then use this environment to compare RL-trained controllers against a PID controller in terms of speed and accuracy.

Conclusions

i) RL can be used to train accurate attitude controllers.

ii) The controller trained with PPO outperforms a well-tuned PID controller on nearly every metric.

Although trained on episodic tasks, the controllers also perform well on tasks they were never trained on, indicating that episodic training is sufficient for developing intelligent attitude control.

I. INTRODUCTION

Using RL it is possible to develop optimal control policies for a UAV without making any assumptions about the aircraft dynamics. Recent work has shown RL to be effective for UAV autopilots, providing adequate path tracking [8].

II. BACKGROUND

A. Quadcopter Flight Dynamics
B. Reinforcement Learning

III. RELATED WORK

However, these solutions still inherit disadvantages associated with PID control, such as integral windup and the need for mixing; most significantly, they are feedback controllers and therefore inherently reactive. Feedforward (or predictive) control, on the other hand, is proactive and allows the controller to output control signals before an error occurs. Feedforward control requires a model of the system. Learning-based intelligent control has been proposed to develop models of the aircraft for predictive control using artificial neural networks.
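Since the discussion above hinges on how a feedback PID loop reacts to error and why integral windup matters, here is a minimal sketch of a single PID update with a clamped integral term. The gains, time step, and clamp limit are illustrative placeholders, not values from the paper.

```python
def pid_step(error, state, kp, ki, kd, dt, i_limit):
    """One PID update; `state` carries the running integral and previous error."""
    integral, prev_error = state
    # Without the clamp, the integral keeps growing while the actuator is
    # saturated ("integral windup"), which later causes large overshoot.
    integral = max(-i_limit, min(i_limit, integral + error * dt))
    derivative = (error - prev_error) / dt
    u = kp * error + ki * integral + kd * derivative  # reacts only after an error exists
    return u, (integral, error)
```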

Online learning is an essential component to constructing a complete intelligent flight control system. It is fundamental however to develop accurate offline models to account for uncertainties encountered during online learning [2].

Known as the reality gap, the transfer from simulation to the real world has been studied extensively and is known to be problematic unless additional steps are taken to increase realism in the simulator [26], [3].

IV. ENVIRONMENT

In this section we describe our learning environment GymFC for developing intelligent flight control systems using RL. The goal of the proposed environment is to allow the agent to learn attitude control of an aircraft with only the knowledge of the number of actuators.

GymFC has a multi-layer hierarchical architecture composed of three layers: (i) a digital twin layer, (ii) a communication layer, and (iii) an agent-environment interface layer.

A. Digital Twin Layer

At the heart of the learning environment is a high fidelity physics simulator which provides functionality and realism that is hard to achieve with an abstract mathematical model of the aircraft and environment.

For this reason, the simulated environment exposes identical interfaces to actuators and sensors as they would exist in the physical world.

B. Communication Layer

The communication layer is positioned in between the digital twin and the agent-environment interface.

C. Environment Interface Layer

The topmost layer interfacing with the agent is the environment interface layer, which implements the OpenAI Gym API [10].

Each OpenAI Gym environment defines an observation space and an action space.
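As an illustration of these two spaces, below is a minimal sketch of how an attitude-control environment could declare them with the Gym API. The shapes and bounds (angular-velocity error on three axes as the observation, four normalized motor commands as the action, and the `omega_max` default) are assumptions for illustration, not the exact GymFC definitions.

```python
import numpy as np
import gym
from gym import spaces

class AttitudeEnvSketch(gym.Env):
    """Sketch only: declares the spaces, omits the simulator step/reset logic."""

    def __init__(self, omega_max=5.0):  # assumed maximum angular velocity (rad/s)
        # Observation: angular velocity error e = Omega* - Omega per axis.
        self.observation_space = spaces.Box(
            low=-2 * omega_max, high=2 * omega_max, shape=(3,), dtype=np.float32)
        # Action: one normalized command per actuator (a quadcopter has 4 motors).
        self.action_space = spaces.Box(
            low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
```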

Reward engineering can be challenging. For this work, with the goal of establishing a baseline of accuracy, we develop a reward to reflect the current angular velocity error (i.e., e = Ω* − Ω).

We translate the current error e_t at time t into a derived reward r_t normalized to [−1, 0] (Eq. 7 in the paper).

Rewards are normalized to provide standardization and stabilization during training [30].

In addition, we experimented with various other rewards. We found that a sparse binary reward of 1 performed poorly, which we attribute to the complexity of quadcopter control: in the early stages of learning the agent explores its environment, but randomly reaching the target angular velocity within some threshold is rare, so the agent does not receive enough information to converge.
Instead, we found that a signal at every timestep works best. We also tried the Euclidean norm of the error, the squared error, and other scalar values, none of which came close to the performance of the sum of absolute errors (Eq. 7).
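A sketch of the per-timestep reward as described: the negative sum of absolute angular-velocity errors, scaled into [−1, 0]. The normalization constant used here (three axes, each error bounded by 2·Ω_max) is an assumption; the exact form is Eq. 7 in the paper.

```python
import numpy as np

def reward(omega_target, omega, omega_max):
    """Negative, normalized sum of absolute angular-velocity errors."""
    e = omega_target - omega                           # e = Omega* - Omega, per axis
    penalty = np.sum(np.abs(e)) / (3 * 2 * omega_max)  # assumed scaling into [0, 1]
    return -float(np.clip(penalty, 0.0, 1.0))          # reward in [-1, 0]
```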

V. EVALUATION

In this section we present our evaluation of the accuracy of the studied neural-network-based attitude flight controllers trained with RL.

To our knowledge, this is the first RL baseline conducted for quadcopter attitude control.

A. Setup

We evaluate the RL algorithms DDPG, TRPO, and PPO using the implementations in the OpenAI Baselines project [3]. The goal of the OpenAI Baselines project is to establish reference implementations of RL algorithms, providing baselines for researchers to compare approaches and build upon.
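As a rough illustration of how a trained policy can be scored in such an environment, the sketch below runs a few episodes through the standard Gym reset/step interface and averages the absolute error. Here `policy` stands in for whatever the trained DDPG/TRPO/PPO agent returns, and the metric is illustrative, not the paper's exact set of evaluation metrics.

```python
import numpy as np

def evaluate(env, policy, episodes=10):
    """Average absolute angular-velocity error of `policy` over several episodes."""
    errors = []
    for _ in range(episodes):
        obs = env.reset()
        done = False
        while not done:
            action = policy(obs)                        # trained controller: error -> motor commands
            obs, reward, done, info = env.step(action)
            errors.append(float(np.abs(obs).mean()))    # obs assumed to be the error e
    return float(np.mean(errors))
```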

Training and evaluations were run on Ubuntu 16.04 with an eight-core i7-7700 CPU and an NVIDIA GeForce GT 730 graphics card.

B. Results

Limitations: the accuracy of the model (including aerodynamic effects) is not discussed; the controllers were not used in real flight; only the angular-velocity (rate) loop is controlled.

