Virtual-to-real Deep Reinforcement Learning: Continuous Control of Mobile Robots for Mapless Navigation

Abstract

We present a learning-based mapless motion planner that takes the sparse 10-dimensional laser range findings and the target position, expressed in the mobile robot's coordinate frame, as input and produces continuous steering commands as output.
We show that, through an asynchronous deep reinforcement learning method, a mapless motion planner can be trained end-to-end without any manually designed features or prior demonstrations.

Introduction

  1. Deep reinforcement learning in mobile robots: Applications of deep reinforcement learning in robotics have mostly been limited to manipulation, where the workspace is fully observable and stable. For mobile robots, complicated environments enlarge the sample space enormously, so deep-RL methods usually sample actions from a discrete space to simplify the problem. In this paper we therefore focus on the navigation problem of nonholonomic mobile robots with continuous deep-RL control, which is an essential ability for the most widely used class of robots.
  2. Mapless Navigation:
    For nonholonomic mobile ground robots, traditional methods such as simultaneous localization and mapping (SLAM) handle this problem by estimating the robot's pose and building a prior obstacle map of the navigation environment from dense laser range findings, on which the path is then planned. Two issues are less addressed in this pipeline: (1) building and updating the obstacle map is time-consuming, and (2) both the mapping and the local costmap prediction depend heavily on a precise, dense laser sensor.
  3. From virtual to real: The huge difference between the structured simulation environment and the highly complicated real-world environment is the central challenge in transferring a trained model directly to a real robot. In this paper we use only 10-dimensional sparse range findings as the observation input. This highly abstracted observation is sampled from specific angles of the raw laser range findings based on a trivial distribution (a short sketch follows below). This brings two advantages: first, the abstracted observation narrows the gap between the virtual and real environments; second, the method can potentially be extended to low-cost range sensors that provide distance information from only 10 directions.
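As a concrete illustration of this abstraction, the minimal sketch below subsamples a dense laser scan into 10 range readings taken at evenly spaced indices over the front field of view. The function name, the uniform spacing, and the clipping range are my own assumptions for illustration; the paper only states that the sparse readings are sampled from specific angles of the raw scan.

```python
import numpy as np

def sparse_scan(ranges, n_beams=10, max_range=3.5):
    """Subsample a dense laser scan into n_beams sparse readings.

    ranges: 1-D array of raw range readings covering the front field of
            view (e.g. [-pi/2, pi/2]), centered on the robot's heading.
    Returns an array of n_beams readings, clipped to max_range.
    """
    ranges = np.asarray(ranges, dtype=np.float32)
    # Pick n_beams indices spread evenly across the raw scan.
    idx = np.linspace(0, len(ranges) - 1, n_beams).round().astype(int)
    sparse = ranges[idx]
    # Replace invalid readings (NaN/inf) and clip to the sensor's range.
    sparse = np.nan_to_num(sparse, nan=max_range, posinf=max_range)
    return np.clip(sparse, 0.0, max_range)
```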

Related Work

  1. Deep-learning-based navigation: For learning-based obstacle avoidance, deep neural networks have been applied successfully to monocular images and depth images. Chen et al. used semantic information extracted from images by deep neural networks to decide the behavior of an autonomous vehicle. However, their control commands are simple discrete actions such as "turn left" and "turn right", which may lead to rough navigation behaviors. Regarding learning from demonstrations, Pfeiffer et al. [14] used a deep learning model to map laser range findings and the target position to moving commands, and Kretzschmar et al. [15] used inverse reinforcement learning to make robots interact with humans in a socially compliant way. Such trained models are highly dependent on the demonstration data, and a time-consuming data collection procedure is inevitable.
  2. Deep reinforcement learning: DQN is the representative work of deep reinforcement learning, and many robot navigation tasks have already been built on it, but the original DQN only works with discrete action spaces. To extend deep-RL to continuous control, Lillicrap et al. [2] proposed deep deterministic policy gradients (DDPG), which applies deep neural networks to the actor-critic reinforcement learning framework, where both the policy and the value function are represented by hierarchical networks. The actor selects actions according to the currently learned policy and is therefore suited to continuous control, while the critic evaluates the policy with a value function, and this evaluation is used to improve the actor's policy.
    Regarding the training efficiency of deep-RL: asynchronous deep-RL, with multiple sample-collection threads working in parallel, can improve the training efficiency of a specific policy significantly. Mnih et al. optimized deep-RL with asynchronous gradient descent from parallel on-policy actor-learners; the A3C algorithm is a typical application of asynchronous deep-RL.
    Thus, we choose DDPG as our training algorithm. Compared with NAF, DDPG needs fewer training parameters, and we extend DDPG to an asynchronous version to improve sampling efficiency.
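To make the asynchronous idea concrete, here is a minimal sketch of the parallel sample-collection pattern: several environment workers push transitions into a shared replay buffer while a single learner thread performs DDPG updates. The thread layout, the `env`/`policy`/`agent` placeholders, and the buffer structure are my own illustrative assumptions, not the paper's implementation.

```python
import random
import threading
from collections import deque

replay_buffer = deque(maxlen=100_000)   # shared transition storage
buffer_lock = threading.Lock()

def collection_worker(env, policy, n_steps):
    """Roll out the current policy and push transitions into the buffer."""
    state = env.reset()
    for _ in range(n_steps):
        action = policy(state)                       # noisy action for exploration
        next_state, reward, done, _ = env.step(action)
        with buffer_lock:
            replay_buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state

def learner(agent, batch_size=128, n_updates=100_000):
    """Sample minibatches and run DDPG actor/critic updates."""
    for _ in range(n_updates):
        with buffer_lock:
            if len(replay_buffer) < batch_size:
                continue
            batch = random.sample(replay_buffer, batch_size)
        agent.update(batch)                          # critic, actor, and target-net updates

# Example launch: several collection threads plus one learner thread, e.g.
# threads = [threading.Thread(target=collection_worker, args=(make_env(), agent.act, 10_000))
#            for _ in range(4)] + [threading.Thread(target=learner, args=(agent,))]
# for t in threads: t.start()
```

Note that this sketch shares transitions, which matches the off-policy nature of DDPG; A3C-style methods instead send gradients from parallel on-policy actor-learners back to a shared model.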

Motion planner implementation

  1. Asynchronous DRL
    The training efficiency with and without the asynchronous setting is compared.
  2. Problem Definition
    This paper aims to provide a mapless motion planner for mobile ground robots: $v_t = f(x_t, p_t, v_{t-1})$, where $x_t$ is the observation from the raw sensor information, $p_t$ is the relative position of the target, and $v_{t-1}$ is the velocity of the mobile robot in the last time step. Together they can be regarded as the instant state $s_t$ of the mobile robot. The model directly maps the state to the action, which is the velocity $v_t$ for the next time step.
  3. Network Structure
    The problem can naturally be formulated as a reinforcement learning problem. In this paper, we use the extended asynchronous DDPG described above.
    The state input is a 14-dimensional vector (10 sparse laser readings, the 2-dimensional relative target position, and the 2-dimensional velocity of the last time step); the laser readings cover $[-\frac{\pi}{2}, \frac{\pi}{2}]$. In the actor network, after 3 fully-connected layers with 512 nodes each, the input vector is transformed into the linear and angular velocity commands of the mobile robot (see the sketch after this list).
    The critic network predicts the Q-value of the state-action pair. We still use 3 fully-connected layers to process the input state; the action is merged in at the second fully-connected layer. The Q-value is finally produced through a linear activation, $y = kx + b$, where $x$ is the input of the last layer, $y$ is the predicted Q-value, and $k$ and $b$ are the trained weight and bias of this layer.
  4. Reward function definition
    (Figure omitted: definition of the reward function; a hedged reconstruction is given after this list.)
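To make the architecture above concrete, here is a minimal PyTorch sketch of the actor and critic described in this section: the 14-dimensional state, three 512-unit fully-connected layers, the action merged into the critic at the second layer, and a linear output for the Q-value. The choice of PyTorch, the hidden activations, and the output scaling are my own assumptions; the paper's exact hyperparameters may differ.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, HIDDEN = 14, 2, 512   # 10 laser + 2 target + 2 last velocity

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(STATE_DIM, HIDDEN)
        self.fc2 = nn.Linear(HIDDEN, HIDDEN)
        self.fc3 = nn.Linear(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, ACTION_DIM)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        raw = self.out(x)
        v = torch.sigmoid(raw[:, :1])   # linear velocity in [0, 1] (assumed scaling)
        w = torch.tanh(raw[:, 1:])      # angular velocity in [-1, 1] (assumed scaling)
        return torch.cat([v, w], dim=1)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(STATE_DIM, HIDDEN)
        self.fc2 = nn.Linear(HIDDEN + ACTION_DIM, HIDDEN)  # action merged at layer 2
        self.fc3 = nn.Linear(HIDDEN, HIDDEN)
        self.q = nn.Linear(HIDDEN, 1)                      # linear activation y = kx + b

    def forward(self, state, action):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(torch.cat([x, action], dim=1)))
        x = torch.relu(self.fc3(x))
        return self.q(x)
```

The reward function itself appears only as a figure in the original post. A plausible reconstruction, using the common shaping for this kind of goal-reaching task (an arrival reward, a collision penalty, and a dense term proportional to progress toward the target), would be:

$$
r(s_t, a_t) =
\begin{cases}
  r_{\text{arrive}} & \text{if } d_t < c_d \\
  r_{\text{collide}} & \text{if the robot collides with an obstacle} \\
  c_r\,(d_{t-1} - d_t) & \text{otherwise}
\end{cases}
$$

where $d_t$ is the distance to the target at time $t$, $c_d$ is an arrival threshold, and $c_r$ is a scaling constant. Treat this as an illustration of the shaping idea rather than the paper's exact definition.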

Experiments

The baseline is a SLAM-based navigation method, implemented with the ROS move_base package: it performs self-localization and obstacle-map building from the laser scan, and plans with a global A* planner and a local dynamic window approach (DWA) planner (a minimal goal-sending example is sketched below).
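For reference, this is how a navigation goal is typically sent to move_base in a standard ROS 1 setup; it is a generic usage sketch, not code from the paper, and the goal coordinates are arbitrary.

```python
#!/usr/bin/env python
import rospy
import actionlib
from move_base_msgs.msg import MoveBaseAction, MoveBaseGoal

rospy.init_node("send_nav_goal")
client = actionlib.SimpleActionClient("move_base", MoveBaseAction)
client.wait_for_server()

goal = MoveBaseGoal()
goal.target_pose.header.frame_id = "map"        # goal expressed in the map frame
goal.target_pose.header.stamp = rospy.Time.now()
goal.target_pose.pose.position.x = 2.0          # arbitrary example coordinates
goal.target_pose.pose.position.y = 1.0
goal.target_pose.pose.orientation.w = 1.0       # face forward

client.send_goal(goal)                          # global A* + local planner take over
client.wait_for_result()
```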
For a fair comparison, the traditional method was also given the same sparse 10-dimensional laser scan as the DRL method. Figure (a) shows move_base with the full laser scan, i.e. the conventional setting; figure (b) shows the traditional method with the sparse scan, where black marks the positions at which navigation failed; figures (c) and (d) show the DRL results in two different environments.
The results show that, with sparse input, the traditional method has a lower success rate than the DRL method. (This experiment does not examine the generalization ability of DRL. Based on my own experience with UAVs, traditional methods generalize better when there are more kinds of obstacles or when the environment changes, and a DRL policy struggles to make good decisions in scenes it has never been trained on.)
The experiments also analyze the maximum control frequency, which reflects the query efficiency of the motion planner. In the results, although the DRL method travels a longer distance than move_base, the travel time is almost the same, which indirectly shows that the DRL planner is queried faster during planning.

Discussion

  1. In the real-world tests, the policy trained in Env2 performs better than the one trained in Env1, because the obstacles in Env2 are more complex and dense.
  2. The trajectories generated by the proposed DRL method are more tortuous than those of the traditional move_base baseline. One possible explanation is that the trained network has no long-term prediction ability; introducing recurrent structures such as RNN or LSTM networks could address this.
  3. The goal of this paper is not to replace traditional methods, because in large-scale, highly complex environments a map of the environment can provide more information for navigation. Instead, the aim is to provide a low-cost indoor navigation solution for robots equipped with low-cost, low-accuracy sensors.