Self-supervised DRL with Generalized Computation Graphs for Robot Navigation

Abstract

Learning-based methods improve as the robot acts in the environment, but are difficult to deploy in the real world due to their high sample complexity. To address the need to learn complex policies with few samples, we propose a generalized computation graph that subsumes value-based model-free methods and model-based methods, with specific instantiations interpolating between model-free and model-based.
We then instantiate this graph to form a navigation model that learns from raw images and is sample efficient. Our simulated car experiments explore the design decisions of our navigation model, and show that our approach outperforms single-step and N-step double Q-learning.

Introduction

Not only can learning-based systems lift some of the assumptions of geometric reconstruction methods, but they offer two major advantages that are not present in analytic approaches: (1) learning-based methods adapt to the statistics of the environments in which they are trained, and (2) learning-based systems can learn from their mistakes. The first advantage means that a learning-based navigation system may be able to act more intelligently even under partial observation by exploiting its knowledge of statistical patterns. The second advantage means that, when a learning-based system does make a mistake that results in a failure, the resulting data can be used to improve the system and prevent that failure from recurring. This second advantage, which is the principal focus of this work, is closely associated with reinforcement learning: algorithms that learn from trial-and-error experience.
Reinforcement learning methods are typically classified as either model-free or model-based. Value-based model-free approaches learn a function that takes as input a state and action, and outputs the value (i.e., the expected sum of future rewards).
Policy extraction is then performed by selecting the action that maximizes the value function. Model-based approaches learn a predictive function that takes as input a state and a sequence of actions, and outputs future states. Policy extraction is then performed by selecting the action sequence that maximizes the future rewards computed from the predicted future states. In general, model-free algorithms can learn complex tasks, but are usually sample-inefficient, while model-based algorithms are typically sample-efficient, but have difficulty scaling to complex, high-dimensional tasks.
We explore the intersection between value-based model-free algorithms and model-based algorithms in the context of learning robot navigation policies.
The three contributions are:

  1. A generalized computation graph for reinforcement learning that subsumes value-based model-free methods and model-based methods.
  2. Instantiations of the generalized computation graph for the task of robot navigation, resulting in a suite of hybrid model-free, model-based algorithms.
  3. An extensive empirical evaluation of the proposed method.

Related work

Traditional navigation approaches are based on SLAM, but they remain limited by vehicle size, onboard computational resources, and so on, which motivates learning-based (deep learning) methods.
Learning-based methods have attempted to address these limitations by learning from data. These supervised learning methods include learning: drivable routes and then using a planner [13], near-to-far obstacle detectors [14], reactive controllers on top of a map-based planner [15], driving affordances [16], and end-to-end driving from demonstrations [17], [18]. However, the capabilities of powerful and expressive models like deep neural networks are often constrained in large part by the available data, and methods based on human supervision are inherently limited by the amount of human data available.
In contrast, the deep reinforcement learning algorithm proposed in this paper is self-supervised and learns directly in the real environment; in principle, such a system could keep learning and improving autonomously over its entire operating lifetime.
While these methods have been used to learn robot navigation policies, they often require simulation experience [19], [20]. In contrast, our approach learns to navigate from scratch, using only monocular images gathered in the real world.

Preliminaries

Our goal is to learn collision avoidance policies for mobile robots. We formalize this task as a reinforcement learning problem, where the robot is rewarded for collision-free navigation.
In reinforcement learning, the goal is to learn a policy that chooses actions $a_t \in A$ at each time step $t$ in response to the current state $s_t \in S$, such that the total expected sum of discounted rewards is maximized over all time.
At each time step, the system transitions from $s_t$ to $s_{t+1}$ in response to the chosen action $a_t$ and the transition probability $T(s_{t+1} \mid s_t, a_t)$, collecting a reward $r_t$ according to the reward function $R(s_t, a_t)$.
The expected sum of discounted rewards is then defined as $E_{\pi,T}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \mid s_t, a_t\right]$, where $\gamma \in [0,1]$ is a discount factor that prioritizes near-term rewards over distant rewards, and the expectation is taken under the transition function $T$ and a policy $\pi$.

Value-based model-free reinforcement learning

Value-based model-free algorithms learn a value function in order to select which actions to take.
The standard parametric Q-function, $Q_\theta(s, a)$, is a function of the current state and a single action, and outputs the expected discounted sum of future rewards that will be received by the optimal policy after taking action $a$ in state $s$, where $\theta$ denotes the function parameters.
A standard method for learning the Q-function is to minimize the Bellman error, given by
$$\varepsilon_t(\theta) = \frac{1}{2} E_{s,a}\left[ \left\| r_t + \gamma V_{t+1} - Q_\theta(s_t, a_t) \right\|^2 \right],$$
where the actions are sampled from $\pi(\cdot \mid s)$ and the $V_{t+1}$ term is known as the bootstrap.
Defining the N-step value as $V_t^{(N)} = \sum_{n=0}^{N-1} \gamma^n r_{t+n} + \gamma^N V_{t+N}$, we augment the standard Bellman error minimization objective by considering a weighted combination of Bellman errors from horizon length $1$ to $N$:
$$\varepsilon_t(\theta) = \frac{1}{2} E_{s,a}\left[ \left\| \sum_{N'=1}^{N} \omega_{N'} V_t^{(N')} - Q_\theta(s_t, a_t) \right\|^2 \right], \qquad \sum_{N'=1}^{N} \omega_{N'} = 1.
$$
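As a concrete illustration, here is a minimal NumPy sketch (the function and variable names are illustrative, not from the paper) that forms this weighted combination of N-step values, which serves as the regression target for $Q_\theta(s_t, a_t)$:

```python
import numpy as np

def weighted_nstep_target(rewards, bootstrap_values, weights, gamma=0.99):
    """Combine the N-step values V_t^(N') for N' = 1..N into a single target.

    rewards:          r_t, ..., r_{t+N-1}        (length N)
    bootstrap_values: V_{t+1}, ..., V_{t+N}      (length N), e.g. max_a Q(s_{t+N'}, a)
    weights:          w_1, ..., w_N with sum(weights) == 1
    """
    N = len(rewards)
    targets = []
    for n_prime in range(1, N + 1):
        # V_t^(N') = sum_{n=0}^{N'-1} gamma^n r_{t+n} + gamma^N' V_{t+N'}
        discounted_rewards = sum(gamma ** n * rewards[n] for n in range(n_prime))
        targets.append(discounted_rewards + gamma ** n_prime * bootstrap_values[n_prime - 1])
    return float(np.dot(weights, targets))

# Example: N = 3 with uniform weights
target = weighted_nstep_target(rewards=[1.0, 1.0, 0.0],
                               bootstrap_values=[2.5, 1.8, 1.2],
                               weights=[1/3, 1/3, 1/3])
```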

Comparing model-free and model-based methods

We compare model-free and model-based methods along three axes: sample efficiency, stability, and final performance.
Model-free techniques are often sample inefficient. Specifically, for (N-step) Q-learning, bias from bootstrapping and high-variance multi-step returns can lead to slow convergence. Furthermore, Q-learning often requires experience replay buffers and target networks for stable learning, which further decreases sample efficiency.
In contrast, model-based methods can be very sample efficient and stable, since learning the transition model reduces to supervised learning of dense time-series data [8]. However, their final performance can fall short, because maximizing the accuracy of the transition model is only a surrogate objective; an accurate model does not necessarily yield a well-performing policy.

A generalized computation graph for RL

The computation graph $G_\theta(s_t, A_t^H)$, parameterized by a vector $\theta$, takes as input the current state $s_t$ and a sequence of $H$ actions $A_t^H = (a_t, \dots, a_{t+H-1})$, and produces $H$ sequential predicted outputs $\hat{Y}_t^H = (\hat{y}_t, \dots, \hat{y}_{t+H-1})$ together with a predicted terminal output $\hat{b}_{t+H}$. These predicted outputs $\hat{Y}_t^H$ and $\hat{b}_{t+H}$ are combined and compared with the labels $Y_t^H$ and $b_{t+H}$ to form an error signal $\varepsilon_t(\theta)$ that is minimized using an optimizer.
We first instantiate the computation graph for N-step Q-learning by letting $y$ be the reward and $b$ be the future value estimate, setting the model horizon $H = 1$ and using N-step returns, and letting the error function be the Bellman error: $\varepsilon_t(\theta) = \left\| (\hat{y}_t + \gamma \hat{b}_{t+1}) - \left( \sum_{n=0}^{N-1} \gamma^n y_{t+n} + \gamma^N b_{t+N} \right) \right\|_2^2$.
We define $J(s_t, A_t^H)$ to be the generalized policy evaluation function, a scalar function used to select actions via $\pi(A^H \mid s_t) = \arg\max_{A^H} J(s_t, A_t^H)$. For N-step Q-learning, $J(s_t, A_t^H) = \hat{y}_t + \gamma \hat{b}_{t+1}$ is the estimated future value.
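To make the abstraction concrete, the following short Python sketch (the interface and names are illustrative assumptions, not the paper's code) spells out the graph signature $G_\theta(s_t, A_t^H) \to (\hat{Y}_t^H, \hat{b})$ and the N-step Q-learning evaluation function $J$:

```python
from typing import Callable, Tuple
import numpy as np

# G_theta(s_t, A_t^H) -> (Y_hat_t^H, b_hat): H per-step predictions plus one terminal prediction.
ComputationGraph = Callable[[np.ndarray, np.ndarray], Tuple[np.ndarray, float]]

def J_nstep_qlearning(graph: ComputationGraph, state: np.ndarray,
                      action_seq: np.ndarray, gamma: float = 0.99) -> float:
    """Policy evaluation for the N-step Q-learning instantiation (model horizon H = 1):
    J(s_t, A_t^H) = y_hat_t + gamma * b_hat_{t+1}."""
    y_hat, b_hat = graph(state, action_seq[:1])  # with H = 1 only the first action is consumed
    return float(y_hat[0]) + gamma * b_hat
```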

Learning navigation policies with self-supervision

Model parameterization

While many function approximators could be used to instantiate our generalized computation graph, the function approximator needs to be able to cope with high-dimensional state inputs, such as images, and accurately model sequential data due to the nature of robot navigation. We therefore parameterize the computation graph as a deep recurrent neural network (RNN), depicted in Fig. 3.
The recurrent units used in this RNN are LSTMs.
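A minimal PyTorch sketch of this kind of parameterization (the `NavigationRNN` name, layer sizes, and exact architecture are assumptions for illustration, not the paper's network): a convolutional encoder maps the image $s_t$ to the initial LSTM state, and the LSTM is unrolled over the $H$ actions to produce $\hat{y}_t, \dots, \hat{y}_{t+H-1}$ and $\hat{b}_{t+H}$.

```python
import torch
import torch.nn as nn

class NavigationRNN(nn.Module):
    """Image -> CNN encoder -> initial LSTM state; action sequence -> LSTM -> per-step outputs."""

    def __init__(self, action_dim=1, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                      # input: 1 x 36 x 64 grayscale image
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        enc_dim = 32 * 7 * 14                              # conv output size for a 36 x 64 image
        self.to_h0 = nn.Linear(enc_dim, hidden_dim)
        self.to_c0 = nn.Linear(enc_dim, hidden_dim)
        self.lstm = nn.LSTM(action_dim, hidden_dim, batch_first=True)
        self.step_head = nn.Linear(hidden_dim, 1)          # y_hat_{t+h}, one per unrolled step
        self.terminal_head = nn.Linear(hidden_dim, 1)      # b_hat_{t+H}, from the final hidden state

    def forward(self, image, actions):
        """image: (B, 1, 36, 64); actions: (B, H, action_dim)."""
        z = self.encoder(image)
        h0 = self.to_h0(z).unsqueeze(0)                    # (1, B, hidden_dim)
        c0 = self.to_c0(z).unsqueeze(0)
        out, (h_H, _) = self.lstm(actions, (h0, c0))       # out: (B, H, hidden_dim)
        y_hat = self.step_head(out).squeeze(-1)            # (B, H)
        b_hat = self.terminal_head(h_H.squeeze(0)).squeeze(-1)  # (B,)
        return y_hat, b_hat
```

For the collision-probability variant described below, a sigmoid would be applied to both heads so that the outputs lie in $[0, 1]$.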

Model outputs

We consider two choices of output quantity. The first is the standard approach in the reinforcement learning literature: $\hat{Y}_t^H$ represents rewards and $\hat{b}_{t+H}$ represents the future value-to-go. For the task of collision-free navigation, we define the reward as the robot's speed, which is typically known from onboard sensors, and therefore the value is approximately the distance the robot travels before experiencing a collision. In the second choice, $\hat{Y}_t^H$ represents the probability of collision at or before each timestep; that is, $\hat{y}_{t+h}$ is the probability that the robot will collide between time $t$ and $t+h$, and $\hat{b}_{t+H}$ represents the best-case future likelihood of collision.

Policy evaluation function

If the model outputs are values, which in our case represent the expected distance-to-travel, then the policy evaluation function is simply the value
$$J(s_t, A_t^H) = \sum_{h=0}^{H-1} \gamma^h \hat{y}_{t+h} + \gamma^H \hat{b}_{t+H}.$$
That is, $J$ is the discounted sum of the $H$ predicted rewards plus the bootstrapped terminal value, and can be interpreted as an H-step Q-value.
If the model outputs are collision probabilities, then the policy evaluation function needs to somehow encourage the robot to move through the environment. We assume that the robot will be travelling at some fixed speed, and therefore the policy evaluation function needs to evaluate which actions are least likely to result in a collision:
$$J(s_t, A_t^H) = \sum_{h=0}^{H-1} -\hat{y}_{t+h} - \hat{b}_{t+H}.$$
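The two evaluation functions side by side, as a small NumPy sketch operating on the model's predicted outputs (names are illustrative):

```python
import numpy as np

def J_value(y_hat, b_hat, gamma=0.99):
    """Outputs are values (speeds / distance-to-travel): discounted sum plus terminal value."""
    H = len(y_hat)
    discounts = gamma ** np.arange(H)
    return float(np.dot(discounts, y_hat) + gamma ** H * b_hat)

def J_collision(y_hat, b_hat):
    """Outputs are collision probabilities: prefer action sequences least likely to collide."""
    return float(-np.sum(y_hat) - b_hat)
```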

Policy evaluation

Using the policy evaluation function, action selection is performed by greedily solving the finite-horizon planning problem $\arg\max_{A^H} J(s_t, A^H)$, i.e., choosing the action sequence $A^H$ that maximizes $J(s_t, A_t^H)$.
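One simple way to approximately solve this maximization is random shooting; the sketch below (the candidate count, action bounds, and replanning scheme are illustrative assumptions, not the paper's planner) samples $K$ action sequences, scores each with $J$, and returns the best one:

```python
import numpy as np

def select_action_sequence(J, state, horizon=16, num_candidates=128,
                           action_low=-1.0, action_high=1.0):
    """Greedy finite-horizon planning by random shooting: sample candidate action
    sequences, score each with the policy evaluation function J(state, A^H),
    and return the highest-scoring sequence."""
    candidates = np.random.uniform(action_low, action_high,
                                   size=(num_candidates, horizon, 1))
    scores = [J(state, A) for A in candidates]
    return candidates[int(np.argmax(scores))]

# Typical MPC-style usage: execute only the first action of the chosen sequence,
# then replan at the next time step.
```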

Model horizon

With $H = 1$, the approach reduces to a fully model-free method; with $H = \infty$, it reduces to a fully model-based method. For intermediate values of $H$, the model is a hybrid of model-free and model-based methods. We empirically evaluate different horizon values in our experiments.

Label horizon

Increasing the label horizon can speed up learning, but it turns N-step Q-learning into an on-policy algorithm, which is undesirable when a robot is learning to navigate: we want to train on all kinds of data, including data collected by older policies and by exploration policies.

Bootstrapping

We set the label horizon equal to the model horizon. To make the model account for outcomes further in the future, we can either increase the model horizon or use bootstrapping. Increasing the model horizon makes the approach more model-based, but the search space during policy evaluation grows exponentially with the horizon. Bootstrapping mitigates the planning problem without enlarging the model horizon, but it can introduce bias and instability into learning.

Training the model

The model is trained on a dataset $D$ by defining a loss function between the model outputs and the labels.
For samples $(s_t^H, A_t^H, y_t^H) \in D$, if the model outputs and labels are values, the loss is the standard Bellman error
$$\varepsilon_t(\theta) = \left\| \sum_{n=0}^{N-1} \gamma^n y_{t+n} + \gamma^N b_{t+H} - J(s_t, A_t^H) \right\|_2^2,$$
in which $b_{t+H} = \max_{A^H} J(s_{t+H}, A^H)$.
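As a sketch, this value-output loss can be written as follows (PyTorch; assumes the label horizon equals the model horizon, with `b_label` the bootstrapped target, or zero when bootstrapping is disabled):

```python
import torch

def value_loss(y_hat, b_hat, y_labels, b_label, gamma=0.99):
    """y_hat: (B, H) predicted rewards; b_hat: (B,) predicted terminal value.
    y_labels: (B, H) observed rewards; b_label: (B,) bootstrapped value target.
    A single squared error between the discounted sums, as in the Bellman error above."""
    H = y_hat.shape[1]
    discounts = gamma ** torch.arange(H, dtype=y_hat.dtype)
    pred = (y_hat * discounts).sum(dim=1) + gamma ** H * b_hat        # J(s_t, A_t^H)
    target = (y_labels * discounts).sum(dim=1) + gamma ** H * b_label
    return ((target - pred) ** 2).mean()
```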
If the model outputs are collision probabilities, the loss is the cross-entropy between the predicted collision probabilities and the observed collision labels at each step.
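A PyTorch sketch of such a per-step cross-entropy loss (reconstructed from the description above, not necessarily the paper's exact formula):

```python
import torch
import torch.nn.functional as F

def collision_loss(y_hat, y_labels):
    """y_hat:    (B, H) predicted collision probabilities (sigmoid outputs in [0, 1]).
    y_labels: (B, H) float binary labels, 1 if a collision occurred at or before step t+h.
    Sum of H separate cross-entropy losses, one per predicted step."""
    per_step = F.binary_cross_entropy(y_hat, y_labels, reduction='none')  # (B, H)
    return per_step.sum(dim=1).mean()
```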

Experiments

The experiments address three questions:

  1. How do the different design choices in the navigation computation graph affect performance?
  2. Given the best design choices, how does our method compare with prior methods?
  3. Can our approach successfully learn a navigation policy on a real robot in a complex environment?

The robot state $s \in \mathbb{R}^{2304}$ is a $64 \times 36$ grayscale image taken from an onboard forward-facing camera, and the action space is one-dimensional, $A \subseteq \mathbb{R}^{1}$.
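For reference, a small sketch of the preprocessing this state representation implies (OpenCV-based; the normalization and exact pipeline are assumptions):

```python
import cv2
import numpy as np

def preprocess(frame_bgr):
    """Convert a raw camera frame into the 64 x 36 grayscale state (2304 values)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (64, 36), interpolation=cv2.INTER_AREA)  # dsize is (width, height)
    return small.astype(np.float32) / 255.0  # optional [0, 1] scaling; flattened, 64 * 36 = 2304
```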

Model outputs and loss function

"value"对应的输出表示了未来回报的期望和,"collision"对应的输出表示了碰撞的概率.回归问题使用的是平均方差损失函数,分类问题使用的是交叉熵损失函数.实验结果表明,collision方法优于value方法,两者的主要差别是:
The value model loss is a single loss on the sum of the outputs, while the collision model is the sum of H {H} separate losses on each of the outputs.
因此,碰撞模型对碰撞标签何时及时出现有额外的监督作用.此外,在样本效率和最终性能方面,具有交叉熵损失的训练明显优于具有均方误差损失的训练. 这一比较表明,预测离散的未来事件比预测连续的折扣奖励更能使学习速度更快,更稳定.虽然我们只是在机器人导航的背景下展示了这一发现,但这一见解可能会产生一类新的高效,稳定,高性能的强化学习算法.

Model horizon

Next we examine the effect of different model horizons $H$: with $H = 1$ the robot plans roughly $0.5\,\mathrm{m}$ ahead, and with $H = 16$ roughly $8\,\mathrm{m}$ ahead.
For models that output values, training with a longer horizon is more stable and yields a better final policy; the longer-horizon model performs better because the longer horizon reduces the bias introduced by bootstrapping. For models that output collision probabilities, however, we did not observe any difference in performance between short- and long-horizon models, likely because probabilities are necessarily bounded between 0 and 1, which limits the bias of the bootstrap.

Bootstrapping

For the bootstrapping comparison we use $H = 16$. Without bootstrapping, the value-output model fails to learn, whereas the collision-probability model is sample efficient and stable. With bootstrapping, the value-output model still performs worse than the collision-probability model, although it does benefit from bootstrapping; the collision-prediction model, in contrast, is largely unaffected by whether bootstrapping is used. Together, these results indicate that it is advantageous to look ahead a full $H$ steps (a long horizon) and to avoid bootstrapping.

Comparisons with prior work

Compared with prior methods, including double Q-learning and N-step double Q-learning, our algorithm achieves higher sample efficiency, better stability, and better final performance.

Real-world results

We therefore made the system fully asynchronous: the car continuously runs the reinforcement learning algorithm and sends data to the laptop, while the laptop continuously trains the model and periodically sends updated model parameters back to the car.
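A minimal sketch of such an asynchronous setup using Python threads and queues (the `policy`, `model`, and `dataset` objects and the update scheme are placeholders, not the paper's implementation):

```python
import queue
import threading

data_queue = queue.Queue()    # robot -> laptop: newly collected samples
param_queue = queue.Queue()   # laptop -> robot: updated model parameters

def robot_loop(policy):
    """Runs onboard: act, stream data out, and swap in new parameters when available."""
    while True:
        data_queue.put(policy.step())                 # collect one (state, actions, labels) sample
        try:
            policy.load(param_queue.get_nowait())     # non-blocking parameter update
        except queue.Empty:
            pass

def trainer_loop(model, dataset):
    """Runs on the laptop: absorb incoming data, train continuously, ship parameters back."""
    while True:
        dataset.add(data_queue.get())
        model.train_step(dataset.sample_batch())
        param_queue.put(model.get_parameters())       # periodically in practice

# threading.Thread(target=robot_loop, args=(policy,), daemon=True).start()
# threading.Thread(target=trainer_loop, args=(model, dataset), daemon=True).start()
```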
In evaluating our approach, we chose the best design decisions from our simulation experiments: the model outputs are collision probabilities trained with classification, a large model horizon ($H = 12$, corresponding to $3.6\,\mathrm{m}$ of lookahead), and no bootstrapping. All other settings were exactly the same as in the simulation experiments.
The real-world experiments compare against double Q-learning and demonstrate the advantages of the proposed algorithm.



Reposted from blog.csdn.net/qq_38649880/article/details/103081089