Continuous Control with Deep Reinforcement Learning (DDPG): Paper Translation

CONTINUOUS CONTROL WITH DEEP REINFORCEMENT LEARNING

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver & Daan Wierstra

Abstract

We adapt the ideas underlying the success of Deep Q-Learning to the continuous action domain. We present an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. Using the same learning algorithm, network architecture and hyper-parameters, our algorithm robustly solves more than 20 simulated physics tasks, including classic problems such as cartpole swing-up, dexterous manipulation, legged locomotion and car driving. Our algorithm is able to find policies whose performance is competitive with those found by a planning algorithm with full access to the dynamics of the domain and its derivatives. We further demonstrate that for many of the tasks the algorithm can learn policies “end-to-end”: directly from raw pixel inputs.

1 INTRODUCTION

One of the primary goals of the field of artificial intelligence is to solve complex tasks from unprocessed, high-dimensional, sensory input. Recently, significant progress has been made by combining advances in deep learning for sensory processing (Krizhevsky et al., 2012) with reinforcement learning, resulting in the “Deep Q Network” (DQN) algorithm (Mnih et al., 2015) that is capable of human level performance on many Atari video games using unprocessed pixels for input. To do so, deep neural network function approximators were used to estimate the action-value function.

However, while DQN solves problems with high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces. Many tasks of interest, most notably physical control tasks, have continuous (real valued) and high dimensional action spaces. DQN cannot be straightforwardly applied to continuous domains since it relies on finding the action that maximizes the action-value function, which in the continuous valued case requires an iterative optimization process at every step.

An obvious approach to adapting deep reinforcement learning methods such as DQN to continuous domains is to simply discretize the action space. However, this has many limitations, most notably the curse of dimensionality: the number of actions increases exponentially with the number of degrees of freedom. For example, a 7 degree of freedom system (as in the human arm) with the coarsest discretization $a_i \in \{-k, 0, k\}$ for each joint leads to an action space with dimensionality $3^7 = 2187$. The situation is even worse for tasks that require fine control of actions, as they require a correspondingly finer grained discretization, leading to an explosion of the number of discrete actions. Such large action spaces are difficult to explore efficiently, and thus successfully training DQN-like networks in this context is likely intractable. Additionally, naive discretization of action spaces needlessly throws away information about the structure of the action domain, which may be essential for solving many problems.

In this work we present a model-free, off-policy actor-critic algorithm using deep function approximators that can learn policies in high-dimensional, continuous action spaces. Our work is based on the deterministic policy gradient (DPG) algorithm (Silver et al., 2014) (itself similar to NFQCA (Hafner & Riedmiller, 2011), and similar ideas can be found in (Prokhorov et al., 1997)). However, as we show below, a naive application of this actor-critic method with neural function approximators is unstable for challenging problems.

Here we combine the actor-critic approach with insights from the recent success of Deep Q Network (DQN) (Mnih et al., 2013; 2015). Prior to DQN, it was generally believed that learning value functions using large, non-linear function approximators was difficult and unstable. DQN is able to learn value functions using such function approximators in a stable and robust way due to two innovations: 1. the network is trained off-policy with samples from a replay buffer to minimize correlations between samples; 2. the network is trained with a target Q network to give consistent targets during temporal difference backups. In this work we make use of the same ideas, along with batch normalization (Ioffe & Szegedy, 2015), a recent advance in deep learning.

In order to evaluate our method we constructed a variety of challenging physical control problems that involve complex multi-joint movements, unstable and rich contact dynamics, and gait behavior. Among these are classic problems such as the cartpole swing-up problem, as well as many new domains. A long-standing challenge of robotic control is to learn an action policy directly from raw sensory input such as video. Accordingly, we placed a fixed viewpoint camera in the simulator and attempted all tasks using both low-dimensional observations (e.g. joint angles) and directly from pixels.

Our model-free approach, which we call Deep DPG (DDPG), can learn competitive policies for all of our tasks using low-dimensional observations (e.g. cartesian coordinates or joint angles) using the same hyper-parameters and network structure. In many cases, we are also able to learn good policies directly from pixels, again keeping hyper-parameters and network structure constant.

A key feature of the approach is its simplicity: it requires only a straightforward actor-critic architecture and learning algorithm with very few “moving parts”, making it easy to implement and scale to more difficult problems and larger networks. For the physical control problems we compare our results to a baseline computed by a planner (Tassa et al., 2012) that has full access to the underlying simulated dynamics and its derivatives (see supplementary information). Interestingly, DDPG can sometimes find policies that exceed the performance of the planner, in some cases even when learning from pixels (the planner always plans over the underlying low-dimensional state space).

2 BACKGROUND

We consider a standard reinforcement learning setup consisting of an agent interacting with an environment $E$ in discrete timesteps. At each timestep $t$ the agent receives an observation $x_t$, takes an action $a_t$ and receives a scalar reward $r_t$. In all the environments considered here the actions are real-valued: $a_t \in \mathbb{R}^N$. In general, the environment may be partially observed so that the entire history of the observation, action pairs $s_t = (x_1, a_1, \ldots, a_{t-1}, x_t)$ may be required to describe the state. Here, we assumed the environment is fully-observed so $s_t = x_t$.

An agent's behavior is defined by a policy, $\pi$, which maps states to a probability distribution over the actions, $\pi: S \to P(A)$. The environment, $E$, may also be stochastic. We model it as a Markov decision process with a state space $S$, action space $A = \mathbb{R}^N$, an initial state distribution $p(s_1)$, transition dynamics $p(s_{t+1} \mid s_t, a_t)$, and reward function $r(s_t, a_t)$.

The return from a state is defined as the sum of discounted future reward $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(s_i, a_i)$ with a discounting factor $\gamma \in [0, 1]$. Note that the return depends on the actions chosen, and therefore on the policy $\pi$, and may be stochastic. The goal in reinforcement learning is to learn a policy which maximizes the expected return from the start distribution $J = E_{r_i, s_i \sim E, a_i \sim \pi}[R_1]$. We denote the discounted state visitation distribution for a policy $\pi$ as $\rho^{\pi}$.
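
As a small, concrete illustration of this definition (not code from the paper; the function name and example rewards are ours), the return can be computed from a recorded list of rewards as follows:

```python
# A minimal sketch: the discounted return R_t from a list of episode rewards.
def discounted_return(rewards, gamma=0.99, t=0):
    """R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i)."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)

# Example: rewards [1, 0, 2] with gamma = 0.9 gives 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9, t=0))
```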

The action-value function is used in many reinforcement learning algorithms. It describes the expected return after taking an action $a_t$ in state $s_t$ and thereafter following policy $\pi$:

$$Q^{\pi}(s_t, a_t) = E_{r_{i \ge t},\, s_{i > t} \sim E,\, a_{i > t} \sim \pi}\left[ R_t \mid s_t, a_t \right] \tag{1}$$

Many approaches in reinforcement learning make use of the recursive relationship known as the Bellman equation:

$$Q^{\pi}(s_t, a_t) = E_{r_t,\, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma E_{a_{t+1} \sim \pi}\left[ Q^{\pi}(s_{t+1}, a_{t+1}) \right] \right] \tag{2}$$

If the target policy is deterministic we can describe it as a function $\mu: S \to A$ and avoid the inner expectation:

$$Q^{\mu}(s_t, a_t) = E_{r_t,\, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma Q^{\mu}(s_{t+1}, \mu(s_{t+1})) \right] \tag{3}$$

The expectation depends only on the environment. This means that it is possible to learn $Q^{\mu}$ off-policy, using transitions which are generated from a different stochastic behavior policy $\beta$. Q-learning (Watkins & Dayan, 1992), a commonly used off-policy algorithm, uses the greedy policy $\mu(s) = \arg\max_a Q(s, a)$. We consider function approximators parameterized by $\theta^Q$, which we optimize by minimizing the loss:

$$L(\theta^Q) = E_{s_t \sim \rho^{\beta},\, a_t \sim \beta,\, r_t \sim E}\left[ \left( Q(s_t, a_t \mid \theta^Q) - y_t \right)^2 \right] \tag{4}$$

where
$$y_t = r(s_t, a_t) + \gamma Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q) \tag{5}$$

While $y_t$ is also dependent on $\theta^Q$, this is typically ignored.
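
To make equations (4)-(5) concrete, here is a hedged PyTorch sketch of the critic loss for a minibatch of transitions; `critic` and `actor` are assumed `nn.Module` networks, and all names and shapes are illustrative rather than the paper's code. Following the remark above, the dependence of $y_t$ on $\theta^Q$ is simply ignored by detaching the target from the computation graph (Section 3 replaces this with slowly updated target networks).

```python
import torch
import torch.nn.functional as F

def critic_loss(critic, actor, s, a, r, s_next, gamma=0.99):
    # y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1}) | theta^Q)   (equation 5)
    with torch.no_grad():               # treat y_t as a fixed target
        y = r + gamma * critic(s_next, actor(s_next))
    # L(theta^Q) = E[(Q(s_t, a_t | theta^Q) - y_t)^2]                 (equation 4)
    return F.mse_loss(critic(s, a), y)
```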

The use of large, non-linear function approximators for learning value or action-value functions has often been avoided in the past since theoretical performance guarantees are impossible and, in practice, learning tends to be unstable. Recently, (Mnih et al., 2013; 2015) adapted the Q-learning algorithm in order to make effective use of large neural networks as function approximators. Their algorithm was able to learn to play Atari games from pixels. In order to scale Q-learning they introduced two major changes: the use of a replay buffer, and a separate target network for calculating $y_t$. We employ these in the context of DDPG and explain their implementation in the next section.

3 ALGORITHM

It is not possible to straightforwardly apply Q-learning to continuous action spaces, because in continuous spaces finding the greedy policy requires an optimization of $a_t$ at every timestep; this optimization is too slow to be practical with large, unconstrained function approximators and nontrivial action spaces. Instead, here we used an actor-critic approach based on the DPG algorithm (Silver et al., 2014).

The DPG algorithm maintains a parameterized actor function $\mu(s \mid \theta^{\mu})$ which specifies the current policy by deterministically mapping states to a specific action. The critic $Q(s, a)$ is learned using the Bellman equation as in Q-learning. The actor is updated by applying the chain rule to the expected return from the start distribution $J$ with respect to the actor parameters:

$$\nabla_{\theta^{\mu}} J \approx E_{s_t \sim \rho^{\beta}}\left[ \nabla_{\theta^{\mu}} Q(s, a \mid \theta^Q)\big|_{s=s_t,\, a=\mu(s_t \mid \theta^{\mu})} \right] = E_{s_t \sim \rho^{\beta}}\left[ \nabla_{a} Q(s, a \mid \theta^Q)\big|_{s=s_t,\, a=\mu(s_t)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t} \right] \tag{6}$$

Silver et al. (2014) proved that this is the policy gradient, the gradient of the policy's performance.
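
In a framework with automatic differentiation, equation (6) can be realized by maximizing the critic's value of the actor's action, i.e. minimizing its negative; backpropagation then applies exactly the chain rule shown in the factored form above. A hedged PyTorch sketch (the network and optimizer names are our assumptions, not the paper's code):

```python
import torch

def actor_update(actor, critic, actor_optimizer, s):
    actor_optimizer.zero_grad()
    # Sample-based estimate of -J: autograd propagates dQ/da through the
    # critic into dmu/dtheta_mu, i.e. the chain rule of equation (6).
    loss = -critic(s, actor(s)).mean()
    loss.backward()
    actor_optimizer.step()
```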

As with Q-learning, introducing non-linear function approximators means that convergence is no longer guaranteed. However, such approximators appear essential in order to learn and generalize on large state spaces. NFQCA (Hafner & Riedmiller, 2011), which uses the same update rules as DPG but with neural network function approximators, uses batch learning for stability, which is intractable for large networks. A minibatch version of NFQCA which does not reset the policy at each update, as would be required to scale to large networks, is equivalent to the original DPG, which we compare to here. Our contribution here is to provide modifications to DPG, inspired by the success of DQN, which allow it to use neural network function approximators to learn in large state and action spaces online. We refer to our algorithm as Deep DPG (DDPG, Algorithm 1).

One challenge when using neural networks for reinforcement learning is that most optimization algorithms assume that the samples are independently and identically distributed. Obviously, when the samples are generated from exploring sequentially in an environment this assumption no longer holds. Additionally, to make efficient use of hardware optimizations, it is essential to learn in mini-batches, rather than online.

As in DQN, we used a replay buffer to address these issues. The replay buffer is a finite sized cache $R$. Transitions were sampled from the environment according to the exploration policy and the tuple $(s_t, a_t, r_t, s_{t+1})$ was stored in the replay buffer. When the replay buffer was full the oldest samples were discarded. At each timestep the actor and critic are updated by sampling a minibatch uniformly from the buffer. Because DDPG is an off-policy algorithm, the replay buffer can be large, allowing the algorithm to benefit from learning across a set of uncorrelated transitions.
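
A minimal replay-buffer sketch in Python (an illustration under our own assumptions, not the paper's implementation): a fixed-size FIFO cache of transition tuples with uniform minibatch sampling.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)     # oldest samples are discarded

    def __len__(self):
        return len(self.buffer)

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        batch = random.sample(self.buffer, batch_size)   # uniform, no replacement
        s, a, r, s_next = zip(*batch)
        return list(s), list(a), list(r), list(s_next)
```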

Directly implementing Q learning (equation 4) with neural networks proved to be unstable in many environments. Since the network $Q(s, a \mid \theta^Q)$ being updated is also used in calculating the target value (equation 5), the Q update is prone to divergence. Our solution is similar to the target network used in (Mnih et al., 2013) but modified for actor-critic and using "soft" target updates, rather than directly copying the weights. We create a copy of the actor and critic networks, $Q'(s, a \mid \theta^{Q'})$ and $\mu'(s \mid \theta^{\mu'})$ respectively, that are used for calculating the target values. The weights of these target networks are then updated by having them slowly track the learned networks: $\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$ with $\tau \ll 1$. This means that the target values are constrained to change slowly, greatly improving the stability of learning. This simple change moves the relatively unstable problem of learning the action-value function closer to the case of supervised learning, a problem for which robust solutions exist. We found that having both a target $\mu'$ and $Q'$ was required to have stable targets $y_i$ in order to consistently train the critic without divergence. This may slow learning, since the target network delays the propagation of value estimations. However, in practice we found this was greatly outweighed by the stability of learning.
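
The soft update $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ is a one-liner per parameter; the sketch below assumes PyTorch modules and a plausible value of $\tau$ (the exact setting belongs to the supplementary material, not this text).

```python
import torch

def soft_update(target_net, net, tau=1e-3):
    # theta' <- tau * theta + (1 - tau) * theta', applied parameter-wise
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```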

When learning from low dimensional feature vector observations, the different components of the observation may have different physical units (for example, positions versus velocities) and the ranges may vary across environments. This can make it difficult for the network to learn effectively and may make it difficult to find hyper-parameters which generalise across environments with different scales of state values.

One approach to this problem is to manually scale the features so they are in similar ranges across environments and units. We address this issue by adapting a recent technique from deep learning called batch normalization (Ioffe & Szegedy, 2015). This technique normalizes each dimension across the samples in a minibatch to have unit mean and variance. In addition, it maintains a running average of the mean and variance to use for normalization during testing (in our case, during exploration or evaluation). In deep networks, it is used to minimize covariance shift during training, by ensuring that each layer receives whitened input. In the low-dimensional case, we used batch normalization on the state input and all layers of the $\mu$ network and all layers of the $Q$ network prior to the action input (details of the networks are given in the supplementary material). With batch normalization, we were able to learn effectively across many different tasks with differing types of units, without needing to manually ensure the units were within a set range.
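
As an illustration only (layer sizes and activations are our assumptions; the paper's exact architecture is in its supplementary material), a low-dimensional actor with batch normalization on the state input and on the hidden layers might look like this in PyTorch:

```python
import torch.nn as nn

def make_actor(state_dim, action_dim, hidden=(400, 300)):
    # BatchNorm1d on the raw state and after each hidden layer, as described above
    return nn.Sequential(
        nn.BatchNorm1d(state_dim),
        nn.Linear(state_dim, hidden[0]), nn.BatchNorm1d(hidden[0]), nn.ReLU(),
        nn.Linear(hidden[0], hidden[1]), nn.BatchNorm1d(hidden[1]), nn.ReLU(),
        nn.Linear(hidden[1], action_dim), nn.Tanh(),   # bounded, real-valued actions
    )
```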

A major challenge of learning in continuous action spaces is exploration. An advantage of off-policy algorithms such as DDPG is that we can treat the problem of exploration independently from the learning algorithm. We constructed an exploration policy $\mu'$ by adding noise sampled from a noise process $N$ to our actor policy:

$$\mu'(s_t) = \mu(s_t \mid \theta^{\mu}_t) + N \tag{7}$$

$N$ can be chosen to suit the environment. As detailed in the supplementary materials, we used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) to generate temporally correlated exploration for exploration efficiency in physical control problems with inertia (similar use of autocorrelated noise was introduced in (Wawrzyński, 2015)).
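
A sketch of an Ornstein-Uhlenbeck noise process usable for this purpose (the theta, sigma and dt values here are illustrative defaults, not the paper's settings):

```python
import numpy as np

class OUNoise:
    """Temporally correlated noise: dx = theta*(mu - x)*dt + sigma*sqrt(dt)*dW."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(dim, mu, dtype=np.float64)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        dx = (self.theta * (self.mu - self.x) * self.dt
              + self.sigma * np.sqrt(self.dt) * np.random.randn(self.x.shape[0]))
        self.x = self.x + dx
        return self.x.copy()
```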

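The paper summarizes the full procedure as Algorithm 1; the compressed outline below is only our illustrative reconstruction of one DDPG training loop. It assumes the helper sketches above (ReplayBuffer, OUNoise, actor_update, soft_update), a classic Gym-style `env` returning `(obs, reward, done, info)`, and feed-forward actor/critic modules; none of these names or hyper-parameter values come from the paper itself.

```python
import copy
import numpy as np
import torch
import torch.nn.functional as F

def train_ddpg(env, actor, critic, episodes=1000, batch_size=64, gamma=0.99):
    # Target networks start as copies of the learned networks
    target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    buffer, noise = ReplayBuffer(), OUNoise(dim=env.action_space.shape[0])

    for _ in range(episodes):
        s, done = env.reset(), False
        noise.reset()
        while not done:
            # Select action from the exploration policy mu'(s) = mu(s) + N
            actor.eval()
            with torch.no_grad():
                a = actor(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
            actor.train()
            a = a.squeeze(0).numpy() + noise.sample()
            s_next, r, done, _ = env.step(a)
            buffer.add(s, a, r, s_next)
            s = s_next

            if len(buffer) < batch_size:
                continue
            sb, ab, rb, snb = (torch.as_tensor(np.array(x), dtype=torch.float32)
                               for x in buffer.sample(batch_size))
            # Critic update with targets from the slowly tracking target networks
            # (terminal-state masking omitted for brevity)
            with torch.no_grad():
                y = rb.unsqueeze(1) + gamma * target_critic(snb, target_actor(snb))
            critic_opt.zero_grad()
            F.mse_loss(critic(sb, ab), y).backward()
            critic_opt.step()
            # Actor update (equation 6), then soft-update both target networks
            actor_update(actor, critic, actor_opt, sb)
            soft_update(target_actor, actor)
            soft_update(target_critic, critic)
```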

4 RESULTS

We constructed simulated physical environments of varying levels of difficulty to test our algorithm. This included classic reinforcement learning environments such as cartpole, as well as difficult, high dimensional tasks such as gripper, tasks involving contacts such as puck striking (canada) and locomotion tasks such as cheetah (Wawrzynski, 2009). In all domains but cheetah the actions were torques applied to the actuated joints. These environments were simulated using MuJoCo (Todorov et al., 2012). Figure 1 shows renderings of some of the environments used in the task (the supplementary contains details of the environments and you can view some of the learned policies at https://goo.gl/J4PIAz).

In all tasks, we ran experiments using both a low-dimensional state description (such as joint angles and positions) and high-dimensional renderings of the environment. As in DQN (Mnih et al., 2013; 2015), in order to make the problems approximately fully observable in the high dimensional environment we used action repeats. For each timestep of the agent, we step the simulation 3 timesteps, repeating the agent’s action and rendering each time. Thus the observation reported to the agent contains 9 feature maps (the RGB of each of the 3 renderings) which allows the agent to infer velocities using the differences between frames. The frames were downsampled to 64x64 pixels and the 8-bit RGB values were converted to floating point scaled to [0, 1]. See supplementary information for details of our network structure and hyperparameters.
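
A sketch of the kind of preprocessing just described (illustrative only; the paper does not specify an implementation, and the scikit-image `resize` helper is our choice): stack the RGB renderings from the 3 repeated steps into 9 feature maps, downsample to 64x64 and rescale the 8-bit values to [0, 1].

```python
import numpy as np
from skimage.transform import resize

def preprocess(frames):
    """frames: list of 3 H x W x 3 uint8 renderings from the repeated steps."""
    small = [resize(f, (64, 64), preserve_range=True) for f in frames]
    stacked = np.concatenate(small, axis=-1)        # 64 x 64 x 9 feature maps
    return (stacked / 255.0).astype(np.float32)     # scale 8-bit values to [0, 1]
```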

We evaluated the policy periodically during training by testing it without exploration noise. Figure 2 shows the performance curve for a selection of environments. We also report results with components of our algorithm (i.e. the target network or batch normalization) removed. In order to perform well across all tasks, both of these additions are necessary. In particular, learning without a target network, as in the original work with DPG, is very poor in many environments.

Surprisingly, in some simpler tasks, learning policies from pixels is just as fast as learning using the low-dimensional state descriptor. This may be due to the action repeats making the problem simpler. It may also be that the convolutional layers provide an easily separable representation of state space, which is straightforward for the higher layers to learn on quickly.

Table 1 summarizes DDPG’s performance across all of the environments (results are averaged over 5 replicas). We normalized the scores using two baselines. The first baseline is the mean return from a naive policy which samples actions from a uniform distribution over the valid action space. The second baseline is iLQG (Todorov & Li, 2005), a planning based solver with full access to the underlying physical model and its derivatives. We normalize scores so that the naive policy has a mean score of 0 and iLQG has a mean score of 1. DDPG is able to learn good policies on many of the tasks, and in many cases some of the replicas learn policies which are superior to those found by iLQG, even when learning directly from pixels.
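
The normalization described here is a simple affine rescaling; as a small illustration (not code from the paper), with the naive random-policy return mapped to 0 and the iLQG return mapped to 1:

```python
def normalized_score(score, naive_score, ilqg_score):
    # 0 for the naive uniform-random policy, 1 for the iLQG planner
    return (score - naive_score) / (ilqg_score - naive_score)
```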

It can be challenging to learn accurate value estimates. Q-learning, for example, is prone to overestimating values (Hasselt, 2010). We examined DDPG's estimates empirically by comparing the values estimated by Q after training with the true returns seen on test episodes. Figure 3 shows that in simple tasks DDPG estimates returns accurately without systematic biases. For harder tasks the Q estimates are worse, but DDPG is still able to learn good policies.

To demonstrate the generality of our approach we also include Torcs, a racing game where the actions are acceleration, braking and steering. Torcs has previously been used as a testbed in other policy learning approaches (Koutník et al., 2014b). We used an identical network architecture and learning algorithm hyper-parameters to the physics tasks but altered the noise process for exploration because of the very different time scales involved. Both in the low-dimensional case and from pixels, some replicas were able to learn reasonable policies that are able to complete a circuit around the track, though other replicas failed to learn a sensible policy.


Figure 1: Example screenshots of a sample of environments we attempt to solve with DDPG. In order from the left: the cartpole swing-up task, a reaching task, a grasp and move task, a puck-hitting task, a monoped balancing task, two locomotion tasks and Torcs (driving simulator). We tackle all tasks using both low-dimensional feature vector and high-dimensional pixel inputs. Detailed descriptions of the environments are provided in the supplementary. Movies of some of the learned policies are available at https://goo.gl/J4PIAz.


Figure 2: Performance curves for a selection of domains using variants of DPG: original DPG algorithm (minibatch NFQCA) with batch normalization (light grey), with target network (dark grey), with target networks and batch normalization (green), with target networks from pixel-only inputs (blue). Target networks are crucial.

5 RELATED WORK

The original DPG paper evaluated the algorithm with toy problems using tile-coding and linear function approximators. It demonstrated data efficiency advantages for off-policy DPG over both on- and off-policy stochastic actor critic. It also solved a more challenging task in which a multi-jointed octopus arm had to strike a target with any part of the limb. However, that paper did not demonstrate scaling the approach to large, high-dimensional observation spaces as we have here.

It has often been assumed that standard policy search methods such as those explored in the present work are simply too fragile to scale to difficult problems (Levine et al., 2015). Standard policy search is thought to be difficult because it deals simultaneously with complex environmental dynamics and a complex policy. Indeed, most past work with actor-critic and policy optimization approaches has had difficulty scaling up to more challenging problems (Deisenroth et al., 2013). Typically, this is due to instability in learning, wherein progress on a problem is either destroyed by subsequent learning updates, or else learning is too slow to be practical.


Figure 3: Density plot showing estimated Q values versus observed returns sampled from test episodes on 5 replicas. In simple domains such as pendulum and cartpole the Q values are quite accurate. In more complex tasks, the Q estimates are less accurate, but can still be used to learn competent policies. The dotted line indicates unity; units are arbitrary.

Table 1: Performance after training across all environments for at most 2.5 million steps. We report both the average and best observed (across 5 runs). All scores, except Torcs, are normalized so that a random agent receives 0 and a planning algorithm 1; for Torcs we present the raw reward score. We include results from the DDPG algorithm in the low-dimensional (lowd) version of the environment and high-dimensional (pix). For comparison we also include results from the original DPG algorithm with a replay buffer and batch normalization (cntrl).

Recent work with model-free policy search has demonstrated that it may not be as fragile as previously supposed. Wawrzyński (2009) and Wawrzyński & Tanwani (2013) trained stochastic policies in an actor-critic framework with a replay buffer. Concurrent with our work, Balduzzi & Ghifary (2015) extended the DPG algorithm with a "deviator" network which explicitly learns $\partial Q / \partial a$. However, they only train on two low-dimensional domains. Heess et al. (2015) introduced SVG(0) which also uses a Q-critic but learns a stochastic policy. DPG can be considered the deterministic limit of SVG(0). The techniques we described here for scaling DPG are also applicable to stochastic policies by using the reparametrization trick (Heess et al., 2015; Schulman et al., 2015a).

Another approach, trust region policy optimization (TRPO) (Schulman et al., 2015b), directly constructs stochastic neural network policies without decomposing problems into optimal control and supervised phases. This method produces near monotonic improvements in return by making carefully chosen updates to the policy parameters, constraining updates to prevent the new policy from diverging too far from the existing policy. This approach does not require learning an action-value function, and (perhaps as a result) appears to be significantly less data efficient.

To combat the challenges of the actor-critic approach, recent work with guided policy search (GPS) algorithms (e.g., (Levine et al., 2015)) decomposes the problem into three phases that are relatively easy to solve: first, it uses full-state observations to create locally-linear approximations of the dynamics around one or more nominal trajectories, and then uses optimal control to find the locally-linear optimal policy along these trajectories; finally, it uses supervised learning to train a complex, non-linear policy (e.g. a deep neural network) to reproduce the state-to-action mapping of the optimized trajectories.

This approach has several benefits, including data efficiency, and has been applied successfully to a variety of real-world robotic manipulation tasks using vision. In these tasks GPS uses a similar convolutional policy network to ours with 2 notable exceptions: 1. it uses a spatial softmax to reduce the dimensionality of visual features into a single $(x, y)$ coordinate for each feature map, and 2. the policy also receives direct low-dimensional state information about the configuration of the robot at the first fully connected layer in the network. Both likely increase the power and data efficiency of the algorithm and could easily be exploited within the DDPG framework.

PILCO (Deisenroth & Rasmussen, 2011) uses Gaussian processes to learn a non-parametric, probabilistic model of the dynamics. Using this learned model, PILCO calculates analytic policy gradients and achieves impressive data efficiency in a number of control problems. However, due to the high computational demand, PILCO is “impractical for high-dimensional problems” (Wahlström et al., 2015). It seems that deep function approximators are the most promising approach for scaling reinforcement learning to large, high-dimensional domains.

Wahlström et al. (2015) used a deep dynamical model network along with model predictive control to solve the pendulum swing-up task from pixel input. They trained a differentiable forward model and encoded the goal state into the learned latent space. They used model-predictive control over the learned model to find a policy for reaching the target. However, this approach is only applicable to domains with goal states that can be demonstrated to the algorithm.

Recently, evolutionary approaches have been used to learn competitive policies for Torcs from pixels using compressed weight parametrizations (Koutník et al., 2014a) or unsupervised learning (Koutník et al., 2014b) to reduce the dimensionality of the evolved weights. It is unclear how well these approaches generalize to other problems.

6 CONCLUSION

The work combines insights from recent advances in deep learning and reinforcement learning, resulting in an algorithm that robustly solves challenging problems across a variety of domains with continuous action spaces, even when using raw pixels for observations. As with most reinforcement learning algorithms, the use of non-linear function approximators nullifies any convergence guarantees; however, our experimental results demonstrate stable learning without the need for any modifications between environments.

Interestingly, all of our experiments used substantially fewer steps of experience than was used by DQN learning to find solutions in the Atari domain. Nearly all of the problems we looked at were solved within 2.5 million steps of experience (and usually far fewer), a factor of 20 fewer steps than DQN requires for good Atari solutions. This suggests that, given more simulation time, DDPG may solve even more difficult problems than those considered here.

A few limitations to our approach remain. Most notably, as with most model-free reinforcement approaches, DDPG requires a large number of training episodes to find solutions. However, we believe that a robust model-free approach may be an important component of larger systems which may attack these limitations (Gläscher et al., 2010).

REFERENCES


1. Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

2. Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

3. Silver, David, Lever, Guy, Heess, Nicolas, Degris, Thomas, Wierstra, Daan, and Riedmiller, Martin. Deterministic policy gradient algorithms. In ICML, 2014.

4. Prokhorov, Danil V, Wunsch, Donald C, et al. Adaptive critic designs. Neural Networks, IEEE Transactions on, 8(5):997–1007, 1997.

5. Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

6. Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

7. Tassa, Yuval, Erez, Tom, and Todorov, Emanuel. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 4906–4913. IEEE, 2012.

8. Watkins, Christopher JCH and Dayan, Peter. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

9. Hafner, Roland and Riedmiller, Martin. Reinforcement learning in feedback control. Machine Learning, 84(1-2):137–169, 2011.

10. Uhlenbeck, George E and Ornstein, Leonard S. On the theory of the Brownian motion. Physical Review, 36(5):823, 1930.

11. Wawrzyński, Paweł. Control policy with autocorrelated noise in reinforcement learning for robotics. International Journal of Machine Learning and Computing, 5:91–95, 2015.

12. Wawrzyński, Paweł. Real-time reinforcement learning by sequential actor–critics and experience replay. Neural Networks, 22(10):1484–1497, 2009.

13. Todorov, Emanuel, Erez, Tom, and Tassa, Yuval. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.

14. Todorov, Emanuel and Li, Weiwei. A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In American Control Conference, 2005. Proceedings of the 2005, pp. 300–306. IEEE, 2005.

15. Hasselt, Hado V. Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613–2621, 2010.

16. Koutník, Jan, Schmidhuber, Jürgen, and Gomez, Faustino. Online evolution of deep convolutional network for vision-based reinforcement learning. In From Animals to Animats 13, pp. 260–269. Springer, 2014b.

17. Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. arXiv preprint arXiv:1504.00702, 2015.

18. Deisenroth, Marc Peter, Neumann, Gerhard, Peters, Jan, et al. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.

19. Wawrzyński, Paweł and Tanwani, Ajay Kumar. Autonomous reinforcement learning with experience replay. Neural Networks, 41:156–167, 2013.

20. Balduzzi, David and Ghifary, Muhammad. Compatible value gradients for reinforcement learning of continuous deep policies. arXiv preprint arXiv:1509.03005, 2015.

21. Heess, N., Hunt, J. J., Lillicrap, T. P., and Silver, D. Memory-based control with recurrent neural networks. NIPS Deep Reinforcement Learning Workshop (arXiv:1512.04455), 2015.

22. Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I, and Abbeel, Pieter. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015b.

23. Wahlström, Niklas, Schön, Thomas B, and Deisenroth, Marc Peter. From pixels to torques: Policy learning with deep dynamical models. arXiv preprint arXiv:1502.02251, 2015.

24. Koutník, Jan, Schmidhuber, Jürgen, and Gomez, Faustino. Evolving deep unsupervised convolutional networks for vision-based reinforcement learning. In Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, pp. 541–548. ACM, 2014a.

25. Gläscher, Jan, Daw, Nathaniel, Dayan, Peter, and O'Doherty, John P. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron, 66(4):585–595, 2010.
