Deterministic Policy Gradient Algorithms (DPG Reinforcement Learning): Paper Translation


Deterministic Policy Gradient Algorithms

David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra & Martin Riedmiller


Abstract

In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.


1. Introduction

Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces. The basic idea is to represent the policy by a parametric probability distribution $\pi_\theta(a\mid s) = \mathbb{P}[a\mid s;\theta]$ that stochastically selects action $a$ in state $s$ according to parameter vector $\theta$. Policy gradient algorithms typically proceed by sampling this stochastic policy and adjusting the policy parameters in the direction of greater cumulative reward.

In this paper we instead consider deterministic policies $a = \mu_\theta(s)$. It is natural to wonder whether the same approach can be followed as for stochastic policies: adjusting the policy parameters in the direction of the policy gradient. It was previously believed that the deterministic policy gradient did not exist, or could only be obtained when using a model (Peters, 2010). However, we show that the deterministic policy gradient does indeed exist, and furthermore it has a simple model-free form that simply follows the gradient of the action-value function. In addition, we show that the deterministic policy gradient is the limiting case, as policy variance tends to zero, of the stochastic policy gradient.

From a practical viewpoint, there is a crucial difference between the stochastic and deterministic policy gradients. In the stochastic case, the policy gradient integrates over both state and action spaces, whereas in the deterministic case it only integrates over the state space. As a result, computing the stochastic policy gradient may require more samples, especially if the action space has many dimensions.

In order to explore the full state and action space, a stochastic policy is often necessary. To ensure that our deterministic policy gradient algorithms continue to explore satisfactorily, we introduce an off-policy learning algorithm. The basic idea is to choose actions according to a stochastic behaviour policy (to ensure adequate exploration), but to learn about a deterministic target policy (exploiting the efficiency of the deterministic policy gradient). We use the deterministic policy gradient to derive an off-policy actor-critic algorithm that estimates the action-value function using a differentiable function approximator, and then updates the policy parameters in the direction of the approximate action-value gradient. We also introduce a notion of compatible function approximation for deterministic policy gradients, to ensure that the approximation does not bias the policy gradient.

We apply our deterministic actor-critic algorithms to several benchmark problems: a high-dimensional bandit; several standard benchmark reinforcement learning tasks with low dimensional action spaces; and a high-dimensional task for controlling an octopus arm. Our results demonstrate a significant performance advantage to using deterministic policy gradients over stochastic policy gradients, particularly in high dimensional tasks. Furthermore, our algorithms require no more computation than prior methods: the computational cost of each update is linear in the action dimensionality and the number of policy parameters. Finally, there are many applications (for example in robotics) where a differentiable control policy is provided, but where there is no functionality to inject noise into the controller. In these cases, the stochastic policy gradient is inapplicable, whereas our methods may still be useful.


2. Background

2.1. Preliminaries

We study reinforcement learning and control problems in which an agent acts in a stochastic environment by sequentially choosing actions over a sequence of time steps, in order to maximise a cumulative reward. We model the problem as a Markov decision process (MDP) which comprises: a state space $S$, an action space $A$, an initial state distribution with density $p_1(s_1)$, a stationary transition dynamics distribution with conditional density $p(s_{t+1}\mid s_t,a_t)$ satisfying the Markov property $p(s_{t+1}\mid s_1,a_1,\ldots,s_t,a_t) = p(s_{t+1}\mid s_t,a_t)$ for any trajectory $s_1,a_1,s_2,a_2,\ldots,s_T,a_T$ in state-action space, and a reward function $r\colon S\times A\to\mathbb{R}$. A policy is used to select actions in the MDP. In general the policy is stochastic and denoted by $\pi_\theta\colon S\to\mathcal{P}(A)$, where $\mathcal{P}(A)$ is the set of probability measures on $A$, $\theta\in\mathbb{R}^n$ is a vector of $n$ parameters, and $\pi_\theta(a_t\mid s_t)$ is the conditional probability density at $a_t$ associated with the policy. The agent uses its policy to interact with the MDP to give a trajectory of states, actions and rewards, $h_{1:T} = s_1,a_1,r_1,\ldots,s_T,a_T,r_T$ over $S\times A\times\mathbb{R}$. The return $r_t^\gamma$ is the total discounted reward from time-step $t$ onwards, $r_t^\gamma = \sum_{k=t}^{\infty}\gamma^{k-t}r(s_k,a_k)$ where $0<\gamma<1$. Value functions are defined to be the expected total discounted reward, $V^\pi(s) = E[r_1^\gamma\mid S_1=s;\pi]$ and $Q^\pi(s,a) = E[r_1^\gamma\mid S_1=s,A_1=a;\pi]$. The agent's goal is to obtain a policy which maximises the cumulative discounted reward from the start state, denoted by the performance objective $J(\pi) = E[r_1^\gamma\mid\pi]$.
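
To make these definitions concrete, here is a minimal Python sketch (not from the paper; the function name and the finite-horizon truncation are our own choices) that computes the return $r_t^\gamma$ from a sampled trajectory of rewards:

```python
import numpy as np

def discounted_return(rewards, gamma, t=0):
    """Return r_t^gamma = sum_{k>=t} gamma^(k-t) * r_k for a sampled trajectory.

    `rewards` is the sequence of rewards observed while following some policy;
    the infinite-horizon sum is truncated at the end of the trajectory.
    """
    tail = np.asarray(rewards, dtype=float)[t:]
    discounts = gamma ** np.arange(len(tail))
    return float(np.sum(discounts * tail))

# Example: the return from time step 0 of a short trajectory.
print(discounted_return([1.0, 0.0, 0.5, 1.0], gamma=0.9))
```

Averaging such returns over trajectories started from $p_1(s_1)$ gives Monte Carlo estimates of $V^\pi$ and $Q^\pi$.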

We denote the density at state $s'$ after transitioning for $t$ time steps from state $s$ by $p(s\to s',t,\pi)$. We also denote the (improper) discounted state distribution by $\rho^\pi(s') := \int_S \sum_{t=1}^{\infty}\gamma^{t-1}\,p_1(s)\,p(s\to s',t,\pi)\,ds$. We can then write the performance objective as an expectation,
$$J(\pi_\theta)=\int_S \rho^\pi(s)\int_A \pi_\theta(s,a)\,r(s,a)\,da\,ds = E_{s\sim\rho^\pi,\,a\sim\pi_\theta}\!\left[r(s,a)\right] \tag{1}$$

where $E_{s\sim\rho}[\cdot]$ denotes the (improper) expected value with respect to the discounted state distribution $\rho(s)$. In the remainder of the paper we suppose for simplicity that $A = \mathbb{R}^m$ and that $S$ is a compact subset of $\mathbb{R}^d$.
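
As an illustration of how the expectation in Equation 1 can be estimated by sampling, the sketch below (our own illustration, assuming a Gym-style `reset`/`step` environment interface that is not part of the paper) weights the reward at each visited state-action pair by $\gamma^{t-1}$, which is exactly how drawing $s\sim\rho^\pi$ enters the expectation; the result coincides with the expected discounted return:

```python
import numpy as np

def estimate_J(env, policy, gamma, num_episodes=100, horizon=1000):
    """Sample-based estimate of Equation 1: J(pi) = E_{s~rho^pi, a~pi}[r(s, a)].

    Each reward at time step t is weighted by gamma^(t-1), which is how the
    (improper) discounted state distribution rho^pi appears in the expectation.
    """
    estimates = []
    for _ in range(num_episodes):
        s = env.reset()
        total, weight = 0.0, 1.0           # weight = gamma^(t-1)
        for _ in range(horizon):
            a = policy(s)                  # a ~ pi_theta(.|s)
            s, r, done = env.step(a)       # assumed environment interface
            total += weight * r
            weight *= gamma
            if done:
                break
        estimates.append(total)
    return float(np.mean(estimates))
```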


2.2. Stochastic Policy Gradient Theorem

Policy gradient algorithms are perhaps the most popular class of continuous action reinforcement learning algorithms. The basic idea behind these algorithms is to adjust the parameters $\theta$ of the policy in the direction of the performance gradient $\nabla_\theta J(\pi_\theta)$. The fundamental result underlying these algorithms is the policy gradient theorem (Sutton et al., 1999),

$$\nabla_\theta J(\pi_\theta)=\int_S \rho^\pi(s)\int_A \nabla_\theta\pi_\theta(a\mid s)\,Q^\pi(s,a)\,da\,ds = E_{s\sim\rho^\pi,\,a\sim\pi_\theta}\!\left[\nabla_\theta\log\pi_\theta(a\mid s)\,Q^\pi(s,a)\right] \tag{2}$$

The policy gradient is surprisingly simple. In particular, despite the fact that the state distribution $\rho^\pi(s)$ depends on the policy parameters, the policy gradient does not depend on the gradient of the state distribution.

This theorem has important practical value, because it reduces the computation of the performance gradient to a simple expectation. The policy gradient theorem has been used to derive a variety of policy gradient algorithms (Degris et al., 2012a), by forming a sample-based estimate of this expectation. One issue that these algorithms must address is how to estimate the action-value function $Q^\pi(s,a)$. Perhaps the simplest approach is to use a sample return $r_t^\gamma$ to estimate the value of $Q^\pi(s_t,a_t)$, which leads to a variant of the REINFORCE algorithm (Williams, 1992).
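
For example, a REINFORCE-style estimate of Equation 2 for a one-dimensional Gaussian policy with a linear mean (a deliberately simple choice of ours, not the paper's setting) looks as follows; `returns[t]` is the sampled return $r_t^\gamma$ standing in for $Q^\pi(s_t,a_t)$:

```python
import numpy as np

def reinforce_gradient(states, actions, returns, theta, sigma=0.5):
    """Sample-based estimate of the policy gradient in Equation 2 for a Gaussian
    policy pi_theta(a|s) = N(theta^T s, sigma^2), using sampled returns r_t^gamma
    in place of Q^pi(s_t, a_t) (the REINFORCE variant)."""
    grad = np.zeros_like(theta, dtype=float)
    for s, a, ret in zip(states, actions, returns):
        # grad_theta log pi_theta(a|s) for this policy is (a - theta^T s) / sigma^2 * s
        score = (a - theta @ s) / sigma**2 * s
        grad += score * ret
    return grad / len(states)

# Stochastic gradient ascent on the policy parameters:
# theta = theta + alpha * reinforce_gradient(states, actions, returns, theta)
```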


2.3. Stochastic Actor-Critic Algorithms

The actor-critic is a widely used architecture based on the policy gradient theorem (Sutton et al., 1999; Peters et al., 2005; Bhatnagar et al., 2007; Degris et al., 2012a). The actor-critic consists of two eponymous components. An actor adjusts the parameters $\theta$ of the stochastic policy $\pi_\theta(s)$ by stochastic gradient ascent of Equation 2. Instead of the unknown true action-value function $Q^\pi(s,a)$ in Equation 2, an action-value function $Q^w(s,a)$ is used, with parameter vector $w$. A critic estimates the action-value function $Q^w(s,a)\approx Q^\pi(s,a)$ using an appropriate policy evaluation algorithm such as temporal-difference learning. In general, substituting a function approximator $Q^w(s,a)$ for the true action-value function $Q^\pi(s,a)$ may introduce bias. However, if the function approximator is compatible such that i) $Q^w(s,a) = \nabla_\theta\log\pi_\theta(a\mid s)^\intercal w$ and ii) the parameters $w$ are chosen to minimise the mean-squared error $\epsilon^2(w) = E_{s\sim\rho^\pi,\,a\sim\pi_\theta}\!\left[(Q^w(s,a)-Q^\pi(s,a))^2\right]$, then there is no bias (Sutton et al., 1999),

$$\nabla_\theta J(\pi_\theta)=E_{s\sim\rho^\pi,\,a\sim\pi_\theta}\!\left[\nabla_\theta\log\pi_\theta(a\mid s)\,Q^w(s,a)\right] \tag{3}$$

More intuitively, condition i) says that compatible function approximators are linear in "features" of the stochastic policy, $\nabla_\theta\log\pi_\theta(a\mid s)$, and condition ii) requires that the parameters are the solution to the linear regression problem that estimates $Q^\pi(s,a)$ from these features. In practice, condition ii) is usually relaxed in favour of policy evaluation algorithms that estimate the value function more efficiently by temporal-difference learning (Bhatnagar et al., 2007; Degris et al., 2012b; Peters et al., 2005); indeed if both i) and ii) are satisfied then the overall algorithm is equivalent to not using a critic at all (Sutton et al., 2000), much like the REINFORCE algorithm (Williams, 1992).
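
A minimal sketch of one on-policy actor-critic update with a compatible critic $Q^w(s,a)=\nabla_\theta\log\pi_\theta(a\mid s)^\intercal w$ is shown below, again assuming the 1-D linear-Gaussian policy used above (our illustrative choice, not the paper's). As discussed in the text, the critic relaxes condition ii) and updates $w$ by temporal-difference learning rather than solving the regression exactly:

```python
import numpy as np

def score(theta, s, a, sigma=0.5):
    """grad_theta log pi_theta(a|s) for the Gaussian policy N(theta^T s, sigma^2)."""
    return (a - theta @ s) / sigma**2 * s

def actor_critic_step(theta, w, s, a, r, s_next, a_next,
                      alpha_theta, alpha_w, gamma, sigma=0.5):
    """One on-policy actor-critic update with the compatible critic
    Q^w(s, a) = score(theta, s, a)^T w (condition i)."""
    phi = score(theta, s, a, sigma)
    phi_next = score(theta, s_next, a_next, sigma)
    q, q_next = phi @ w, phi_next @ w
    td_error = r + gamma * q_next - q
    w = w + alpha_w * td_error * phi          # critic: TD(0) on the compatible features
    theta = theta + alpha_theta * phi * q     # actor: sample estimate of Equation 3
    return theta, w
```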


2.4. Off-Policy Actor-Critic

It is often useful to estimate the policy gradient off-policy from trajectories sampled from a distinct behaviour policy $\beta(a\mid s)\neq\pi_\theta(a\mid s)$. In an off-policy setting, the performance objective is typically modified to be the value function of the target policy, averaged over the state distribution of the behaviour policy (Degris et al., 2012b),

$$J_\beta(\pi_\theta)=\int_S\rho^\beta(s)\,V^\pi(s)\,ds=\int_S\int_A\rho^\beta(s)\,\pi_\theta(a\mid s)\,Q^\pi(s,a)\,da\,ds$$

Differentiating the performance objective and applying an approximation gives the off-policy policy-gradient (Degris et al., 2012b)

$$\nabla_\theta J_\beta(\pi_\theta)\approx\int_S\int_A\rho^\beta(s)\,\nabla_\theta\pi_\theta(a\mid s)\,Q^\pi(s,a)\,da\,ds \tag{4}$$
$$=E_{s\sim\rho^\beta,\,a\sim\beta}\!\left[\frac{\pi_\theta(a\mid s)}{\beta_\theta(a\mid s)}\nabla_\theta\log\pi_\theta(a\mid s)\,Q^\pi(s,a)\right] \tag{5}$$

This approximation drops a term that depends on the action-value gradient $\nabla_\theta Q^\pi(s,a)$; Degris et al. (2012b) argue that this is a good approximation since it can preserve the set of local optima to which gradient ascent converges. The Off-Policy Actor-Critic (OffPAC) algorithm (Degris et al., 2012b) uses a behaviour policy $\beta(a\mid s)$ to generate trajectories. A critic estimates a state-value function, $V^v(s)\approx V^\pi(s)$, off-policy from these trajectories, by gradient temporal-difference learning (Sutton et al., 2009). An actor updates the policy parameters $\theta$, also off-policy from these trajectories, by stochastic gradient ascent of Equation 5. Instead of the unknown action-value function $Q^\pi(s,a)$ in Equation 5, the temporal-difference error $\delta_t$ is used, $\delta_t = r_{t+1} + \gamma V^v(s_{t+1}) - V^v(s_t)$; this can be shown to provide an approximation to the true gradient (Bhatnagar et al., 2007). Both the actor and the critic use an importance sampling ratio $\frac{\pi_\theta(a\mid s)}{\beta_\theta(a\mid s)}$ to adjust for the fact that actions were selected according to $\pi$ rather than $\beta$.
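
The per-step OffPAC update described above can be sketched as follows (our simplifications: a 1-D linear-Gaussian target policy, a linear state-value critic $V^v(s)=v^\intercal s$, a plain importance-weighted TD(0) critic update in place of gradient temporal-difference learning, and no eligibility traces):

```python
import numpy as np

def gaussian_pdf(mean, a, sigma):
    """Density of N(mean, sigma^2) at a."""
    return np.exp(-0.5 * ((a - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def offpac_step(theta, v, s, a, r, s_next, beta_pdf,
                alpha_theta, alpha_v, gamma, sigma=0.5):
    """One off-policy update in the spirit of OffPAC (Degris et al., 2012b).

    The action `a` was drawn from a behaviour policy beta(.|s) whose density at
    `a` is `beta_pdf`; `rho` is the importance sampling ratio pi/beta and
    `td_error` is delta_t = r_{t+1} + gamma V^v(s_{t+1}) - V^v(s_t)."""
    rho = gaussian_pdf(theta @ s, a, sigma) / beta_pdf
    td_error = r + gamma * (v @ s_next) - (v @ s)
    v = v + alpha_v * rho * td_error * s                    # critic update
    score = (a - theta @ s) / sigma**2 * s                  # grad_theta log pi_theta(a|s)
    theta = theta + alpha_theta * rho * td_error * score    # actor: Equation 5 with delta_t
    return theta, v
```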
