Policy Gradient Methods for Reinforcement Learning with Function Approximation (Policy Gradient RL): Paper Translation


Policy Gradient Methods for Reinforcement Learning with Function Approximation

Richard S. Sutton, David McAllester, Satinder Singh & Yishay Mansour


Abstract

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.


Introduction

Large applications of reinforcement learning (RL) require the use of generalizing function approximators such as neural networks, decision trees, or instance-based methods. The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the "greedy" policy with respect to the estimated values (e.g., as the policy that selects in each state the action with the highest estimated value). The value-function approach has worked well in many applications, but it has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994).

Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of "best" is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods.

In this paper we explore an alternative approach to function approximation in RL. Rather than approximating a value function and using that to compute a deterministic policy, we approximate a stochastic policy directly, using an independent function approximator with its own parameters. For example, the policy might be represented by a neural network whose input is a representation of the state, whose output is action-selection probabilities, and whose weights are the policy parameters. Let $\theta$ denote the vector of policy parameters and $\rho$ the performance of the corresponding policy (e.g., the average reward per step). Then, in the policy gradient approach, the policy parameters are updated approximately proportional to the gradient:

$$\Delta\theta \approx \alpha \frac{\partial\rho}{\partial\theta} \tag{1}$$

where $\alpha$ is a positive-definite step size. If the above can be achieved, then $\theta$ can usually be assured to converge to a locally optimal policy in the performance measure $\rho$. Unlike the value-function approach, here small changes in $\theta$ can cause only small changes in the policy and in the state-visitation distribution.
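
As a concrete illustration of update (1), the following minimal sketch ascends the exact gradient of $\rho$ for a softmax policy over two actions with known expected rewards. The two-armed bandit, the step size, and all numbers are illustrative assumptions, not anything taken from the paper.

```python
import numpy as np

# Toy illustration of update (1): ascend d(rho)/d(theta) for a softmax
# policy over two actions with known expected rewards (made-up numbers).

R = np.array([1.0, 2.0])          # known expected reward of each action
theta = np.zeros(2)               # policy parameters
alpha = 0.1                       # positive step size

def pi(theta):
    """Softmax action probabilities."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def rho(theta):
    """Performance: expected reward per step under pi(theta)."""
    return pi(theta) @ R

def grad_rho(theta):
    """Exact gradient of rho for the softmax parameterization."""
    p = pi(theta)
    # d pi_a / d theta_b = p_a (1[a == b] - p_b)
    dpi = np.diag(p) - np.outer(p, p)
    return dpi @ R

for step in range(200):
    theta += alpha * grad_rho(theta)   # update (1)

print(pi(theta), rho(theta))  # probability mass shifts toward the better action
```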

In this paper we prove that an unbiased estimate of the gradient (1) can be obtained from experience using an approximate value function satisfying certain properties. Williams's (1988, 1992) REINFORCE algorithm also finds an unbiased estimate of the gradient, but without the assistance of a learned value function. REINFORCE learns much more slowly than RL methods using value functions and has received relatively little attention. Learning a value function and using it to reduce the variance of the gradient estimate appears to be essential for rapid learning. Jaakkola, Singh, and Jordan (1995) proved a result very similar to ours for the special case of function approximation corresponding to a tabular POMDP. Our result strengthens theirs and generalizes it to arbitrary differentiable function approximators. Konda and Tsitsiklis (in prep.) independently developed a very similar result to ours. See also Baxter and Bartlett (in prep.) and Marbach and Tsitsiklis (1998).

Our result also suggests a way of proving the convergence of a wide variety of algorithms based on "actor-critic" or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998). In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable function approximation is convergent to a locally optimal policy. Baird and Moore (1999) obtained a weaker but superficially similar result for their VAPS family of methods. Like policy-gradient methods, VAPS includes separately parameterized policy and value functions updated by gradient methods. However, VAPS methods do not climb the gradient of performance (expected long-term reward), but of a measure combining performance and value-function accuracy. As a result, VAPS does not converge to a locally optimal policy, except in the case that no weight is put upon value-function accuracy, in which case VAPS degenerates to REINFORCE. Similarly, Gordon's (1995) fitted value iteration is also convergent and value-based, but it does not find a locally optimal policy.


1 Policy Gradient Theorem

We consider the standard reinforcement learning framework (see, e.g., Sutton and Barto, 1998), in which a learning agent interacts with a Markov decision process (MDP). The state, action, and reward at each time $t \in \{0,1,2,\dots\}$ are denoted $s_t \in S$, $a_t \in A$, and $r_t \in \Re$ respectively. The environment's dynamics are characterized by state-transition probabilities, $P_{ss'}^a = \Pr\{s_{t+1}=s' \mid s_t=s, a_t=a\}$, and expected rewards $R_s^a = E\{r_{t+1} \mid s_t=s, a_t=a\}$, $\forall s, s' \in S, a \in A$. The agent's decision-making procedure at each time is characterized by a policy, $\pi(s,a,\theta) = \Pr\{a_t=a \mid s_t=s, \theta\}$, $\forall s \in S, a \in A$, where $\theta \in \Re^l$, for $l \ll |S|$, is a parameter vector. We assume that $\pi$ is differentiable with respect to its parameter, i.e., that $\frac{\partial\pi(s,a)}{\partial\theta}$ exists. We also usually write just $\pi(s,a)$ for $\pi(s,a,\theta)$.
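
To make the differentiability assumption concrete, here is a minimal sketch of one common parameterization: a Gibbs (softmax) policy over linear state-action features, with its closed-form gradient $\frac{\partial\pi(s,a)}{\partial\theta}$ checked against a numerical difference. The feature map and the dimensions below are made-up assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of a differentiable policy pi(s, a, theta): a Gibbs
# (softmax) distribution over linear state-action features.

rng = np.random.default_rng(0)
n_states, n_actions, l = 5, 3, 4                   # l << |S| in general
phi = rng.normal(size=(n_states, n_actions, l))    # phi(s, a) in R^l (illustrative)
theta = rng.normal(size=l)

def pi(s, theta):
    """Action probabilities pi(s, ., theta) = softmax_a(theta . phi(s, a))."""
    prefs = phi[s] @ theta
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def dpi_dtheta(s, a, theta):
    """Closed-form gradient of pi(s, a, theta) with respect to theta."""
    p = pi(s, theta)
    # d pi(s,a)/d theta = pi(s,a) * (phi(s,a) - sum_b pi(s,b) phi(s,b))
    return p[a] * (phi[s, a] - p @ phi[s])

# Numerical check that the derivative exists and matches the formula.
s, a, eps = 2, 1, 1e-6
bump = np.zeros(l); bump[0] = eps
numeric = (pi(s, theta + bump)[a] - pi(s, theta - bump)[a]) / (2 * eps)
print(numeric, dpi_dtheta(s, a, theta)[0])   # the two should agree closely
```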

With function approximation, two ways of formulating the agent's objective are useful. One is the average-reward formulation, in which policies are ranked according to their long-term expected reward per step, $\rho(\pi)$:

$$\rho(\pi) = \lim_{n\to\infty}\frac{1}{n}E\{r_1 + r_2 + \cdots + r_n \mid \pi\} = \sum_{s} d^\pi(s) \sum_{a} \pi(s,a) R_s^a$$

where $d^\pi(s) = \lim_{t\to\infty} \Pr\{s_t = s \mid s_0, \pi\}$ is the stationary distribution of states under $\pi$, which we assume exists and is independent of $s_0$ for all policies. In the average-reward formulation, the value of a state-action pair given a policy is defined as

$$Q^\pi(s,a) = \sum_{t=1}^{\infty} E\{r_t - \rho(\pi) \mid s_0=s, a_0=a, \pi\}, \quad \forall s \in S, a \in A.$$
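
A small worked example of the average-reward formulation: for a made-up tabular MDP and policy, the sketch below computes the stationary distribution $d^\pi$ of the induced Markov chain and then $\rho(\pi) = \sum_s d^\pi(s) \sum_a \pi(s,a) R_s^a$. All quantities are illustrative assumptions.

```python
import numpy as np

# Sketch: evaluate rho(pi) for a small tabular MDP in the average-reward
# formulation. P, R, and pi below are made-up illustrative numbers.

n_states, n_actions = 3, 2
rng = np.random.default_rng(1)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]

# State-transition matrix of the Markov chain induced by pi.
P_pi = np.einsum('sa,sat->st', pi, P)

# Stationary distribution d_pi: left eigenvector of P_pi with eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d_pi /= d_pi.sum()

# rho(pi) = sum_s d_pi(s) sum_a pi(s, a) R(s, a)
rho = d_pi @ np.einsum('sa,sa->s', pi, R)
print(d_pi, rho)
```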

The second formulation we cover is that in which there is a designated start state $s_0$, and we care only about the long-term reward obtained from it. We will give our results only once, but they will apply to this formulation as well under the definitions

$$\rho(\pi) = E\Big\{\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\Big|\, s_0, \pi\Big\}$$
and
$$Q^\pi(s,a) = E\Big\{\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \,\Big|\, s_t=s, a_t=a, \pi\Big\}$$

where $\gamma \in [0,1]$ is a discount rate ($\gamma = 1$ is allowed only in episodic tasks). In this formulation, we define $d^\pi(s)$ as a discounted weighting of states encountered starting at $s_0$ and then following $\pi$: $d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr\{s_t = s \mid s_0, \pi\}$.
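
For the start-state formulation, the discounted weighting can be computed by solving a linear system, since $d^\pi = \sum_t \gamma^t ((P^\pi)^\top)^t e_{s_0} = (I - \gamma (P^\pi)^\top)^{-1} e_{s_0}$. The sketch below, with made-up $P$, $R$, and $\pi$, also checks the consistency $\rho(\pi) = V^\pi(s_0) = \sum_s d^\pi(s)\sum_a \pi(s,a) R_s^a$.

```python
import numpy as np

# Sketch (start-state formulation): the discounted weighting
# d_pi(s) = sum_t gamma^t Pr{s_t = s | s_0, pi} via a linear solve.
# P, R, and pi are illustrative made-up numbers.

rng = np.random.default_rng(2)
n_states, n_actions, gamma, s0 = 4, 2, 0.9, 0

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]

P_pi = np.einsum('sa,sat->st', pi, P)     # chain induced by pi
r_pi = np.einsum('sa,sa->s', pi, R)       # expected one-step reward under pi

# d_pi = sum_t gamma^t (P_pi^T)^t e_{s0} = (I - gamma P_pi^T)^{-1} e_{s0}
e_s0 = np.eye(n_states)[s0]
d_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, e_s0)

# Consistency check: rho(pi) = V_pi(s0) should equal sum_s d_pi(s) r_pi(s).
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(V_pi[s0], d_pi @ r_pi)   # the two numbers should agree
```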

Our first result concerns the gradient of the performance metric with respect to the policy parameter:

Theorem 1 (Policy Gradient). For any MDP, in either the average-reward or start-state formulations,

$$\frac{\partial\rho}{\partial\theta} = \sum_{s} d^\pi(s) \sum_{a} \frac{\partial\pi(s,a)}{\partial\theta} Q^\pi(s,a) \tag{2}$$

Proof: See the appendix.
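
Although the proof is in the appendix, equation (2) is easy to check numerically on a small example. The sketch below compares the right-hand side of (2) against a finite-difference estimate of $\frac{\partial\rho}{\partial\theta}$ for a tiny random MDP with a tabular softmax policy, in the start-state formulation; the MDP and the parameterization are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Numerical sanity check of the policy gradient theorem (2) in the
# start-state (discounted) formulation, on a tiny random MDP with a
# tabular softmax policy. All numbers are illustrative assumptions.

rng = np.random.default_rng(3)
nS, nA, gamma, s0 = 4, 3, 0.9, 0

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
R = rng.normal(size=(nS, nA))                   # R[s, a]
theta = rng.normal(size=(nS, nA))               # one parameter per (s, a)

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def rho(theta):
    """Performance: discounted return from s0 under pi(theta)."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    return V[s0]

def gradient_via_theorem(theta):
    """Right-hand side of (2): sum_s d_pi(s) sum_a dpi/dtheta * Q_pi(s, a)."""
    pi = policy(theta)
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, R)
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = R + gamma * np.einsum('sat,t->sa', P, V)
    d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, np.eye(nS)[s0])
    grad = np.zeros_like(theta)
    for s in range(nS):
        # d pi(s,a)/d theta[s,b] = pi(s,a) (1[a == b] - pi(s,b))
        dpi = np.diag(pi[s]) - np.outer(pi[s], pi[s])
        grad[s] = d[s] * (dpi @ Q[s])
    return grad

def gradient_via_finite_differences(theta, eps=1e-6):
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        t1, t2 = theta.copy(), theta.copy()
        t1[idx] += eps; t2[idx] -= eps
        grad[idx] = (rho(t1) - rho(t2)) / (2 * eps)
    return grad

print(np.max(np.abs(gradient_via_theorem(theta)
                    - gradient_via_finite_differences(theta))))  # ~0
```

Note that the gradient in (2) weights each state by $d^\pi(s)$ but contains no term for how the state distribution itself shifts with $\theta$, which is exactly what the next paragraph emphasizes.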

This way of expressing the gradient was first discussed for the average-reward formulation by Marbach and Tsitsiklis (1998), based on a related expression in terms of the state-value function due to Jaakkola, Singh, and Jordan (1995) and Cao and Chen (1997). We extend their results to the start-state formulation and provide simpler and more direct proofs. Williams's (1988, 1992) theory of REINFORCE algorithms can also be viewed as implying (2).

In any event, the key aspect of both expressions for the gradient is that there are no terms of the form $\frac{\partial d^\pi(s)}{\partial\theta}$: the effect of policy changes on the distribution of states does not appear. This is convenient for approximating the gradient by sampling. For example, if $s$ was sampled from the distribution obtained by following $\pi$, then $\sum_a \frac{\partial\pi(s,a)}{\partial\theta} Q^\pi(s,a)$ would be an unbiased estimate of $\frac{\partial\rho}{\partial\theta}$. Of course, $Q^\pi(s,a)$ is also not normally known and must be estimated. One approach is to use the actual returns, $R_t = \sum_{k=1}^{\infty} r_{t+k} - \rho(\pi)$ (or $R_t = \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k}$ in the start-state formulation), as an approximation for each $Q^\pi(s_t,a_t)$. This leads to Williams's episodic REINFORCE algorithm, $\Delta\theta_t \propto \frac{\partial\pi(s_t,a_t)}{\partial\theta} R_t \frac{1}{\pi(s_t,a_t)}$ (the $\frac{1}{\pi(s_t,a_t)}$ corrects for the oversampling of actions preferred by $\pi$), which is known to follow $\frac{\partial\rho}{\partial\theta}$ in expected value (Williams, 1988, 1992).
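
The following is a minimal sketch of episodic REINFORCE as just described, using a tabular softmax policy on a made-up chain task with $\gamma = 1$: each parameter increment is proportional to $\frac{\partial\pi(s_t,a_t)}{\partial\theta} R_t \frac{1}{\pi(s_t,a_t)} = \nabla_\theta \log\pi(s_t,a_t)\, R_t$. The environment, step size, and episode count are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of Williams's episodic REINFORCE with a tabular softmax
# policy on a small made-up episodic chain task (gamma = 1).

rng = np.random.default_rng(4)
n_states, goal, max_steps, alpha = 5, 4, 20, 0.1
theta = np.zeros((n_states, 2))          # preferences for actions {left, right}

def policy(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def step(s, a):
    """Move left (a=0) or right (a=1); reward 1 and terminate at the goal."""
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s2, float(s2 == goal), s2 == goal

for episode in range(2000):
    s, traj = 0, []
    for t in range(max_steps):
        probs = policy(s)
        a = rng.choice(2, p=probs)
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
        if done:
            break
    # Accumulate the REINFORCE gradient over the episode, then update theta.
    grad, G = np.zeros_like(theta), 0.0
    for s, a, r in reversed(traj):
        G += r                                       # return following time t
        grad_log = -policy(s); grad_log[a] += 1.0    # grad of log pi(s, a) wrt theta[s]
        grad[s] += grad_log * G                      # (dpi/dtheta) * G / pi(s, a)
    theta += alpha * grad

print(policy(0))   # the probability of moving right should now be close to 1
```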

2 Policy Gradient with Approximation

REFERENCES


Source: blog.csdn.net/weixin_43590290/article/details/100174831