Deep Reinforcement Learning Methods Based on Stochastic Policies


  Before we discuss methods based on stochastic policies, we need to understand the Policy Gradient approach. Within Policy Gradient there is a very important theorem: the Policy Gradient Theorem.

  • Theorem

  For any differentiable policy $\pi_{\theta}(a|s)$, and for any of the policy objective functions $J = J_{1}, J_{avR}, J_{avV}$, the policy gradient is:

$$\frac{\partial J(\theta)}{\partial \theta} = \mathbb{E}_{\pi_{\theta}}\left[\frac{\partial \log \pi_{\theta}(a|s)}{\partial \theta}\, Q^{\pi_{\theta}}(s,a)\right]$$

  The formula above is also the core gradient formula of stochastic policy methods. You do not need to prove where it comes from, but you do need to understand the idea behind it.
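  As a concrete illustration of the theorem, here is a minimal PyTorch-style sketch (not from the original post; the network, batch shapes, and the way $Q^{\pi_{\theta}}(s,a)$ is estimated are all assumptions) that builds a loss whose gradient is the sampled version of the expectation above:

```python
# A minimal sketch of estimating the Policy Gradient Theorem's expectation
# from sampled data.  Everything here is a placeholder for illustration.
import torch
import torch.nn as nn

n_states, n_actions = 4, 3
policy_net = nn.Linear(n_states, n_actions)          # f_theta(s, .): action scores

def policy_gradient_loss(states, actions, q_values):
    """Loss whose gradient is  -E[ d(log pi(a|s))/d(theta) * Q(s, a) ]."""
    logits = policy_net(states)                       # (B, n_actions)
    log_pi = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Minimising the negative of log_pi * Q ascends the expected return J(theta).
    return -(log_pi * q_values).mean()

# Fake batch standing in for sampled (s_t, a_t) and estimated Q^{pi_theta}(s_t, a_t).
states   = torch.randn(32, n_states)
actions  = torch.randint(0, n_actions, (32,))
q_values = torch.randn(32)                            # placeholder critic estimates

loss = policy_gradient_loss(states, actions, q_values)
loss.backward()                                       # gradients now match the theorem
```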

Policy Network Gradients

  If we approximate the policy with a neural network and use a Softmax in the last layer, the output action probabilities can be written in the following functional form:

$$\pi_{\theta}(a|s) = \frac{e^{f_{\theta}(s,a)}}{\sum_{a'} e^{f_{\theta}(s,a')}}$$

  where $f_{\theta}(s,a)$ is the score function of a state-action pair, parametrized by $\theta$, and can be implemented with a neural network.

  The gradient of its log-form can be expressed as:

$$\begin{aligned} \frac{\partial \log \pi_{\theta}(a|s)}{\partial \theta} &= \frac{\partial f_{\theta}(s,a)}{\partial \theta} - \frac{1}{\sum_{a'} e^{f_{\theta}(s,a')}} \sum_{a''} e^{f_{\theta}(s,a'')} \frac{\partial f_{\theta}(s,a'')}{\partial \theta} \\ &= \frac{\partial f_{\theta}(s,a)}{\partial \theta} - \mathbb{E}_{a' \sim \pi_{\theta}(a'|s)}\left[\frac{\partial f_{\theta}(s,a')}{\partial \theta}\right] \end{aligned}$$

  The last equality above is interesting: it is the gradient of the score function with respect to a specific action $a$, minus the expected gradient over all actions. (It is worth thinking about the intuition behind this.)
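  The identity can also be checked numerically. The sketch below is purely illustrative and assumes a linear score function $f_{\theta}(s,a) = \theta_a \cdot s$; it compares the analytic gradient from the last equality with a finite-difference gradient of $\log \pi_{\theta}(a|s)$:

```python
# Numerical check of
#   d log pi(a|s) / d theta = d f(s,a)/d theta - E_{a'~pi}[ d f(s,a')/d theta ]
# for an assumed linear score function f_theta(s, a) = theta[a] . s.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_actions, n_states))
s = rng.normal(size=n_states)
a = 1

def log_pi(theta, s, a):
    scores = theta @ s                                   # f_theta(s, a') for every a'
    return scores[a] - np.log(np.sum(np.exp(scores)))

# Analytic gradient from the identity above.
pi = np.exp(theta @ s); pi /= pi.sum()
grad_f_a = np.zeros_like(theta); grad_f_a[a] = s         # d f(s,a)/d theta
expected_grad_f = pi[:, None] * s[None, :]               # E_{a'~pi}[ d f(s,a')/d theta ]
analytic = grad_f_a - expected_grad_f

# Finite-difference gradient of log pi for comparison.
numeric = np.zeros_like(theta); eps = 1e-6
for i in range(n_actions):
    for j in range(n_states):
        t = theta.copy(); t[i, j] += eps
        numeric[i, j] = (log_pi(t, s, a) - log_pi(theta, s, a)) / eps

print(np.max(np.abs(analytic - numeric)))                # tiny: the identity holds
```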

Looking into Policy Gradient

  In Policy Network Gradients we plugged the policy into a Softmax and differentiated it to obtain a gradient. This gradient still has to be multiplied by an advantage function before it is used for updates, so we need to revisit the Policy Gradient method here.

  • Let $R(\pi)$ denote the expected return of $\pi$:

$$R(\pi) = \mathbb{E}_{s_{0} \sim \rho_{0},\, a_{t} \sim \pi(\cdot|s_{t})}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$$

  • We collect experience data with another policy $\pi_{old}$, and want to optimize some objective to get a new, better policy $\pi$.

  Because reinforcement learning needs a large amount of data during training, we must improve data efficiency. The data collected by past policies should therefore still be reused, which also makes this an off-policy style of method.

  • Note the following useful identity:

$$R(\pi) = R(\pi_{old}) + \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty}\gamma^{t} A^{\pi_{old}}(s_{t},a_{t})\right]$$

  where $\mathbb{E}_{\tau \sim \pi}$ denotes an expectation over trajectories sampled from $\pi$, and $A^{\pi_{old}}$ is the advantage function.

  Expanding the advantage function gives:

$$A^{\pi_{old}}(s,a) = \mathbb{E}_{s' \sim \rho(s'|s,a)}\left[r(s) + \gamma V^{\pi_{old}}(s') - V^{\pi_{old}}(s)\right]$$

  You will often see more concise expressions, such as $A(s,a) = Q(s,a) - V(s)$, where $V(s)$ embodies the idea of a baseline and $A(s,a)$ expresses how good each action choice is, with $V^{\pi}(s) = \sum_{a}\pi(a|s)Q^{\pi}(s,a)$.
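  For a sampled trajectory, the one-step advantage estimate is straightforward to compute. A minimal sketch, assuming a critic that already provides $V(s_t)$ estimates and a terminal value of zero (both assumptions made for illustration):

```python
# One-step advantage estimate  A(s_t, a_t) ≈ r_t + gamma * V(s_{t+1}) - V(s_t)
# along a sampled trajectory, with V(s_T) = 0 at the terminal state.
import numpy as np

def one_step_advantages(rewards, values, gamma=0.99):
    """rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T), i.e. length T + 1."""
    rewards, values = np.asarray(rewards), np.asarray(values)
    return rewards + gamma * values[1:] - values[:-1]

rewards = [1.0, 0.0, 2.0]
values  = [0.5, 0.7, 1.0, 0.0]        # critic estimates, terminal value is 0
print(one_step_advantages(rewards, values))
```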

Proof

$$R(\pi) = R(\pi_{old}) + \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty}\gamma^{t} A^{\pi_{old}}(s_{t},a_{t})\right]$$

  Proof of the equality above (the second step uses the fact that the terms $\gamma^{t+1}V^{\pi_{old}}(s_{t+1}) - \gamma^{t}V^{\pi_{old}}(s_{t})$ telescope, leaving only $-V^{\pi_{old}}(s_{0})$):

$$\begin{aligned} \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} A^{\pi_{old}}(s_{t}, a_{t})\right] &= \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r(s_{t}) + \gamma V^{\pi_{old}}(s_{t+1}) - V^{\pi_{old}}(s_{t})\right)\right] \\ &= \mathbb{E}_{\tau \sim \pi}\left[-V^{\pi_{old}}(s_{0}) + \sum_{t=0}^{\infty} \gamma^{t} r(s_{t})\right] \\ &= -\mathbb{E}_{s_{0}}\left[V^{\pi_{old}}(s_{0})\right] + \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_{t})\right] \\ &= -R(\pi_{old}) + R(\pi) \end{aligned}$$

  What is the intuitive interpretation of the formula above? It amounts to making a small improvement on top of the old policy: the advantage function used for the improvement is derived from the old policy, but the actions are chosen by the new policy $\pi$, so the immediate rewards correct the value $V$ that was estimated under the old policy.
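  The telescoping step used in the proof is easy to verify numerically. Below is a tiny illustrative check; the value estimates are made up, and the only assumption is a finite trajectory whose terminal value is zero:

```python
# Check of the telescoping step used in the proof:
#   sum_t gamma^t * (gamma * V(s_{t+1}) - V(s_t)) = -V(s_0)
# for a finite trajectory whose terminal value V(s_T) is 0.
import numpy as np

gamma = 0.9
V = np.array([1.3, 0.4, 2.0, -0.7, 0.0])   # arbitrary V(s_0)..V(s_T), with V(s_T) = 0
T = len(V) - 1

telescoped = sum(gamma**t * (gamma * V[t + 1] - V[t]) for t in range(T))
print(telescoped, -V[0])                    # both ≈ -1.3 = -V(s_0)
```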

  • S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. ICML, 2002.

More on the Policy Expected Return

  Let us analyze the advantage function once more:

$$A^{\pi_{old}}(s,a) = \mathbb{E}_{s' \sim \rho(s'|s,a)}\left[r(s) + \gamma V^{\pi_{old}}(s') - V^{\pi_{old}}(s)\right]$$

  We want to manipulate $R(\pi)$ into an objective that can be estimated from data:

$$\begin{aligned} R(\pi) &= R(\pi_{old}) + \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} A^{\pi_{old}}(s_{t}, a_{t})\right] \\ &= R(\pi_{old}) + \sum_{t=0}^{\infty} \sum_{s} P(s_{t}=s\,|\,\pi) \sum_{a} \pi(a|s)\, \gamma^{t} A^{\pi_{old}}(s,a) \\ &= R(\pi_{old}) + \sum_{s} \sum_{t=0}^{\infty} \gamma^{t} P(s_{t}=s\,|\,\pi) \sum_{a} \pi(a|s)\, A^{\pi_{old}}(s,a) \\ &= R(\pi_{old}) + \sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a|s)\, A^{\pi_{old}}(s,a) \end{aligned}$$

  where $\rho_{\pi}(s) = \sum_{t=0}^{\infty}\gamma^{t}P(s_{t}=s\,|\,\pi)$ is the discounted visitation frequency of state $s$ under the current policy $\pi$: the probability of being in $s$ at each time step $t$, weighted by $\gamma^{t}$.
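  To make $\rho_{\pi}(s)$ concrete, here is a toy sketch on an assumed 3-state Markov chain (the transition matrix and initial distribution are made up for illustration); it compares the truncated sum $\sum_t \gamma^{t} P(s_{t}=s\,|\,\pi)$ with the closed form $\rho_0 (I - \gamma P_{\pi})^{-1}$:

```python
# Discounted visitation frequency  rho_pi(s) = sum_t gamma^t P(s_t = s | pi)
# on a toy 3-state chain induced by some fixed policy pi.
import numpy as np

gamma = 0.9
rho0 = np.array([1.0, 0.0, 0.0])            # initial state distribution
P_pi = np.array([[0.5, 0.5, 0.0],           # state-to-state transitions under pi
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])

# Closed form: rho_pi = rho0 (I - gamma P_pi)^{-1}, i.e. sum_t gamma^t rho0 P_pi^t.
rho_closed = rho0 @ np.linalg.inv(np.eye(3) - gamma * P_pi)

# Same quantity by truncating the sum at a large horizon.
rho_sum, d = np.zeros(3), rho0.copy()
for t in range(1000):
    rho_sum += gamma**t * d                 # add gamma^t * P(s_t = . | pi)
    d = d @ P_pi                            # advance the state distribution one step

print(rho_closed, rho_sum)                  # the two agree
```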

  However, the equation above still requires sampling data from the new policy $\pi$; whenever the policy changes, data must be re-sampled before the next update. To solve this problem, we introduce importance sampling.

$$\begin{aligned} R(\pi) &= R(\pi_{old}) + \sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a|s)\, A^{\pi_{old}}(s,a) \\ &= R(\pi_{old}) + \mathbb{E}_{s \sim \pi,\, a \sim \pi}\left[A^{\pi_{old}}(s,a)\right] \\ &= R(\pi_{old}) + \mathbb{E}_{s \sim \pi,\, a \sim \pi_{old}}\left[\frac{\pi(a|s)}{\pi_{old}(a|s)} A^{\pi_{old}}(s,a)\right] \end{aligned}$$

  That is, states $s$ are sampled from the new policy, while actions $a$ are sampled from the old policy. This step is an exact identity, but $s$ still has to be sampled from the new policy.

Surrogate Loss Function

  • Define a surrogate loss function based on sampled data that ignores the change in state distribution:

$$L(\pi) = \mathbb{E}_{s \sim \pi_{old},\, a \sim \pi_{old}}\left[\frac{\pi(a|s)}{\pi_{old}(a|s)} A^{\pi_{old}}(s,a)\right]$$

  Now the data can be obtained entirely by sampling from the old policy. The only difference between this surrogate loss function and the previous objective is which policy the states are sampled from. The condition for this substitution to be valid is that the old policy and the new policy do not differ too much.
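  Here is a hedged sketch of how $L(\pi)$ could be estimated from a batch collected by $\pi_{old}$; the network and the placeholder batch are assumptions, and only the ratio-times-advantage form comes from the derivation above:

```python
# Surrogate objective L(pi) estimated from data collected entirely by pi_old.
import torch
import torch.nn as nn

n_states, n_actions = 4, 3
policy_net = nn.Linear(n_states, n_actions)

def surrogate_loss(states, actions, advantages, log_pi_old):
    """ -E_{s,a ~ pi_old}[ pi_theta(a|s) / pi_old(a|s) * A^{pi_old}(s,a) ] """
    dist = torch.distributions.Categorical(logits=policy_net(states))
    ratio = torch.exp(dist.log_prob(actions) - log_pi_old)   # pi_theta / pi_old
    return -(ratio * advantages).mean()                      # negate: we minimise

# Batch "collected earlier" with pi_old (placeholders here).
states     = torch.randn(32, n_states)
actions    = torch.randint(0, n_actions, (32,))
advantages = torch.randn(32)                                 # A^{pi_old}(s, a) estimates
with torch.no_grad():                                        # theta currently equals theta_old
    log_pi_old = torch.distributions.Categorical(
        logits=policy_net(states)).log_prob(actions)

surrogate_loss(states, actions, advantages, log_pi_old).backward()
```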

  • Summary

  Let us now summarize the process above.

  We started with a target function:

$$R(\pi) = R(\pi_{old}) + \mathbb{E}_{s \sim \pi,\, a \sim \pi}\left[A^{\pi_{old}}(s,a)\right]$$

  Then, via importance sampling (and the surrogate approximation for the state distribution), we made it something that can be sampled under the old policy:

$$L(\pi) = \mathbb{E}_{s \sim \pi_{old},\, a \sim \pi_{old}}\left[\frac{\pi(a|s)}{\pi_{old}(a|s)} A^{\pi_{old}}(s,a)\right]$$

  Then we take its gradient:

$$\begin{aligned} \left.\nabla_{\theta} L(\pi_{\theta})\right|_{\theta_{old}} &= \left.\mathbb{E}_{s \sim \pi_{old},\, a \sim \pi_{old}}\left[\frac{\nabla_{\theta} \pi_{\theta}(a|s)}{\pi_{old}(a|s)} A^{\pi_{old}}(s,a)\right]\right|_{\theta_{old}} \\ &= \left.\mathbb{E}_{s \sim \pi_{old},\, a \sim \pi_{old}}\left[\frac{\pi_{\theta}(a|s)\, \nabla_{\theta} \log \pi_{\theta}(a|s)}{\pi_{old}(a|s)} A^{\pi_{old}}(s,a)\right]\right|_{\theta_{old}} \\ &= \left.\mathbb{E}_{s \sim \pi_{old},\, a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a|s)\, A^{\pi_{old}}(s,a)\right]\right|_{\theta_{old}} \\ &= \left.\nabla_{\theta} R(\pi_{\theta})\right|_{\theta_{old}} \end{aligned}$$
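  The last line says that at $\theta = \theta_{old}$ the surrogate has the same gradient as the true return, which is what justifies improving $L$ locally. Below is a small autograd check of this equality on a shared batch; all the data is placeholder and the linear "network" is an assumption for illustration:

```python
# Check that at theta = theta_old the gradient of the surrogate (ratio * A)
# equals the sampled policy gradient  E[ grad log pi_theta(a|s) * A(s,a) ].
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(4, 3)
states  = torch.randn(16, 4)
actions = torch.randint(0, 3, (16,))
adv     = torch.randn(16)

dist = torch.distributions.Categorical(logits=net(states))
log_pi_old = dist.log_prob(actions).detach()     # theta is currently theta_old

# Gradient of the surrogate (importance ratio times advantage).
ratio_obj = (torch.exp(dist.log_prob(actions) - log_pi_old) * adv).mean()
g_surrogate = torch.autograd.grad(ratio_obj, net.weight, retain_graph=True)[0]

# Gradient of the plain log-likelihood-times-advantage objective.
logp_obj = (dist.log_prob(actions) * adv).mean()
g_logp = torch.autograd.grad(logp_obj, net.weight)[0]

print(torch.allclose(g_surrogate, g_logp, atol=1e-6))   # True
```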



Reposted from blog.csdn.net/weixin_39059031/article/details/104504472