Machine Learning - Whiteboard Derivation Series Notes (35): Dynamic Programming (DP)

This article is mainly based on the whiteboard derivation video by shuhuai008 on Bilibili: 动态学习_108min.

Index of all notes: Machine Learning - Whiteboard Derivation Series Notes

I. Policy Iteration

(i) Policy Evaluation

Known: the MDP dynamics $P(s',r\mid s,a)$.

Given a policy $\pi$, compute $V_\pi(s)$ for every $s\in S$.

$$V_\pi=\begin{pmatrix} V_\pi(s_1)\\ V_\pi(s_2)\\ \vdots\\ V_\pi(s_{|S|}) \end{pmatrix}_{|S|\times 1}\qquad r_\pi=\begin{pmatrix} r_\pi(s_1)\\ r_\pi(s_2)\\ \vdots\\ r_\pi(s_{|S|}) \end{pmatrix}_{|S|\times 1}\qquad P_\pi\triangleq\big[P_\pi(s,s')\big]_{|S|\times|S|}$$

Expanding the definition of the state value,

$$\begin{aligned} V_\pi(s)&=E_\pi[G_t\mid S_t=s]\\ &=E_\pi[R_{t+1}+\gamma V_\pi(s_{t+1})\mid s_t=s]\\ &=\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V_\pi(s')\big]\\ &=\underbrace{\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\,r}_{(1)}+\underbrace{\gamma\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\,V_\pi(s')}_{(2)} \end{aligned}$$

$$\begin{aligned} (1)&=\sum_a\pi(a\mid s)\cdot\underbrace{\sum_{r}r\,p(r\mid s,a)}_{r(s,a)\,\triangleq\,E[R_{t+1}\mid s_t=s,\,A_t=a]}\\ &=\sum_a\pi(a\mid s)\cdot r(s,a)\ \triangleq\ r_\pi(s) \end{aligned}$$

$$\begin{aligned} (2)&=\gamma\sum_a\pi(a\mid s)\sum_{s'}p(s'\mid s,a)V_\pi(s')\\ &=\gamma\sum_{s'}\underbrace{\sum_a\pi(a\mid s)\,p(s'\mid s,a)}_{P_\pi(s,s')}V_\pi(s')\\ &=\gamma\sum_{s'}P_\pi(s,s')V_\pi(s') \end{aligned}$$

Therefore,

$$V_\pi(s)=r_\pi(s)+\gamma\sum_{s'}P_\pi(s,s')V_\pi(s')$$
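As a concrete illustration of this reduction from $p(s',r\mid s,a)$ and $\pi$ to $r_\pi$ and $P_\pi$, here is a minimal NumPy sketch. The array layout is an assumption made for illustration only: `p[s, a, s2, k]` stores $P(s'=s_2, r=\texttt{rewards}[k]\mid s,a)$ over a finite reward set `rewards`, and `pi[s, a]` stores $\pi(a\mid s)$.

```python
import numpy as np

def induced_mrp(p, rewards, pi):
    """Collapse an MDP plus a fixed policy pi into the induced Markov reward process.

    p       : array (S, A, S, R), p[s, a, s2, k] = P(s'=s2, r=rewards[k] | s, a)  (assumed layout)
    rewards : array (R,), the finite set of reward values
    pi      : array (S, A), pi[s, a] = pi(a | s)

    Returns r_pi (shape (S,)) and P_pi (shape (S, S)).
    """
    # r(s, a) = sum_{s', r} p(s', r | s, a) * r
    r_sa = np.einsum('sazk,k->sa', p, rewards)
    # r_pi(s) = sum_a pi(a | s) * r(s, a)
    r_pi = np.einsum('sa,sa->s', pi, r_sa)
    # p(s' | s, a) = sum_r p(s', r | s, a);  P_pi(s, s') = sum_a pi(a | s) * p(s' | s, a)
    p_ss = p.sum(axis=3)
    P_pi = np.einsum('sa,saz->sz', pi, p_ss)
    return r_pi, P_pi
```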

1. Analytical Solution

Solve it in matrix form.

Let $s_i\triangleq s$ and $s_j\triangleq s'$; then

$$V_\pi(s_i)=r_\pi(s_i)+\gamma\sum_{j=1}^{|S|}P_\pi(s_i,s_j)V_\pi(s_j)$$

In matrix form,

$$V_\pi=r_\pi+\gamma P_\pi V_\pi\ \Longrightarrow\ (I-\gamma P_\pi)V_\pi=r_\pi\ \Longrightarrow\ V_\pi=(I-\gamma P_\pi)^{-1}r_\pi$$

The complexity is $O(|S|^3)$.
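Given $r_\pi$ and $P_\pi$ (for example from the hypothetical `induced_mrp` sketch above), the closed form is a single linear solve; `np.linalg.solve` avoids forming the inverse explicitly.

```python
import numpy as np

def policy_evaluation_exact(r_pi, P_pi, gamma):
    """Solve (I - gamma * P_pi) V_pi = r_pi directly; cost is O(|S|^3)."""
    S = r_pi.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
```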

2. Iterative Solution (Numerical Solution)

Construct a sequence $\{V_k\}$ that converges to $V_\pi$ (the convergence proof is set aside for now):

$$\lim_{k\rightarrow\infty}V_k=V_\pi$$

From the Bellman expectation equation, define the update

$$V_{k+1}(s)\triangleq\sum_a\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V_k(s')\big]$$
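A minimal sketch of this iterative policy evaluation, under the same assumed array layout as above (`p[s, a, s', k]`, `rewards`, `pi[s, a]`), sweeping all states until the update falls below a tolerance:

```python
import numpy as np

def policy_evaluation_iterative(p, rewards, pi, gamma, tol=1e-8, max_iter=10_000):
    """Repeatedly apply the Bellman expectation backup V_{k+1} until convergence."""
    S = p.shape[0]
    V = np.zeros(S)
    r_sa = np.einsum('sazk,k->sa', p, rewards)   # r(s, a)
    p_ss = p.sum(axis=3)                         # p(s' | s, a)
    for _ in range(max_iter):
        # q_k(s, a) = r(s, a) + gamma * sum_{s'} p(s' | s, a) V_k(s')
        q = r_sa + gamma * np.einsum('saz,z->sa', p_ss, V)
        # V_{k+1}(s) = sum_a pi(a | s) q_k(s, a)
        V_new = np.einsum('sa,sa->s', pi, q)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```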

(ii) Policy Improvement

Policy improvement corresponds to the Bellman optimality equation.

1. Policy Improvement Theorem

Given policies $\pi$ and $\pi'$: if $q_\pi(s,\pi'(s))\ge V_\pi(s)$ for all $s\in S$, then $V_{\pi'}(s)\ge V_\pi(s)$ for all $s\in S$.

Proof:

For all $s\in S$ (recall that $q_\pi(s,a)=\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V_\pi(s')\big]=E\big[R_{t+1}+\gamma V_\pi(s_{t+1})\mid s_t=s,A_t=a\big]$):

$$\begin{aligned} V_\pi(s)&\le q_\pi(s,\pi'(s))\\ &=E\big[R_{t+1}+\gamma V_\pi(s_{t+1})\mid s_t=s,A_t=\pi'(s)\big]\\ &=E_{\pi'}\big[R_{t+1}+\gamma V_\pi(s_{t+1})\mid s_t=s\big]\\ &\le E_{\pi'}\big[R_{t+1}+\gamma\, q_\pi(s_{t+1},\pi'(s_{t+1}))\mid s_t=s\big]\\ &=E_{\pi'}\Big[R_{t+1}+\gamma\, E_{\pi'}\big[R_{t+2}+\gamma V_\pi(s_{t+2})\mid s_{t+1}\big]\,\Big|\,s_t=s\Big]\\ &=E_{\pi'}\Big[R_{t+1}+\gamma\, E_{\pi'}\big[R_{t+2}\mid s_{t+1}\big]+\gamma^2 E_{\pi'}\big[V_\pi(s_{t+2})\mid s_{t+1}\big]\,\Big|\,s_t=s\Big]\\ &=E_{\pi'}\big[R_{t+1}+\gamma R_{t+2}+\gamma^2 V_\pi(s_{t+2})\mid s_t=s\big]\\ &\ \ \vdots\\ &\le E_{\pi'}\Big[\underbrace{R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots}_{G_t}\,\Big|\,s_t=s\Big]\\ &=V_{\pi'}(s) \end{aligned}$$

The second inequality applies the hypothesis $V_\pi(s_{t+1})\le q_\pi(s_{t+1},\pi'(s_{t+1}))$, and repeating the same expansion yields the final inequality.

2. Greedy Policy


Greedy policy: $\forall s\in S,\ \pi'(s)=\underset{a}{\arg\max}\ q_\pi(s,a)$

$$\forall s\in S,\quad V_\pi(s)\le\max_a q_\pi(s,a)=q_\pi(s,\pi'(s))$$

By the policy improvement theorem, $\forall s\in S,\ V_{\pi'}(s)\ge V_\pi(s)$.
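A sketch of this greedy improvement step, reusing the hypothetical array layout from the earlier policy-evaluation sketches (`p[s, a, s', k]`, `rewards`) together with a value estimate `V`: compute $q_\pi(s,a)$ from $V_\pi$ and take the per-state argmax.

```python
import numpy as np

def greedy_improvement(p, rewards, V, gamma):
    """Greedy policy improvement: pi'(s) = argmax_a q_pi(s, a), computed from V_pi.

    Returns a deterministic policy encoded as a one-hot (S, A) array.
    """
    r_sa = np.einsum('sazk,k->sa', p, rewards)                    # r(s, a)
    q = r_sa + gamma * np.einsum('saz,z->sa', p.sum(axis=3), V)   # q_pi(s, a)
    best = q.argmax(axis=1)                                       # greedy action per state
    pi_new = np.zeros_like(q)
    pi_new[np.arange(q.shape[0]), best] = 1.0
    return pi_new
```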

If $V_{\pi'}=V_\pi$, then $V_{\pi'}=V_\pi=V_*$. Proof:

$$V_{\pi'}=V_\pi\ \Longrightarrow\ q_{\pi'}=q_\pi$$

$$\begin{aligned} \forall s\in S,\quad V_{\pi'}(s)&=\sum_a\pi'(a\mid s)\,q_{\pi'}(s,a)\\ &=\sum_a\pi'(a\mid s)\,q_{\pi}(s,a)\\ &=q_\pi(s,\pi'(s))\\ &=\max_a q_\pi(s,a)\\ &=\max_a\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V_\pi(s')\big]\\ &=\max_a\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V_{\pi'}(s')\big] \end{aligned}$$

Therefore,

$$V_{\pi'}(s)=\max_a\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V_{\pi'}(s')\big]$$

which is exactly the Bellman optimality equation

$$V_*(s)=\max_a\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V_*(s')\big]$$

Since the Bellman optimality equation has a unique solution, $V_{\pi'}=V_*$, and hence $V_{\pi'}=V_\pi=V_*$.
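Putting the two phases together, a minimal policy-iteration loop (assuming the hypothetical `policy_evaluation_iterative` and `greedy_improvement` sketches above) alternates evaluation and greedy improvement until the policy stops changing, at which point $V_{\pi'}=V_\pi=V_*$ by the argument just given.

```python
import numpy as np

def policy_iteration(p, rewards, gamma, tol=1e-8):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    S, A = p.shape[0], p.shape[1]
    pi = np.full((S, A), 1.0 / A)            # start from the uniform random policy
    while True:
        V = policy_evaluation_iterative(p, rewards, pi, gamma, tol)
        pi_new = greedy_improvement(p, rewards, V, gamma)
        if np.allclose(pi_new, pi):          # policy stable: V_{pi'} = V_pi = V_*
            return pi_new, V
        pi = pi_new
```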

II. Value Iteration

Value iteration is policy iteration in the extreme case: policy evaluation is run for only a single sweep, followed immediately by policy improvement.

Although value iteration takes only one evaluation step, that step still updates every state in $S$. Simpler still is in-place iteration (a special case of asynchronous dynamic programming), which selects just one state of $S$ at a time and performs a one-step update.
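A minimal value-iteration sketch under the same assumed array layout as before (`p[s, a, s', k]`, `rewards`): each sweep applies the Bellman optimality backup $V_{k+1}(s)=\max_a\sum_{s',r}p(s',r\mid s,a)\big[r+\gamma V_k(s')\big]$ to every state, then the greedy policy is read off from the converged values.

```python
import numpy as np

def value_iteration(p, rewards, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration: one sweep of the Bellman optimality backup per step."""
    S = p.shape[0]
    V = np.zeros(S)
    r_sa = np.einsum('sazk,k->sa', p, rewards)   # r(s, a)
    p_ss = p.sum(axis=3)                         # p(s' | s, a)
    for _ in range(max_iter):
        q = r_sa + gamma * np.einsum('saz,z->sa', p_ss, V)
        V_new = q.max(axis=1)                    # V_{k+1}(s) = max_a q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # extract the greedy (optimal) actions from the converged value function
    best = (r_sa + gamma * np.einsum('saz,z->sa', p_ss, V)).argmax(axis=1)
    return V, best
```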

The end! Once again, the index of all notes: Machine Learning - Whiteboard Derivation Series Notes

Reposted from blog.csdn.net/qq_41485273/article/details/112397426