This article mainly follows Bilibili creator shuhuai008's whiteboard-derivation video: 动态学习_108min
Summary post of all the notes: 机器学习-白板推导系列笔记
一、Policy Iteration
(一) Policy Evaluation
Given an MDP with known dynamics $P(s',r|s,a)$.
Given a policy $\pi$, find $V_\pi(s)\;\;(\forall s\in S)$.
Define
$$V_\pi=\begin{pmatrix} V_\pi(s_1) \\ V_\pi(s_2) \\ \vdots \\ V_\pi(s_{|S|}) \end{pmatrix}_{|S|\times 1}\qquad r_\pi=\begin{pmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ \vdots \\ r_\pi(s_{|S|}) \end{pmatrix}_{|S|\times 1}\qquad P_\pi\triangleq\big[P_\pi(s,s')\big]_{|S|\times|S|}$$
Thus,
$$\begin{aligned} V_\pi(s)&=E_\pi[G_t|S_t=s]\\ &=E_\pi[R_{t+1}+\gamma V_{\pi}(S_{t+1})|S_t=s]\\ &=\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma V_\pi(s')]\\ &=\underbrace{\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\,r}_{(1)}+\underbrace{\gamma\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)V_\pi(s')}_{(2)} \end{aligned}$$
$$\begin{aligned} (1)&=\sum_a\pi(a|s)\cdot\underbrace{\sum_{r}r\,p(r|s,a)}_{r(s,a)\triangleq E[R_{t+1}|S_t=s,A_t=a]}\\ &=\sum_a\pi(a|s)\cdot r(s,a)\\ &\triangleq r_\pi(s) \end{aligned}$$
$$\begin{aligned} (2)&=\gamma\sum_a\pi(a|s)\sum_{s'}p(s'|s,a)V_\pi(s')\\ &=\gamma\sum_{s'}\underbrace{\sum_a\pi(a|s)p(s'|s,a)}_{P_\pi(s,s')}V_\pi(s')\\ &=\gamma\sum_{s'}P_\pi(s,s')V_\pi(s') \end{aligned}$$
Therefore,
$$V_\pi(s)=r_\pi(s)+\gamma\sum_{s'}P_\pi(s,s')V_\pi(s')$$
1. Analytical Solution
Solve the equation in matrix form.
Let $s_i\triangleq s,\;s_j\triangleq s'$, which gives:
$$V_\pi(s_i)=r_\pi(s_i)+\gamma\sum_{j=1}^{|S|}P_\pi(s_i,s_j)V_\pi(s_j)$$
$$V_\pi=r_\pi+\gamma P_\pi V_\pi$$
$$(I-\gamma P_\pi)V_\pi=r_\pi$$
$$V_\pi=(I-\gamma P_\pi)^{-1}r_\pi$$
The complexity is $O(|S|^3)$, dominated by the matrix inversion.
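As a concrete illustration, here is a minimal NumPy sketch of the analytical solution on a toy 2-state MDP. The matrix `P_pi` and vector `r_pi` below are made-up numbers, not from the video:

```python
import numpy as np

# Toy 2-state MDP under a fixed policy pi (hypothetical numbers):
# P_pi[i, j] = P_pi(s_i, s_j) = sum_a pi(a|s_i) p(s_j|s_i, a)
# r_pi[i]    = sum_a pi(a|s_i) r(s_i, a)
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
r_pi = np.array([1.0, 2.0])
gamma = 0.9

# Analytical solution: V_pi = (I - gamma * P_pi)^{-1} r_pi.
# linalg.solve is preferred over explicitly inverting the matrix.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(V_pi)
```

The solution can be checked by substituting it back into the Bellman expectation equation $V_\pi = r_\pi + \gamma P_\pi V_\pi$.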
2. Iterative Solution (Numerical)
Construct a sequence $\{V_k\}$ that converges to $V_\pi$; we set aside the proof of convergence for now.
$$\lim_{k\rightarrow\infty}V_k=V_\pi$$
From the Bellman equation we obtain the update:
$$V_{k+1}(s)\triangleq \sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma V_k(s')]$$
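The iterative update above can be sketched in vectorized form, $V_{k+1}=r_\pi+\gamma P_\pi V_k$. The same hypothetical 2-state numbers as before are used:

```python
import numpy as np

# Same toy MDP as above (hypothetical numbers).
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])
r_pi = np.array([1.0, 2.0])
gamma = 0.9

V = np.zeros(2)                    # V_0 can be initialized arbitrarily
for _ in range(1000):
    V_new = r_pi + gamma * P_pi @ V    # one sweep of V_{k+1} = r_pi + gamma P_pi V_k
    if np.max(np.abs(V_new - V)) < 1e-10:   # stop once a sweep barely changes V
        break
    V = V_new
print(V)
```

With $\gamma<1$ the update is a contraction, so the iterates approach the analytical solution $(I-\gamma P_\pi)^{-1}r_\pi$.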
(二) Policy Improvement
Policy improvement corresponds to the Bellman optimality equation.
1. Policy Improvement Theorem
Given policies $\pi,\pi'$: if $\forall s\in S,\;q_\pi(s,\pi'(s))\ge V_\pi(s)$, then $\forall s\in S,\;V_{\pi'}(s)\ge V_\pi(s)$.
Proof:
For all $s\in S$, recall that
$$q_\pi(s,a)=\sum_{s',r}p(s',r|s,a)[r+\gamma V_\pi(s')]=E[R_{t+1}+\gamma V_\pi(S_{t+1})|S_t=s,A_t=a]$$
$$\begin{aligned} V_\pi(s)&\le q_\pi(s,\pi'(s))\\ &=E[R_{t+1}+\gamma V_\pi(S_{t+1})|S_t=s,A_t=\pi'(s)]\\ &=E_{\pi'}[R_{t+1}+\gamma V_\pi(S_{t+1})|S_t=s]\\ &\le E_{\pi'}[R_{t+1}+\gamma q_\pi(S_{t+1},\pi'(S_{t+1}))|S_t=s]\\ &=E_{\pi'}\Big[R_{t+1}+\gamma E_{\pi'}[R_{t+2}+\gamma V_\pi(S_{t+2})|S_{t+1}]\,\Big|\,S_t=s\Big]\\ &=E_{\pi'}\Big[R_{t+1}+\gamma E_{\pi'}[R_{t+2}|S_{t+1}]+\gamma^2 E_{\pi'}[V_\pi(S_{t+2})|S_{t+1}]\,\Big|\,S_t=s\Big]\\ &=E_{\pi'}\Big[R_{t+1}+\gamma R_{t+2}+\gamma^2 V_\pi(S_{t+2})\,\Big|\,S_t=s\Big]\\ &\;\;\vdots\\ &\le E_{\pi'}\Big[\underbrace{R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots}_{G_t}\,\Big|\,S_t=s\Big]\\ &=V_{\pi'}(s) \end{aligned}$$
2. Greedy Policy
Greedy policy: $\forall s\in S,\;\pi'(s)=\arg\max_a q_\pi(s,a)$
$$\forall s\in S,\;\;V_\pi(s)\le \max_a q_\pi(s,a)=q_\pi(s,\pi'(s))$$
By the policy improvement theorem: $\forall s\in S,\;V_{\pi'}(s)\ge V_\pi(s)$
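The greedy improvement step can be sketched as follows. `P`, `R`, and `V_pi` below are hypothetical tabular-MDP arrays (shapes `[A, S, S]` and `[S, A]`), not taken from the source:

```python
import numpy as np

# Hypothetical tabular MDP: P[a, s, s'] = p(s'|s, a), R[s, a] = r(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transitions under action 0
              [[0.5, 0.5], [0.6, 0.4]]])   # transitions under action 1
R = np.array([[1.0, 0.0],
              [2.0, 1.5]])
gamma = 0.9
V_pi = np.array([5.0, 6.0])    # assumed output of policy evaluation

# q_pi(s, a) = r(s, a) + gamma * sum_{s'} p(s'|s, a) V_pi(s')
# (P @ V_pi) has shape [A, S]; transpose to get q[s, a].
q = R + gamma * (P @ V_pi).T

# Greedy policy: pi'(s) = argmax_a q_pi(s, a)
pi_new = q.argmax(axis=1)
print(q)
print(pi_new)
```

By construction $\max_a q_\pi(s,a)\ge\sum_a\pi(a|s)q_\pi(s,a)=V_\pi(s)$, so the theorem applies to `pi_new`.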
If $V_{\pi'}=V_\pi$, then $V_{\pi'}=V_\pi=V_*$, proved as follows:
$$V_{\pi'}=V_\pi\;\;\Rightarrow\;\;q_{\pi'}=q_\pi$$
$$\begin{aligned}\forall s\in S,\;\;V_{\pi'}(s)&=\sum_a\pi'(a|s)q_{\pi'}(s,a)\\&=\sum_a\pi'(a|s)q_{\pi}(s,a)\\&=q_\pi(s,\pi'(s))\\&=\max_a q_\pi(s,a)\\&=\max_a\sum_{s',r}p(s',r|s,a)[r+\gamma V_\pi(s')]\\&=\max_a\sum_{s',r}p(s',r|s,a)[r+\gamma V_{\pi'}(s')]\end{aligned}$$
Therefore,
$$V_{\pi'}(s)=\max_a\sum_{s',r}p(s',r|s,a)[r+\gamma V_{\pi'}(s')]$$
And since
$$V_*(s)=\max_a\sum_{s',r}p(s',r|s,a)[r+\gamma V_*(s')]$$
we conclude $V_{\pi'}=V_*$,
i.e., $V_{\pi'}=V_\pi=V_*$.
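Putting evaluation and greedy improvement together, a minimal policy-iteration loop might look like the sketch below (again with hypothetical MDP arrays). The loop stops when the greedy policy no longer changes, which by the argument above means $V_{\pi'}=V_\pi=V_*$:

```python
import numpy as np

# Hypothetical tabular MDP: P[a, s, s'] = p(s'|s, a), R[s, a] = r(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [2.0, 1.5]])
gamma = 0.9
n_states = 2

pi = np.zeros(n_states, dtype=int)          # arbitrary deterministic initial policy
while True:
    # Policy evaluation: solve (I - gamma P_pi) V = r_pi exactly.
    P_pi = P[pi, np.arange(n_states)]       # P_pi[s, s'] = p(s'|s, pi(s))
    r_pi = R[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # Policy improvement: act greedily with respect to q_pi.
    q = R + gamma * (P @ V).T
    pi_new = q.argmax(axis=1)
    if np.array_equal(pi_new, pi):          # greedy policy unchanged => optimal
        break
    pi = pi_new
print(pi, V)
```

At termination `V` satisfies the Bellman optimality equation, so it equals $V_*$.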
二、Value Iteration
Value iteration is policy iteration in the extreme case: policy evaluation runs for only one step, and then policy improvement is applied.
Although value iteration takes only one evaluation step, it still updates all states in $S$. Even simpler than value iteration is in-place policy iteration (a special case of asynchronous policy iteration), which selects just a single state in $S$ for a one-step update.
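A minimal value-iteration sketch on the same kind of hypothetical tabular MDP: each sweep applies the Bellman optimality backup $V(s)\leftarrow\max_a\sum_{s',r}p(s',r|s,a)[r+\gamma V(s')]$ to every state in $S$:

```python
import numpy as np

# Hypothetical tabular MDP: P[a, s, s'] = p(s'|s, a), R[s, a] = r(s, a).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [2.0, 1.5]])
gamma = 0.9

V = np.zeros(2)
for _ in range(10000):
    q = R + gamma * (P @ V).T       # q[s, a]: one-step lookahead values
    V_new = q.max(axis=1)           # one evaluation step fused with greedy improvement
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
pi_star = q.argmax(axis=1)          # read off the greedy (optimal) policy at the end
print(V, pi_star)
```

Note that, unlike policy iteration, no explicit policy is maintained during the sweeps; the policy is extracted only once at the end.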
The end! Once more, the summary post: 机器学习-白板推导系列笔记