Chapter 4 Dynamic Programming

These notes are based on Reinforcement Learning: An Introduction and on
David Silver's open course and its slides.

David Silver's course spends more time on the tabular solution methods, so it can be used alongside the book to better understand the material in Reinforcement Learning: An Introduction.


DP refers to a collection of algorithms that can be used to compute optimal policies, given a perfect model of the environment as a Markov decision process (MDP) [this model is required]. The algorithms introduced in later chapters are not solved with DP itself; DP only provides the theoretical foundation for those later methods.

Pay close attention to the conditions required for solving a problem with DP. We assume the environment is a finite MDP: its state, action, and reward sets $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{R}$ are finite, and its dynamics are given by the set of probabilities $p(s', r \mid s, a)$.
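For concreteness (this example is not from the book or the course), the dynamics $p(s', r \mid s, a)$ of a small finite MDP can be stored as a plain lookup table. The hypothetical two-state MDP below exists only to make the later sketches in these notes runnable.

```python
# Hypothetical two-state, two-action finite MDP, used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) triples,
# i.e. a tabular representation of p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)],                  # stay in state 0, reward 0
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},  # try to move to state 1
    1: {0: [(1.0, 0, 0.0)],                  # move back to state 0
        1: [(1.0, 1, 2.0)]},                 # stay in state 1, reward 2
}
GAMMA = 0.9  # discount factor used in the examples below
```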

4.1 Policy Evaluation (Prediction)

Policy evaluation: estimate $v_\pi$
e.g. iterative policy evaluation

Policy evaluation means computing the state-value function $v_\pi$ for an arbitrary policy $\pi$. This is also referred to as the prediction problem.
Iterative Policy Evaluation, for estimating $V \approx v_{\pi}$
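A minimal Python sketch of iterative policy evaluation, assuming the tabular dynamics format P[s][a] = [(prob, next_state, reward), ...] from the example above; the function name, parameters, and the in-place (single-array) update are my own choices, not the book's pseudocode.

```python
def policy_evaluation(policy, P, gamma=0.9, theta=1e-8):
    """Iteratively estimate V ~ v_pi for a fixed policy.

    policy[s][a] is the probability of taking action a in state s;
    P[s][a] is a list of (prob, next_state, reward) triples.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = 0.0
            for a, pi_sa in policy[s].items():
                # Expected one-step return under the model, weighted by pi(a|s)
                v_new += pi_sa * sum(p * (r + gamma * V[s2])
                                     for p, s2, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update: later states use the new values
        if delta < theta:  # no state value changed by more than theta
            return V
```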

4.2 Policy Improvement

Policy improvement: generate $\pi' \ge \pi$
e.g. greedy policy improvement

Policy improvement theorem
The condition under which the new policy $\pi'$ is at least as good as $\pi$:

$$
q_\pi(s, \pi'(s)) \ge v_\pi(s) \ \ \forall s \in \mathcal{S}
\quad \Longrightarrow \quad
v_{\pi'}(s) \ge v_\pi(s) \ \ \forall s \in \mathcal{S}
$$

Proof:
$$
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s)) \\
&= \mathbb{E}\!\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s)\right] \\
&= \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right] \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\right] \\
&= \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma \mathbb{E}_{\pi'}\!\left[R_{t+2} + \gamma v_\pi(S_{t+2}) \mid S_{t+1}\right] \mid S_t = s\right] \\
&= \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi(S_{t+2}) \mid S_t = s\right] \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 v_\pi(S_{t+3}) \mid S_t = s\right] \\
&\;\;\vdots \\
&\le \mathbb{E}_{\pi'}\!\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \cdots \mid S_t = s\right] \\
&= v_{\pi'}(s).
\end{aligned}
$$

It is then natural to use the greedy policy: in every state $s$, choose the best action $a$ according to $q_\pi(s, a)$, which yields the new policy $\pi'$:
$$
\begin{aligned}
\pi'(s) &\doteq \arg\max_a q_\pi(s, a) \\
&= \arg\max_a \mathbb{E}\!\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\right] \\
&= \arg\max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]
\end{aligned}
$$
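A sketch of greedy policy improvement under the same assumed P format: compute $q_\pi(s, a)$ from $v_\pi$ using the model and pick the greedy action in every state. The function and variable names are hypothetical.

```python
def greedy_improvement(V, P, gamma=0.9):
    """Return a deterministic greedy policy pi'(s) = argmax_a q_pi(s, a)."""
    new_policy = {}
    for s in P:
        # q_pi(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v_pi(s')]
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
             for a in P[s]}
        best_a = max(q, key=q.get)
        # Store the deterministic policy as action probabilities so it can be
        # fed back into policy_evaluation unchanged.
        new_policy[s] = {a: (1.0 if a == best_a else 0.0) for a in P[s]}
    return new_policy
```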

4.3 Policy Iteration

Policy iteration alternates the Policy Evaluation (Prediction) and Policy Improvement steps until the process converges to the optimal policy:

$$
\pi_0 \xrightarrow{\;E\;} v_{\pi_0} \xrightarrow{\;I\;} \pi_1 \xrightarrow{\;E\;} v_{\pi_1} \xrightarrow{\;I\;} \pi_2 \xrightarrow{\;E\;} \cdots \xrightarrow{\;I\;} \pi_* \xrightarrow{\;E\;} v_*
$$

[Figure: policy iteration]
Note that the iteration shown above alternates between Policy Evaluation and Policy Improvement.

This process has been proven to converge, and it is guaranteed to converge to the optimal policy.
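Putting the two steps together gives a policy-iteration loop. This sketch reuses the hypothetical policy_evaluation and greedy_improvement functions from the earlier sketches and stops when greedy improvement no longer changes the policy.

```python
def policy_iteration(P, gamma=0.9):
    """Alternate full policy evaluation (E) and greedy improvement (I)
    until the policy is stable."""
    # Start from the uniformly random policy
    policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
    while True:
        V = policy_evaluation(policy, P, gamma)       # E step: compute v_pi
        new_policy = greedy_improvement(V, P, gamma)  # I step: greedy pi'
        if new_policy == policy:                      # policy is stable
            return new_policy, V
        policy = new_policy
```

Using a full evaluation in every iteration follows the scheme above exactly; the next section shows why that full evaluation can be truncated.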

4.4 Value Iteration

Unlike policy iteration, Value Iteration has no explicit policy evaluation step. A drawback of policy iteration is that every iteration requires a complete policy evaluation, which is very time-consuming.

The policy evaluation step can be truncated to only a few sweeps while still preserving the convergence guarantee of policy iteration. An important special case is stopping after a single sweep.
[Figure: value iteration]
In each sweep, one sweep of policy evaluation and one sweep of policy improvement are performed.
Note the difference from policy iteration: the backup uses $p(s', r \mid s, a)$ together with a max over actions $a$, rather than $p(s', r \mid s, \pi(s))$ for a fixed policy.
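A corresponding value-iteration sketch under the same assumed P format: each sweep applies the Bellman optimality backup (a max over actions) directly, rather than evaluating a fixed policy to convergence. Names and stopping rule are my own choices.

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Sweep the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # max_a sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]
            v_new = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy (near-optimal) policy from the converged values
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                             for p, s2, r in P[s][a]))
              for s in P}
    return policy, V
```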

| Problem | Bellman Equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
4.6 Generalized Policy Iteration (GPI)

The iteration described above, with policy evaluation and policy improvement interacting, is the general iterative framework of reinforcement learning.
[Figure: generalized policy iteration]


Reposted from blog.csdn.net/dengyibing/article/details/80460901