Policy Iteration in Reinforcement Learning

1. Policy evaluation:

In policy evaluation, our aim is to compute the state-value function $v_{\pi}$ for an arbitrary policy $\pi$. Writing the following Bellman expectation equation once for every $s \in S$ gives a system of $|S|$ linear equations:

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right]$$
Rather than solving these equations directly for the given policy, we use an iterative solution method. Specifically, in the first iteration we assume every state value is zero. Based on the Bellman equation above, we then update the value of every state, and this process continues until the state values converge. One typical stopping condition is to halt once the largest change across all states falls below a small threshold $\theta$:

$$\max_{s \in S} \left| v_{k+1}(s) - v_{k}(s) \right| < \theta$$
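As a concrete illustration, here is a minimal sketch of this iterative evaluation in Python. It assumes a tabular MDP whose transition model `P[s][a]` is a list of `(prob, next_state, reward)` triples and whose policy `pi[s][a]` gives the probability of action `a` in state `s`; this data layout and the names `gamma` and `theta` are assumptions for illustration, not from the original post:

```python
import numpy as np

def policy_evaluation(P, pi, gamma=0.9, theta=1e-6):
    """Evaluate v_pi for a fixed policy pi by sweeping the Bellman
    expectation backup over all states until the values converge."""
    n_states = len(P)
    v = np.zeros(n_states)  # first iteration: every state value starts at zero
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman expectation backup for state s:
            # v(s) <- sum_a pi(a|s) * sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
            new_v = sum(
                pi[s][a] * sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:  # stop when the largest value change is below theta
            break
    return v
```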

2. Policy improvement

The aim of this step is to select a better action for each state. In particular, when the agent takes action $a$ in state $s$ and follows $\pi$ thereafter, its action-value function is:

$$q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right]$$
We use this equation to select a greedy action for each state: if $\max_{a} q_{\pi}(s, a)$ is larger than the current $v_{\pi}(s)$, we update the policy's action at $s$, as in the sketch below.
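A minimal sketch of greedy improvement, under the same assumed `P[s][a]` transition layout as above (again an illustration, not the original author's code):

```python
import numpy as np

def policy_improvement(P, v, gamma=0.9):
    """Return the greedy policy with respect to the current value
    estimate v: for each state, pick the action maximizing q(s, a)."""
    n_states = len(P)
    greedy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma v(s')]
        q = [sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
             for a in range(len(P[s]))]
        greedy[s] = int(np.argmax(q))  # keep or replace the action at s
    return greedy
```

Alternating `policy_evaluation` and `policy_improvement` until the policy stops changing gives the full policy iteration loop.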

(To be continued)

Reposted from blog.csdn.net/liverpool_05/article/details/80050751