1. Policy evaluation:
In policy evaluation, our aim is to compute the state-value function $v_{\pi}$ for an arbitrary policy $\pi$. The Bellman expectation equation below holds for every $s \in S$, giving us a system of $|S|$ linear equations (one per state):

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_{\pi}(s')\right]$$
Next, rather than solving this linear system directly, we use an iterative solution method. Specifically, in the first iteration we initialize the value of every state to zero. We then repeatedly apply the Bellman equation above as an update rule, sweeping over all states and replacing each state's value with its one-step lookahead estimate. This process continues until the state values converge. A typical stopping condition is to halt when the largest change in any state's value during a sweep falls below a small threshold $\theta$:

$$\max_{s \in S} \left| v_{k+1}(s) - v_{k}(s) \right| < \theta$$
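The iterative procedure above can be sketched in code. This is a minimal illustration on a tabular MDP; the representation of the transition model `P[s][a]` as a list of `(probability, next_state, reward)` tuples and the policy `pi[s][a]` as action probabilities are assumptions for the sketch, not part of the original text.

```python
import numpy as np

def policy_evaluation(P, pi, n_states, n_actions, gamma=0.9, theta=1e-6):
    """Iteratively evaluate v_pi until the largest update falls below theta.

    P:  assumed transition model, P[s][a] -> list of (prob, next_state, reward)
    pi: assumed policy, pi[s][a] -> probability of taking action a in state s
    """
    V = np.zeros(n_states)  # iteration one: all state values start at zero
    while True:
        delta = 0.0
        for s in range(n_states):
            # One-step lookahead using the Bellman expectation equation.
            v_new = 0.0
            for a in range(n_actions):
                for prob, s_next, r in P[s][a]:
                    v_new += pi[s][a] * prob * (r + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place (Gauss-Seidel style) update
        if delta < theta:  # typical stopping condition: max change < theta
            return V
```

For example, on a two-state chain where state 0 transitions to an absorbing state 1 with reward 1, the procedure converges to $v_{\pi}(0) = 1$ and $v_{\pi}(1) = 0$.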
2. Policy improvement
The aim of this step is to select a better action for each state. In particular, when the agent takes action $a$ in state $s$ and thereafter follows policy $\pi$, its action-value function is:

$$q_{\pi}(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_{\pi}(s')\right]$$
We use this equation to select the best action for each state: if $\max_{a} q_{\pi}(s, a)$ is larger than the current $v_{\pi}(s)$, we update the policy in state $s$ to take the greedy action $\arg\max_{a} q_{\pi}(s, a)$.
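The greedy improvement step can be sketched as follows. As in the evaluation sketch, the `P[s][a]` transition-model format is an assumption for illustration; the function returns a deterministic policy that picks the action with the highest one-step lookahead value.

```python
import numpy as np

def q_value(P, V, s, a, gamma=0.9):
    """q_pi(s, a): one-step lookahead over the assumed transition model P."""
    return sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])

def policy_improvement(P, V, n_states, n_actions, gamma=0.9):
    """Make the policy greedy: in each state, pick the action maximizing q."""
    pi = {}
    for s in range(n_states):
        q = [q_value(P, V, s, a, gamma) for a in range(n_actions)]
        best = int(np.argmax(q))
        # Deterministic policy: probability 1 on the greedy action.
        pi[s] = {a: (1.0 if a == best else 0.0) for a in range(n_actions)}
    return pi
```

Alternating policy evaluation and this improvement step until the policy stops changing is exactly the policy iteration loop the two numbered sections describe.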
(To be continued)