Value Iteration
Through the study in the previous chapter, we know that solving the Bellman optimality equation actually splits into two parts: first, given an initial value $v_k$, find the optimal policy $\pi_{k+1}$; second, update the value to obtain $v_{k+1}$.
Below, we will analyze this algorithm in detail, along with its programming implementation. First, let's look at its first step: the policy update.
From the given $v_k$, we can compute the action value $q_k(s,a)$ for every state; then, by assigning all the probability to the best action, we obtain the greedy action under the optimal policy, $a_k^*(s) = \arg\max_a q_k(s,a)$.
The second step is the value update. Likewise, from the given $v_k$ we compute $q_k(s,a)$ for each state, and then take the value of the greedy action to obtain $v_{k+1}(s) = \max_a q_k(s,a)$.
Putting these two steps together, we get the following process:
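In symbols, with $q_k$ built from $v_k$ as in the Bellman equation, the two steps read (a standard formulation, consistent with the definitions above):

$$q_k(s,a) = \sum_{r} p(r \mid s,a)\, r + \gamma \sum_{s'} p(s' \mid s,a)\, v_k(s')$$

$$\pi_{k+1}(a \mid s) = \begin{cases} 1, & a = a_k^*(s) = \arg\max_{a} q_k(s,a) \\ 0, & \text{otherwise} \end{cases} \qquad v_{k+1}(s) = \max_{a} q_k(s,a)$$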
The pseudocode of the above algorithm is given as follows:
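As a stand-in for the pseudocode figure, here is a minimal Python/NumPy sketch of tabular value iteration. The representation (a transition tensor `P` and reward matrix `R`) and all names are illustrative assumptions, not the original's notation:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular value iteration (illustrative sketch).

    P[s, a, s2]: transition probability p(s2 | s, a)
    R[s, a]:     expected immediate reward E[r | s, a]
    Returns the converged value vector v and a greedy policy pi.
    """
    num_states, num_actions = R.shape
    v = np.zeros(num_states)            # initial guess v_0
    while True:
        # q_k(s, a) = R[s, a] + gamma * sum_{s2} P[s, a, s2] * v_k(s2)
        q = R + gamma * (P @ v)
        v_new = q.max(axis=1)           # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.abs(v_new - v).max() < tol:
            break
        v = v_new
    pi = q.argmax(axis=1)               # policy update: greedy action per state
    return v_new, pi
```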
Value Iteration: Examples
Let's deepen our understanding with an example. The rewards are $r_{\text{boundary}} = r_{\text{trap}} = -1$ and $r_{\text{endpoint}} = +1$, with discount factor $\gamma = 0.9$.
When $k = 0$:
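The original walks through $k = 0, 1, \dots$ with figures of the grid. As a runnable stand-in, here is a hypothetical one-dimensional, three-cell version of such a world (the corridor layout is my simplification, not the original grid), reusing the `value_iteration` sketch above:

```python
import numpy as np

# Hypothetical 1-D corridor, for illustration only: cell 0 is the trap
# (r_trap = -1), cell 2 is the endpoint (r_endpoint = +1), and stepping
# off either end bounces back with r_boundary = -1.
S, A = 3, 3                     # actions: 0 = left, 1 = stay, 2 = right
P = np.zeros((S, A, S))
R = np.zeros((S, A))
for s in range(S):
    for a, move in enumerate((-1, 0, +1)):
        s2 = s + move
        if s2 < 0 or s2 >= S:   # boundary: bounce back
            s2, r = s, -1.0
        elif s2 == 0:           # stepping into (or staying in) the trap
            r = -1.0
        elif s2 == 2:           # reaching (or staying at) the endpoint
            r = +1.0
        else:
            r = 0.0
        P[s, a, s2] = 1.0
        R[s, a] = r

v, pi = value_iteration(P, R, gamma=0.9)
print(v)    # converged state values
print(pi)   # greedy action per state (moves right, away from the trap)
```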
Policy Iteration
Policy iteration is divided into two steps: policy evaluation (PE) and policy improvement (PI).
There are two ways to solve for $v_{\pi_k}$ from the Bellman equation $v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$: the closed-form matrix solution $v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k}$, which is generally not used in practice, and the iterative method $v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}$, which is the one mainly used.
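As a sketch, both solution methods fit in a few lines of NumPy; `r_pi` and `P_pi` denote the reward vector and state-transition matrix induced by the current policy $\pi_k$ (the names are illustrative):

```python
import numpy as np

def policy_evaluation_exact(r_pi, P_pi, gamma=0.9):
    """Closed-form (matrix) solution: v = (I - gamma * P_pi)^{-1} r_pi."""
    n = len(r_pi)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_evaluation_iterative(r_pi, P_pi, gamma=0.9, tol=1e-6):
    """Iterative solution: v^(j+1) = r_pi + gamma * P_pi @ v^(j)."""
    v = np.zeros_like(r_pi)
    while True:
        v_new = r_pi + gamma * (P_pi @ v)
        if np.abs(v_new - v).max() < tol:
            return v_new
        v = v_new
```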
The specific steps of policy iteration are as follows:
The pseudocode is as follows:
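The original pseudocode is given as a figure; below is a minimal sketch of the same loop in Python, reusing `policy_evaluation_iterative` from above and the same illustrative `P`/`R` conventions as before:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, tol=1e-6):
    """Tabular policy iteration: alternate policy evaluation (PE)
    and policy improvement (PI) until the greedy policy is stable.

    P[s, a, s2] = p(s2 | s, a);  R[s, a] = E[r | s, a].
    """
    num_states, num_actions = R.shape
    pi = np.zeros(num_states, dtype=int)      # arbitrary initial policy
    idx = np.arange(num_states)
    while True:
        # PE: solve v_{pi_k} for the current policy (iterative method)
        r_pi, P_pi = R[idx, pi], P[idx, pi]
        v = policy_evaluation_iterative(r_pi, P_pi, gamma, tol)
        # PI: act greedily with respect to q_{pi_k}
        q = R + gamma * (P @ v)
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return v, pi
        pi = pi_new
```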
Policy Iteration: An Example
Similarly, we deepen our understanding with an example. The rewards are $r_{\text{boundary}} = -1$ and $r_{\text{endpoint}} = +1$, with $\gamma = 0.9$; the available actions are moving left $a_l$, moving right $a_r$, and staying in place $a_0$.
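The per-iteration tables are in the original figures; as a runnable stand-in, here is a hypothetical two-cell version of this setup (the layout is my assumption: the right cell is the endpoint), fed to the `policy_iteration` sketch above:

```python
import numpy as np

# Hypothetical two-cell world: cell 1 is the endpoint (r_endpoint = +1);
# stepping off either end bounces back with r_boundary = -1.
# Action indices map to the text as: a_l = 0 (left), a_0 = 1 (stay), a_r = 2 (right).
S, A = 2, 3
P = np.zeros((S, A, S))
R = np.zeros((S, A))
for s in range(S):
    for a, move in enumerate((-1, 0, +1)):
        s2 = s + move
        if s2 < 0 or s2 >= S:    # boundary: bounce back
            s2, r = s, -1.0
        elif s2 == 1:            # reaching (or staying at) the endpoint
            r = +1.0
        else:
            r = 0.0
        P[s, a, s2] = 1.0
        R[s, a] = r

v, pi = policy_iteration(P, R, gamma=0.9)
print(v, pi)   # expect: move right from cell 0, stay at cell 1
```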
Policy Iteration: Example Two
Truncated Policy Iteration Algorithm
First, let's compare value iteration and policy iteration. Both alternate between a policy step and a value step; the difference lies in the value step: policy iteration solves $v_{\pi_k}$ exactly (which requires infinitely many sweeps of the iterative method), while value iteration performs exactly one sweep. Truncated policy iteration sits between the two: its policy-evaluation loop runs only a fixed number of sweeps, $j_{\text{truncate}}$.
The pseudocode is as follows:
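A minimal sketch of the truncated loop, with the same illustrative conventions as before (`j_truncate` is the number of evaluation sweeps; with `j_truncate = 1` this reduces to value iteration, and letting the inner loop run to convergence recovers policy iteration):

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=5, tol=1e-6):
    """Truncated policy iteration: policy evaluation is cut off after
    j_truncate sweeps instead of being run to convergence.

    P[s, a, s2] = p(s2 | s, a);  R[s, a] = E[r | s, a].
    """
    num_states, num_actions = R.shape
    v = np.zeros(num_states)
    idx = np.arange(num_states)
    while True:
        # policy improvement: greedy policy with respect to q
        q = R + gamma * (P @ v)
        pi = q.argmax(axis=1)
        # truncated policy evaluation: only j_truncate sweeps under pi
        r_pi, P_pi = R[idx, pi], P[idx, pi]
        v_new = v.copy()
        for _ in range(j_truncate):
            v_new = r_pi + gamma * (P_pi @ v_new)
        if np.abs(v_new - v).max() < tol:
            return v_new, pi
        v = v_new
```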