Reinforcement Learning: Value Iteration and Policy Iteration

Value Iteration

  Through the study in the previous chapter, we know that solving the Bellman optimality equation actually splits into two parts: first, given an initial value $v_k$, find the optimal policy $\pi_{k+1}$; second, update the value to $v_{k+1}$.
  Below, we analyze this algorithm in detail, along with its programming implementation. First, let's look at its first step: the policy update.
  Given $v_k$, the action value $q_k(s,a)$ of each state can be obtained; then, by assigning all of the probability to the greedy action, we obtain the action $a_k^*(s)$ taken under the optimal policy.

  The second step is the value update: likewise, given $v_k$, find $q_k(s,a)$ for each state, then evaluate under the optimal (greedy) policy to obtain $v_{k+1}$.
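  In the usual notation, writing $p(r \mid s,a)$ and $p(s' \mid s,a)$ for the reward and transition models, these two steps take the standard form below (a restatement of the textbook formulas, not a new result):

$$
q_k(s,a) = \sum_{r} p(r \mid s,a)\, r \;+\; \gamma \sum_{s'} p(s' \mid s,a)\, v_k(s')
$$

$$
\pi_{k+1}(a \mid s) =
\begin{cases}
1, & a = a_k^*(s) = \arg\max_{a'} q_k(s,a') \\
0, & \text{otherwise}
\end{cases}
\qquad\quad
v_{k+1}(s) = \max_{a} q_k(s,a)
$$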
Through the above explanation, we get the following process:
[figure: the value iteration procedure]
The pseudocode of the above algorithm is given as follows:
[figure: value iteration pseudocode]
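As a concrete sketch (not the original pseudocode), here is a minimal NumPy implementation of value iteration for a finite MDP. The array shapes and the names `P`, `R`, `gamma`, `theta` are assumptions made for this sketch:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Value iteration for a finite MDP.

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = p(s' | s, a)
    R: expected immediate rewards, shape (S, A), R[s, a] = E[r | s, a]
    Returns the greedy (deterministic) policy and the converged value vector.
    """
    n_states, n_actions = R.shape
    v = np.zeros(n_states)                      # initial guess v_0
    while True:
        q = R + gamma * P @ v                   # q_k(s, a) for all states and actions
        v_new = q.max(axis=1)                   # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < theta:   # stop once v_{k+1} is (almost) unchanged
            v = v_new
            break
        v = v_new
    q = R + gamma * P @ v                       # recompute q with the converged v
    policy = q.argmax(axis=1)                   # policy update: greedy action per state
    return policy, v
```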

Value Iteration: Examples

  Let's deepen our understanding with an example: $r_{\text{boundary}} = r_{\text{trap}} = -1$, $r_{\text{endpoint}} = +1$, $\gamma = 0.9$.

[figures: the example environment]

When $k = 0$:
[figures: the first value-iteration step on the example]
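To make this runnable, here is a hypothetical two-state stand-in (the actual grid appears only in the figures above) fed to the `value_iteration` sketch from the previous section; the state/action numbering is an assumption for illustration only:

```python
import numpy as np

# Hypothetical stand-in: state 0 is an ordinary cell, state 1 is the endpoint.
# Actions: 0 = move left, 1 = stay, 2 = move right (deterministic moves).
P = np.zeros((2, 3, 2))   # P[s, a, s'] = p(s' | s, a)
R = np.zeros((2, 3))      # R[s, a] = expected immediate reward

# State 0: moving left hits the boundary (stay put, r = -1), staying gives r = 0,
# moving right reaches the endpoint with r = +1.
P[0, 0, 0], R[0, 0] = 1.0, -1.0
P[0, 1, 0], R[0, 1] = 1.0,  0.0
P[0, 2, 1], R[0, 2] = 1.0, +1.0

# State 1 (endpoint): every action stays there and keeps collecting r = +1.
P[1, :, 1] = 1.0
R[1, :] = 1.0

policy, v = value_iteration(P, R, gamma=0.9)
print(policy)   # greedy action in state 0 should be 2 (move right)
print(v)        # both values should be close to 1 / (1 - 0.9) = 10
```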

Policy Iteration

  Policy iteration is divided into two steps: policy evaluation (PE) and policy improvement (PI).
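  In matrix–vector form, with $r_{\pi}$ the expected-reward vector and $P_{\pi}$ the state-transition matrix induced by a policy $\pi$, the two steps are usually written as:

$$
\text{PE:}\quad v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}
\qquad\qquad
\text{PI:}\quad \pi_{k+1} = \arg\max_{\pi}\left(r_{\pi} + \gamma P_{\pi} v_{\pi_k}\right)
$$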

  There are two methods for solving $v_{\pi_k}$: the first, a closed-form matrix solution, is generally not used; the second, an iterative method, is the one mainly used.
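A minimal sketch of the iterative method, assuming the same `P` and `R` arrays as in the value-iteration sketch above and a deterministic policy given as one action index per state (the names are assumptions for this sketch):

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-6):
    """Iteratively solve v_pi = r_pi + gamma * P_pi v_pi for a deterministic policy.

    policy: array of shape (S,) giving the chosen action index in each state.
    """
    n_states = len(policy)
    idx = np.arange(n_states)
    r_pi = R[idx, policy]                       # r_pi(s)    = R(s, pi(s))
    P_pi = P[idx, policy]                       # P_pi[s,s'] = p(s' | s, pi(s))
    v = np.zeros(n_states)                      # initial guess v_pi^(0)
    while True:
        v_new = r_pi + gamma * P_pi @ v         # v_pi^(j+1) = r_pi + gamma * P_pi v_pi^(j)
        if np.max(np.abs(v_new - v)) < theta:   # stop when the iterates have converged
            return v_new
        v = v_new
```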

  The specific steps of policy iteration are as follows:
[figure: the policy iteration procedure]

The pseudocode is as follows:
[figure: policy iteration pseudocode]
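A matching Python sketch of the overall loop, reusing the `policy_evaluation` function above (again, the names and shapes are assumptions for this sketch, not the original pseudocode):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Policy iteration: alternate policy evaluation (PE) and greedy improvement (PI)."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial policy pi_0
    while True:
        v = policy_evaluation(P, R, policy, gamma)  # PE: compute v_{pi_k}
        q = R + gamma * P @ v                       # q_{pi_k}(s, a) for all s, a
        new_policy = q.argmax(axis=1)               # PI: greedy policy w.r.t. q_{pi_k}
        if np.array_equal(new_policy, policy):      # stop when the policy no longer changes
            return policy, v
        policy = new_policy
```

On the hypothetical two-state MDP used earlier, `policy_iteration(P, R, gamma=0.9)` should return the same greedy policy and essentially the same values as the value-iteration sketch.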

Policy Iteration: An Example

  Similarly, we deepen our understanding with an example: $r_{\text{boundary}} = -1$, $r_{\text{endpoint}} = +1$, $\gamma = 0.9$. The actions are: move left $a_l$, move right $a_r$, and stay in place $a_0$.
[figures: the example environment and the policy iteration results]

Policy Iteration: A Second Example

[figure: the second example]

Truncated Policy Iteration Algorithm

  First, let's compare the differences between value iteration and policy iteration:
[figure: comparison of value iteration and policy iteration]

The pseudocode is as follows:
[figure: truncated policy iteration pseudocode]
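The idea can be sketched in Python as follows: run only a fixed number of policy-evaluation sweeps instead of iterating evaluation to convergence. With one sweep this reduces to value iteration; letting the number of sweeps grow recovers policy iteration. The names `j_truncate`, `P`, `R` are assumptions for this sketch:

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=5, theta=1e-6):
    """Truncated policy iteration: policy evaluation is cut off after j_truncate sweeps."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v                       # q(s, a) under the current value estimate
        policy = q.argmax(axis=1)                   # policy improvement: greedy action per state
        idx = np.arange(n_states)
        r_pi, P_pi = R[idx, policy], P[idx, policy] # reward vector / transition matrix under pi
        v_new = v.copy()
        for _ in range(j_truncate):                 # truncated policy evaluation
            v_new = r_pi + gamma * P_pi @ v_new
        if np.max(np.abs(v_new - v)) < theta:       # stop when the value estimate stabilizes
            return policy, v_new
        v = v_new
```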


Origin blog.csdn.net/qq_50086023/article/details/130799817