RL-Zhao-(8)-Value-Based03: Q-learning with Function Approximation [Goal: learn the optimal value-function parameters $w$, and obtain the optimal action values from the resulting value function]

We already know:

"TD learning" with value function approximation:
$$\color{red}{w_{t+1}=w_t+\alpha_t\left[r_{t+1}+\gamma\hat{v}(s_{t+1},w_t)-\hat{v}(s_t,w_t)\right]\nabla_w\hat{v}(s_t,w_t)}$$
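As a concrete illustration, here is a minimal sketch of this update for a *linear* approximator $\hat{v}(s,w)=\phi(s)^\top w$, in which case $\nabla_w\hat{v}(s,w)=\phi(s)$. The function and variable names (`td0_vfa_update`, `phi_s`, ...) are illustrative and not from the original lecture:

```python
import numpy as np

def td0_vfa_update(w, phi_s, phi_s_next, r, alpha, gamma):
    """One TD(0) step for a linear approximator v_hat(s, w) = phi(s)^T w,
    where grad_w v_hat(s, w) = phi(s)."""
    td_target = r + gamma * phi_s_next @ w   # r_{t+1} + gamma * v_hat(s_{t+1}, w_t)
    td_error = td_target - phi_s @ w         # ... - v_hat(s_t, w_t)
    return w + alpha * td_error * phi_s      # w_{t+1} = w_t + alpha_t * delta_t * grad
```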

"Sarsa" with value function approximation:
$$\color{red}{w_{t+1}=w_t+\alpha_t\left[r_{t+1}+\gamma\hat{q}(s_{t+1},a_{t+1},w_t)-\hat{q}(s_t,a_t,w_t)\right]\nabla_w\hat{q}(s_t,a_t,w_t)}$$
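The same gradient step applies here, except that the approximator is now $\hat{q}(s,a,w)=\phi(s,a)^\top w$ over state-action features and the target uses the next pair $(s_{t+1},a_{t+1})$ actually chosen by the policy. A minimal sketch under that linear assumption (`sarsa_vfa_update`, `phi_sa` are again illustrative names):

```python
import numpy as np

def sarsa_vfa_update(w, phi_sa, phi_sa_next, r, alpha, gamma):
    """One Sarsa step for a linear q_hat(s, a, w) = phi(s, a)^T w.
    phi_sa_next is the feature of the next pair (s_{t+1}, a_{t+1}) chosen by the policy."""
    td_error = r + gamma * phi_sa_next @ w - phi_sa @ w
    return w + alpha * td_error * phi_sa     # gradient of a linear q_hat is phi(s, a)
```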

Similarly, tabular Q-learning can also be extended to the case of value function approximation. The q-value update rule is:

"Q-learning" with value function approximation:

$$\color{red}{w_{t+1}=w_t+\alpha_t\left[r_{t+1}+\gamma\max_{a\in\mathcal{A}(s_{t+1})}\hat{q}(s_{t+1},a,w_t)-\hat{q}(s_t,a_t,w_t)\right]\nabla_w\hat{q}(s_t,a_t,w_t)}$$

This is the same algorithm as Sarsa with function approximation, except that $\hat{q}(s_{t+1},a_{t+1},w_t)$ is replaced by $\max_{a\in\mathcal{A}(s_{t+1})}\hat{q}(s_{t+1},a,w_t)$.
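In code, that replacement is just a `max` over the candidate actions when forming the TD target. A minimal sketch, again assuming a linear $\hat{q}(s,a,w)=\phi(s,a)^\top w$ and an illustrative feature function `phi`:

```python
import numpy as np

def q_learning_vfa_update(w, phi, s, a, r, s_next, actions, alpha, gamma):
    """One Q-learning step for a linear q_hat(s, a, w) = phi(s, a)^T w.
    The only change from the Sarsa step: the target maximizes over a in A(s_{t+1})."""
    q_next_max = max(phi(s_next, b) @ w for b in actions)   # max_a q_hat(s_{t+1}, a, w_t)
    td_error = r + gamma * q_next_max - phi(s, a) @ w
    return w + alpha * td_error * phi(s, a)
```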

Q-learning with function approximation pseudocode (on-policy version):

For each episode we do the following (a minimal code sketch of these steps is given after the list):

  • If the current state $s_t$ is not yet the target state, we do the following. This task corresponds to starting from a given state and then finding a good path to the target state. So the first step is to generate the data:

    • At state $s_t$, take an action $a_t$ according to the policy $\pi_t(s_t)$, then interact with the environment to obtain $r_{t+1}$ and $s_{t+1}$;
    • Then, based on this data, we do the value update:
      $w_{t+1}=w_t+\alpha_t\left[r_{t+1}+\gamma\max_{a\in\mathcal{A}(s_{t+1})}\hat{q}(s_{t+1},a,w_t)-\hat{q}(s_t,a_t,w_t)\right]\nabla_w\hat{q}(s_t,a_t,w_t)$
      Note that here we are not updating $\hat{q}(s_t,a_t)$ directly, i.e., we are not computing what $\hat{q}(s_t,a_t)$ should equal; instead, we update its weight parameter $w$. This is the only difference from the tabular case.
  • With the updated value function we can then do the policy update, which is the same as in the tabular case: at state $s_t$, the action with the largest action value is given a relatively large probability, and the other actions are given a relatively small probability. The policy here is ε-greedy.

    • It is worth noting that in the tabular case we could simply index into the table to read off $\hat{q}(s,a)$ directly;
    • Now we need to do some computation: substitute $s$ and each candidate action $a$ into the value function, evaluate $\hat{q}(s,a,w)$, and then compare the resulting values.
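Putting the steps above together, here is a minimal, hedged Python sketch of one episode of this on-policy Q-learning with a linear approximator. The environment interface (`env.reset()`, `env.step(a)` returning `(s_next, r, done)`), the feature function `phi`, and all other names are assumptions made for illustration, not part of the original lecture.

```python
import numpy as np

def q_hat(s, a, w, phi):
    """Linear approximator: q_hat(s, a, w) = phi(s, a)^T w (an assumed form)."""
    return phi(s, a) @ w

def epsilon_greedy(s, w, phi, actions, eps, rng):
    """Policy update rule: the greedy action gets a large probability, others a small one."""
    if rng.random() < eps:
        return actions[rng.integers(len(actions))]            # explore
    q_values = [q_hat(s, a, w, phi) for a in actions]
    return actions[int(np.argmax(q_values))]                  # exploit

def run_episode(env, w, phi, actions, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """One episode of on-policy Q-learning with (linear) function approximation."""
    rng = np.random.default_rng(seed)
    s = env.reset()                                           # assumed environment interface
    done = False
    while not done:                                           # until s_t is the target state
        a = epsilon_greedy(s, w, phi, actions, eps, rng)      # generate data with pi_t(s_t)
        s_next, r, done = env.step(a)                         # observe r_{t+1}, s_{t+1}
        # Value update: change the parameter w, not a table entry.
        q_next_max = 0.0 if done else max(q_hat(s_next, b, w, phi) for b in actions)
        td_error = r + gamma * q_next_max - q_hat(s, a, w, phi)
        w = w + alpha * td_error * phi(s, a)                  # grad_w of a linear q_hat is phi(s, a)
        # Policy update is implicit: epsilon_greedy always reads the latest w.
        s = s_next
    return w
```

Note how comparing action values now means evaluating $\hat{q}(s,a,w)$ for each candidate action rather than indexing into a table, exactly as the last bullet above points out.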





Reference materials:
[Reinforcement Learning] Mathematical Basis of Reinforcement Learning: Value Function Approximation
6. Value Function Approximation
Lecture 6: Value Function Approximation
