Reinforcement Learning Study Notes 11: Off-policy Methods with Approximation

In the previous chapters we discussed off-policy methods. Their key difference from on-policy methods is that the actions taken during training come from the behavior policy rather than the target policy, which lets the agent keep exploring while still learning about the policy it wants to exploit. This section discusses how to apply off-policy reinforcement learning with function approximation.

1. Importance sampling

A major problem with off-policy methods is the mismatch between the target policy and the behavior policy: the data are generated by b, not by \pi. This mismatch can be corrected with importance sampling:

\rho_t = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}

G_{t:t+1}^{\pi}(s) =\rho_t G_{t:t+1}^{b}(s),\qquad G_{t:t+n}^{\pi}(s)=\Big(\prod_{i=0}^{n-1} \rho_{t+i}\Big)G_{t:t+n}^{b}(s)
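
As a concrete illustration, here is a minimal Python sketch of computing the per-step ratios and the corrected n-step return; the policy probabilities, rewards, and discount are made-up values, and the bootstrapping term of the n-step return is left out for brevity:

```python
import numpy as np

# Hypothetical action probabilities under the target policy pi and the behavior
# policy b for the actions actually taken at steps t, t+1, ..., t+n-1.
pi_probs = np.array([0.9, 0.8, 0.7])   # pi(A_{t+i} | S_{t+i})
b_probs  = np.array([0.5, 0.5, 0.5])   # b(A_{t+i} | S_{t+i})
rho = pi_probs / b_probs               # per-step ratios rho_{t+i}

# Uncorrected n-step return observed under b (bootstrap term omitted here).
rewards = np.array([1.0, 0.0, 2.0])
gamma = 0.9
G_b = sum(gamma ** i * r for i, r in enumerate(rewards))

# Importance-sampling correction: multiply by the product of the ratios.
G_pi = np.prod(rho) * G_b
print(rho, G_b, G_pi)
```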

The parameter update formula for the value model from earlier then becomes, in the one-step case,

w_{t+1}=w_t + \alpha \rho_t (G_{t:t+1}-v(s_t|w))\,\partial_w v(s_t|w)

and in the n-step case

w_{t+1}=w_t + \alpha \Big(\prod_{i=0}^{n-1} \rho_{t+i}\Big) (G_{t:t+n}-v(s_t|w))\,\partial_w v(s_t|w)=w_t + \alpha \Big(\prod_{i=0}^{n-1} \rho_{t+i}\Big)\delta_t\, \partial_w v(s_t|w)

Here \delta_t can be set according to the TD(0) vs. TD(n) and discounted-reward vs. average-reward variants from the previous section (a code sketch of the resulting update follows the list):

  • TD(0), discounted rewards: \delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}|w)-Q(s_t, a_t|w)
  • TD(n), discounted rewards: \delta_t = \sum_{i=0}^{n-1} \gamma^i r_{t+i+1} + \gamma^n Q(s_{t+n}, a_{t+n}|w)-Q(s_t, a_t|w)
  • TD(0), average rewards: \delta_t = r_{t+1} - r_t + Q(s_{t+1}, a_{t+1}|w)-Q(s_t, a_t|w)
  • TD(n), average rewards: \delta_t = \sum_{i=0}^{n-1} (r_{t+i+1} - r_t) + Q(s_{t+n}, a_{t+n}|w)-Q(s_t, a_t|w)
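
As referenced above, here is a minimal sketch of the resulting update for the TD(0), discounted-reward case with a linear value model v(s|w) = w . x(s); the feature vectors, step size, and ratio value are illustrative assumptions:

```python
import numpy as np

def off_policy_td0_update(w, x_t, x_tp1, r_tp1, rho_t, alpha=0.1, gamma=0.9):
    """One semi-gradient off-policy TD(0) step for a linear value model
    v(s|w) = w . x(s); rho_t is the importance sampling ratio at time t."""
    delta_t = r_tp1 + gamma * (w @ x_tp1) - (w @ x_t)   # TD(0) error, discounted
    return w + alpha * rho_t * delta_t * x_t            # grad_w v(s_t|w) = x_t

# Toy usage with hypothetical 4-dimensional state features.
w = np.zeros(4)
x_t, x_tp1 = np.array([1.0, 0.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0, 0.0])
w = off_policy_td0_update(w, x_t, x_tp1, r_tp1=1.0, rho_t=1.8)
print(w)
```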

In addition, Chapter 7 introduced another off-policy method that does not rely on importance sampling: the tree-backup algorithm.

G_t(s_t,a_t)=r_{t+1} + \gamma \sum_{a'\neq a_{t+1}}\pi (a'|s_{t+1}) Q(s_{t+1},a') + \gamma \pi (a_{t+1}|s_{t+1}) G_{t+1}(s_{t+1},a_{t+1})
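
A recursive Python sketch of this n-step tree-backup return follows; the Q table, target-policy array, and the short trajectory are made-up toy values, and the termination handling of the book's full algorithm is omitted:

```python
import numpy as np

def tree_backup_return(t, states, actions, rewards, Q, pi, gamma=0.9, n=3):
    """Recursively compute the n-step tree-backup return G_t(s_t, a_t).
    Q[s, a] are the current action-value estimates and pi[s, a] the target-policy
    probabilities; rewards[t] corresponds to r_{t+1} in the formula above."""
    s_next, a_next = states[t + 1], actions[t + 1]
    if n == 1 or t + 2 >= len(states):
        # Base case: bootstrap fully from the expected action value at s_{t+1}.
        tail = Q[s_next] @ pi[s_next]
    else:
        # Actions not taken contribute their estimated value; the taken action
        # contributes the recursively computed deeper return.
        others = sum(pi[s_next, a] * Q[s_next, a]
                     for a in range(Q.shape[1]) if a != a_next)
        tail = others + pi[s_next, a_next] * tree_backup_return(
            t + 1, states, actions, rewards, Q, pi, gamma, n - 1)
    return rewards[t] + gamma * tail

# Toy usage: hypothetical 2-state, 2-action estimates and a short trajectory.
Q = np.array([[1.0, 0.5], [0.2, 0.8]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
print(tree_backup_return(0, states=[0, 1, 0, 1], actions=[0, 1, 0, 1],
                         rewards=[1.0, 0.0, 1.0], Q=Q, pi=pi))
```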

2. Off-policy Divergence

Because of the mismatch between the target policy and the behavior policy, the parameters of the value model may fail to converge when the mismatch is too large: off-policy training can repeatedly update states that the target policy would rarely visit, and with function approximation these updates can push the overall weight vector w to grow without bound.

One way to address this is to use Q-learning: the behavior policy selects the next action ε-greedily with respect to Q(s, a|w), while the update bootstraps from the maximum action value:

\delta_t = r_{t+1} - r_t + \max_{a_{t+1}}Q(s_{t+1}, a_{t+1}|w)-Q(s_t, a_t|w)
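
A minimal sketch of one such update step with a linear action-value model Q(s, a|w) = w . x(s, a), using the differential (average-reward) TD error shown above; the feature function x, the step sizes, and the running average-reward estimate r_bar are illustrative assumptions:

```python
import numpy as np

def q_learning_step(w, r_bar, x, s_t, a_t, r_tp1, s_tp1, n_actions,
                    alpha=0.1, beta=0.01, epsilon=0.1, rng=np.random.default_rng()):
    """One semi-gradient Q-learning update with the differential TD error
    delta = r_{t+1} - r_bar + max_a Q(s_{t+1}, a) - Q(s_t, a_t), where
    Q(s, a|w) = w . x(s, a) and x is an assumed feature function."""
    q_next = np.array([w @ x(s_tp1, a) for a in range(n_actions)])
    delta = r_tp1 - r_bar + q_next.max() - w @ x(s_t, a_t)
    w = w + alpha * delta * x(s_t, a_t)   # semi-gradient step on the weights
    r_bar = r_bar + beta * delta          # track the average-reward estimate
    # The behavior policy stays epsilon-greedy with respect to Q(s_{t+1}, .|w).
    a_tp1 = rng.integers(n_actions) if rng.random() < epsilon else int(q_next.argmax())
    return w, r_bar, a_tp1

# Toy usage: one-hot features over (state, action) pairs for 2 states, 2 actions.
x = lambda s, a: np.eye(4)[2 * s + a]
w, r_bar, a_next = q_learning_step(np.zeros(4), 0.0, x, s_t=0, a_t=1,
                                   r_tp1=1.0, s_tp1=1, n_actions=2)
print(w, r_bar, a_next)
```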

3. Bellman error

The parameters of the value model have so far been learned from the VE loss, the squared difference between the cumulative return and the estimated value. That is reasonable for the MC algorithm, but it is not entirely suitable for TD algorithms. The following loss is therefore defined from the Bellman equation:

BE=\sum_s \mu(s)\Big[\sum_{a} \pi(a|s)\sum_{s',r} p(s',r|s,a)\big(r+\gamma v(s'|w)\big)-v(s|w)\Big]^2
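
Computing BE exactly requires the transition model p(s',r|s,a), so it can only be evaluated when the model is known; a toy sketch on a hypothetical two-state MDP with one weight per state as the "value model":

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: model[s][a] = list of (prob, s_next, reward).
model = {
    0: {0: [(1.0, 1, 0.0)], 1: [(0.5, 0, 1.0), (0.5, 1, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.0)]},
}
pi = np.array([[0.6, 0.4], [0.3, 0.7]])   # target policy pi(a|s)
mu = np.array([0.5, 0.5])                 # state weighting mu(s)
gamma = 0.9

def v(s, w):
    return w[s]                           # toy "value model": one weight per state

def bellman_error(w):
    be = 0.0
    for s in model:
        expected = sum(pi[s, a] * p * (r + gamma * v(s2, w))
                       for a in model[s]
                       for p, s2, r in model[s][a])
        be += mu[s] * (expected - v(s, w)) ** 2
    return be

print(bellman_error(np.array([1.0, 2.0])))
```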

In the TD(0) case, the expression inside the square brackets of BE is exactly the expected TD error, so the Bellman error is the squared expectation of the TD error at each state. A closely related objective, which averages the squared TD error instead, is the mean squared TD error:

TDE=\sum_s \mu(s) E_\pi [\delta_t^2|s]=\sum_s \mu(s) E_b [\rho_t \delta_t^2|s]

\delta_t = r_{t+1} + \gamma v(s_{t+1}|w) - v(s_t|w)
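
Unlike BE, TDE can be estimated directly from behavior-policy samples by weighting each squared TD error with \rho_t; a minimal sketch, where the sample tuples and the value model are illustrative assumptions:

```python
import numpy as np

def estimate_tde(samples, v, w, gamma=0.9):
    """Sample-based estimate of TDE from behavior-policy data. Each sample is
    (s_t, r_tp1, s_tp1, rho_t) with rho_t = pi(A_t|S_t) / b(A_t|S_t)."""
    total = 0.0
    for s_t, r_tp1, s_tp1, rho_t in samples:
        delta_t = r_tp1 + gamma * v(s_tp1, w) - v(s_t, w)
        total += rho_t * delta_t ** 2
    return total / len(samples)

# Toy usage with a one-weight-per-state value model and two made-up samples.
v = lambda s, w: w[s]
print(estimate_tde([(0, 1.0, 1, 1.5), (1, 0.0, 0, 0.8)], v, np.array([1.0, 2.0])))
```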

Origin blog.csdn.net/tostq/article/details/131193676