Reinforcement Learning Study Notes 09-10: On-policy Methods with Approximation

The reinforcement learning methods discussed in the previous chapters all model the value function v(s) as a table and look up the value of a specific state directly. But when the state-action space is huge and most state-action pairs are rarely meaningful, such table lookup is extremely inefficient.

Therefore, in this section the value function is modeled as a parametric model v(s|w), where w is the parameter vector of the value-estimation model and the state s is its input; the model outputs the value estimate of that state.

1. Supervised Learning

So how do we learn this model? Its job is to fit the value of each state, which can be expressed as the expected return under the action-decision policy π, G_\pi(s). To fit this function, supervised learning is used, with the following loss, the Prediction Objective (VE):

\bar{VE}(w)=\sum_s \mu (s)[G_\pi (s)-v(s|w)]^2

In the above formula, \mu(s) is the probability of encountering state s, satisfying \sum_s \mu(s)=1. Let \eta(s) denote the expected number of visits to state s in a single episode, and h(s) the probability that an episode starts in state s; then:

\eta(s)=h(s)+\sum_{s'}\eta(s')\sum_a \pi(a|s')p(s|s',a)

\mu(s)=\frac{\eta (s)}{\sum_{s'}\eta(s')}
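
As a toy illustration (a minimal sketch with made-up numbers, not from the original notes), the recursion for \eta(s) can be solved as a linear system and then normalized to obtain \mu(s):

```python
import numpy as np

# Toy episodic MDP with 3 non-terminal states (all numbers are illustrative assumptions).
# P[s_prev, s] = sum_a pi(a|s_prev) * p(s|s_prev, a): on-policy transition probability
# from s_prev to s; the remaining mass in each row goes to the terminal state.
P = np.array([
    [0.0, 0.5, 0.3],
    [0.1, 0.0, 0.6],
    [0.0, 0.2, 0.0],
])
h = np.array([0.7, 0.2, 0.1])          # start-state distribution h(s)

# eta(s) = h(s) + sum_{s'} eta(s') * P[s', s]  <=>  (I - P^T) eta = h
eta = np.linalg.solve(np.eye(3) - P.T, h)
mu = eta / eta.sum()                   # mu(s) = eta(s) / sum_{s'} eta(s')
print("eta:", eta, "mu:", mu)
```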

2. Stochastic-gradient and Semi-gradient Methods

The model parameters are optimized with stochastic gradient descent (SGD):

w_{t+1}=w_t - \frac{1}{2}\alpha \partial_w[G_\pi(s)-v(s|w)]^2=w_t + \alpha[G_\pi(s)-v(s|w)]\partial_w v(s|w)

In the above formula, G_\pi(s) is the state value under the policy \pi; it can be computed with the Monte Carlo (MC) method by sampling cumulative rewards.
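
Below is a minimal sketch of gradient Monte Carlo prediction with a linear model v(s|w) = wᵀx(s). The environment interface (env.reset / env.step returning (next_state, reward, done)) and the feature map `features` are assumptions for illustration, not something specified in the notes.

```python
import numpy as np

def gradient_mc_prediction(env, policy, features, num_features,
                           episodes=1000, alpha=0.01, gamma=1.0):
    w = np.zeros(num_features)
    for _ in range(episodes):
        # Generate one episode under the policy pi.
        s, done = env.reset(), False
        trajectory = []                      # list of (state, reward) pairs
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            trajectory.append((s, r))
            s = s_next
        # Walk backwards to accumulate the return G_t, then apply the SGD update
        # w <- w + alpha * [G_t - v(s|w)] * grad_w v(s|w); the gradient is x(s) here.
        G = 0.0
        for s_t, r_t1 in reversed(trajectory):
            G = r_t1 + gamma * G
            x = features(s_t)
            w += alpha * (G - w @ x) * x
    return w
```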

Another method is to use bootstrapping, as in the TD or DP algorithms, replacing the sampled return G_\pi(s) with an estimated target U(s). Such approaches are called semi-gradient methods; common targets are listed below (a semi-gradient TD(0) sketch follows the list):

  • Dynamic programming: U_t(s) = \sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)\,[r + \gamma v(s'|w)]
  • TD(0):U_t(s)=r_{t+1} + \gamma v(s_{t+1}|w)
  • TD(n):U_t(s)=\sum_{i=0}^{n-1}\gamma^{i} r_{t+i+1} + \gamma^{n} v(s_{t+n}|w)
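
Here is a minimal sketch of semi-gradient TD(0) prediction under the same assumed env/features interface as the MC sketch above. The bootstrapped target r + γ·v(s'|w) is treated as a constant, i.e., the gradient is taken only through v(s|w), which is what makes the method "semi"-gradient.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, num_features,
                      episodes=1000, alpha=0.01, gamma=1.0):
    w = np.zeros(num_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * w @ features(s_next))
            x = features(s)
            w += alpha * (target - w @ x) * x   # gradient taken only through v(s|w)
            s = s_next
    return w
```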

3. Episodic Semi-gradient Control

We have discussed how to estimate the value function with a parametric model. Combining this with GPI (generalized policy iteration), we can easily construct a two-step reinforcement-learning procedure of value estimation and policy improvement: TD(0) on-policy Sarsa, sketched below.
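
A minimal sketch of one-step episodic semi-gradient Sarsa, assuming a linear action-value model q(s,a|w) = wᵀx(s,a), a hypothetical feature map `features(s, a)`, a list of discrete action indices `actions`, and an ε-greedy policy for the policy-improvement half of GPI:

```python
import numpy as np

def epsilon_greedy(w, features, s, actions, eps=0.1):
    # Policy improvement step: act greedily w.r.t. q(s,a|w) with epsilon exploration.
    if np.random.rand() < eps:
        return np.random.choice(actions)
    q_vals = [w @ features(s, a) for a in actions]
    return actions[int(np.argmax(q_vals))]

def semi_gradient_sarsa(env, features, num_features, actions,
                        episodes=500, alpha=0.01, gamma=1.0, eps=0.1):
    w = np.zeros(num_features)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(w, features, s, actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x = features(s, a)
            if done:
                target = r
            else:
                a_next = epsilon_greedy(w, features, s_next, actions, eps)
                target = r + gamma * w @ features(s_next, a_next)
            w += alpha * (target - w @ x) * x   # semi-gradient Sarsa update
            if not done:
                s, a = s_next, a_next
    return w
```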

Likewise, TD(n) on-policy Sarsa can be expressed as follows; essentially it replaces the tabular update of Q(s,a) with a parameter update of the action-value model (a sketch of the n-step weight update follows).
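
For the n-step case only the target changes. A sketch of just the weight update (episode generation and the stored n-step reward window are as in the one-step version above; the helper name is hypothetical):

```python
def n_step_sarsa_update(w, features, alpha, gamma, rewards, s_t, a_t, s_tn=None, a_tn=None):
    """rewards = [R_{t+1}, ..., R_{t+n}]; (s_tn, a_tn) is the bootstrap pair, None at episode end."""
    G = sum((gamma ** i) * r for i, r in enumerate(rewards))       # sum_{i=0}^{n-1} gamma^i R_{t+i+1}
    if s_tn is not None:
        G += (gamma ** len(rewards)) * (w @ features(s_tn, a_tn))  # + gamma^n q(S_{t+n}, A_{t+n}|w)
    x = features(s_t, a_t)
    return w + alpha * (G - w @ x) * x                             # semi-gradient update on q(S_t, A_t|w)
```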

4. Average Reward: Continuing Tasks 

When computing the cumulative return G_t earlier, a discount factor \gamma was introduced for two main reasons: to keep the cumulative return from diverging, and to reflect the assumption that recent rewards matter more. In a continuing task (no start state and no terminal state), however, the latter assumption is problematic: in a steady equilibrium (e.g., balanced-swing) state, discounting loses information about future states. The Average Reward formulation is an alternative that also avoids divergence of the cumulative return.

First, define the average reward under policy \pi:

r(\pi)=\sum_s \mu_{\pi}(s)\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)\,r

The differential return G_t is then defined as:

G_t=r_{t+1}-r(\pi)+r_{t+2}-r(\pi)+\dots = \sum_{i=1}^{\infty} (r_{t+i}-r(\pi))

Its TD(n) (n-step bootstrapped) form can be defined as:

G_t=\sum_{i=1}^n(r_{t+i}-r(\pi))+v(s_{t+n}|w)

\delta_t=G_t-v(s_t|w)=\sum_{i=1}^n(r_{t+i}-r(\pi))+v(s_{t+n}|w)-v(s_t|w)

The average-reward estimate can be updated iteratively as follows:

r_{t+1}(\pi)=\frac{1}{t+1}\sum_{i=1}^{t+1} r_i =r_{t}(\pi) + \frac{1}{t+1}(r_{t+1}-r_{t}(\pi))\\ \approx r_{t}(\pi) + \beta \left(\sum_{i=1}^{n}[r_{t+i} - r_{t}(\pi)]+v(s_{t+n}|w)-v(s_t|w)\right)

where the second line replaces the sample-average step size with a constant \beta and uses the n-step differential TD error \delta_t as the increment.

The resulting on-policy Sarsa algorithm based on TD(n) with average reward is described as follows (a one-step differential version is sketched below; the n-step version substitutes the differential n-step return):
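
A minimal sketch of one-step differential (average-reward) semi-gradient Sarsa under the same assumed env/features interface as before; the ε-greedy helper is the one sketched in Section 3.

```python
import numpy as np

def differential_semi_gradient_sarsa(env, features, num_features, actions,
                                     steps=100_000, alpha=0.01, beta=0.01, eps=0.1):
    w = np.zeros(num_features)
    r_bar = 0.0                                            # running estimate of r(pi)
    s = env.reset()
    a = epsilon_greedy(w, features, s, actions, eps)       # as defined in the Sarsa sketch above
    for _ in range(steps):                                 # continuing task: no terminal state
        s_next, r, _ = env.step(a)
        a_next = epsilon_greedy(w, features, s_next, actions, eps)
        # differential TD error: delta = R - r_bar + q(S',A'|w) - q(S,A|w)
        delta = r - r_bar + w @ features(s_next, a_next) - w @ features(s, a)
        r_bar += beta * delta                              # average-reward update with step size beta
        w += alpha * delta * features(s, a)                # semi-gradient weight update
        s, a = s_next, a_next
    return w, r_bar
```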


Origin blog.csdn.net/tostq/article/details/131185674