The reinforcement learning methods discussed in the previous chapters all represent the value function as a table and look up a state's value by indexing into that table. But when the state-action space is huge and most state-action pairs are rarely or never visited, this table lookup becomes extremely inefficient.

Therefore, in this section the value function is modeled as a parametric model $\hat v(s, \mathbf{w})$, where $\mathbf{w}$ is the parameter vector of the value-estimation model; the state $s$ is the model's input, and the model outputs the value estimate for that state.
1. Supervised Learning
So how do we learn this model? Its job is to fit the state-value function $v_\pi(s)$, the expected return obtained from state $s$ when following the policy $\pi$. To fit this function, a supervised-learning approach is used: define the following learning loss, the Mean Squared Value Error ($\overline{VE}$):

$$\overline{VE}(\mathbf{w}) \doteq \sum_{s \in \mathcal{S}} \mu(s)\,\big[v_\pi(s) - \hat v(s, \mathbf{w})\big]^2$$

In the formula above, $\mu(s)$ denotes the probability of encountering state $s$, satisfying $\sum_s \mu(s) = 1$. In episodic tasks it can be derived from $\eta(s)$, the average number of times state $s$ is visited in a single episode:

$$\eta(s) = h(s) + \sum_{\bar s} \eta(\bar s) \sum_a \pi(a \mid \bar s)\, p(s \mid \bar s, a), \qquad \mu(s) = \frac{\eta(s)}{\sum_{s'} \eta(s')}$$

where $h(s)$ denotes the probability that state $s$ is the initial state of an episode.
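Concretely, $\overline{VE}$ is just a $\mu$-weighted mean squared error. The tiny three-state numbers below are made up purely for illustration:

```python
import numpy as np

# Hypothetical 3-state example: mu is the on-policy state distribution,
# v_pi the true state values, v_hat the approximator's current estimates.
mu = np.array([0.5, 0.3, 0.2])      # occurrence probabilities, sum to 1
v_pi = np.array([1.0, 0.0, -1.0])   # true values under the policy
v_hat = np.array([0.9, 0.1, -0.8])  # current approximation

# Mean squared value error, weighted by how often each state occurs.
ve = np.sum(mu * (v_pi - v_hat) ** 2)
print(round(ve, 4))  # prints 0.016
```

States that occur more often (larger $\mu(s)$) contribute more to the loss, so the approximator spends its limited capacity where it matters.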
2. Stochastic-gradient and Semi-gradient Methods
The model parameters are optimized with stochastic gradient descent (SGD):

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \big[v_\pi(S_t) - \hat v(S_t, \mathbf{w}_t)\big] \nabla \hat v(S_t, \mathbf{w}_t)$$

In the formula above, $v_\pi(S_t)$ denotes the state value under the decision function (policy) $\pi$. Since it is unknown, we can use the MC method: sample complete episodes and substitute the cumulative reward $G_t$:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \big[G_t - \hat v(S_t, \mathbf{w}_t)\big] \nabla \hat v(S_t, \mathbf{w}_t)$$
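This Monte Carlo variant can be sketched as follows, assuming a linear approximator $\hat v(s,\mathbf{w}) = \mathbf{w}^\top x(s)$ so that $\nabla \hat v(s,\mathbf{w}) = x(s)$; the feature map `x`, the episode format, and the step size are illustrative assumptions, not from the text:

```python
import numpy as np

def gradient_mc_update(w, episode, x, alpha=0.1, gamma=1.0):
    """Gradient Monte Carlo prediction sketch for a linear v_hat.

    episode: list of (state, reward) pairs, in time order;
    x: feature map state -> np.ndarray. Returns the updated weights.
    """
    G = 0.0
    # Walk backwards so the return G_t can be accumulated incrementally.
    for s, r in reversed(episode):
        G = r + gamma * G
        # SGD rule: w <- w + alpha * (G_t - v_hat(S_t, w)) * grad v_hat,
        # where grad v_hat(S_t, w) = x(S_t) for a linear model.
        w = w + alpha * (G - w @ x(s)) * x(s)
    return w

# Usage with hypothetical one-hot features over two states:
x = lambda s: np.eye(2)[s]
w = gradient_mc_update(np.zeros(2), [(0, 1.0), (1, 0.0)], x)
```

Updating backwards through the episode slightly differs from precomputing every $G_t$ first, but each individual update is still the gradient MC rule above.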
Another method is to bootstrap, as in the TD or DP algorithms: replace the sampled cumulative return with an estimated target $U_t$ that itself depends on the current parameters $\mathbf{w}$. Because the gradient of the target term is ignored, this approach is called a semi-gradient method.
- Dynamic programming: $U_t = \sum_a \pi(a \mid S_t) \sum_{s', r} p(s', r \mid S_t, a)\big[r + \gamma \hat v(s', \mathbf{w}_t)\big]$
- TD(0): $U_t = R_{t+1} + \gamma \hat v(S_{t+1}, \mathbf{w}_t)$
- TD(n): $U_t = G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat v(S_{t+n}, \mathbf{w}_{t+n-1})$
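A minimal sketch of the semi-gradient TD(0) update with a linear approximator; the feature map `x` and the hyperparameters are illustrative assumptions. Note the bootstrapped target is treated as a constant, which is exactly what makes this "semi"-gradient:

```python
import numpy as np

def semi_gradient_td0(w, s, r, s_next, x, done, alpha=0.1, gamma=0.9):
    """One semi-gradient TD(0) step for linear v_hat(s, w) = w . x(s)."""
    # Bootstrapped target U_t = R + gamma * v_hat(S', w); its dependence
    # on w is ignored when differentiating (semi-gradient).
    target = r if done else r + gamma * (w @ x(s_next))
    td_error = target - w @ x(s)
    # Only grad v_hat(S, w) = x(S) appears in the update.
    return w + alpha * td_error * x(s)

# Usage with hypothetical one-hot features over two states:
x = lambda s: np.eye(2)[s]
w = semi_gradient_td0(np.zeros(2), 0, 1.0, 1, x, done=False)
```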
3. Episodic Semi-gradient Control
We have discussed how to estimate the value function with a model. Next, it is straightforward to combine it with the GPI strategy to construct a two-step reinforcement-learning loop of value estimation and policy improvement. The TD(0) on-policy Sarsa update is:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \big[R_{t+1} + \gamma \hat q(S_{t+1}, A_{t+1}, \mathbf{w}_t) - \hat q(S_t, A_t, \mathbf{w}_t)\big] \nabla \hat q(S_t, A_t, \mathbf{w}_t)$$
Similarly, TD(n) on-policy Sarsa replaces the one-step target with the n-step return:

$$\mathbf{w}_{t+n} = \mathbf{w}_{t+n-1} + \alpha \big[G_{t:t+n} - \hat q(S_t, A_t, \mathbf{w}_{t+n-1})\big] \nabla \hat q(S_t, A_t, \mathbf{w}_{t+n-1})$$

As can be seen, the main change is that the value update of the original tabular method is replaced by a parameter update of the value-estimation model.
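A sketch of one episodic semi-gradient Sarsa step for a linear $\hat q(s,a,\mathbf{w}) = \mathbf{w}^\top x(s,a)$, together with an ε-greedy action selection for the policy-improvement half of GPI; the feature map `x(s, a)` and ε are illustrative assumptions:

```python
import numpy as np

def sarsa_update(w, s, a, r, s2, a2, x, done, alpha=0.1, gamma=0.9):
    """Semi-gradient Sarsa: target uses the *next* on-policy action a2."""
    target = r if done else r + gamma * (w @ x(s2, a2))
    return w + alpha * (target - w @ x(s, a)) * x(s, a)

def epsilon_greedy(w, s, actions, x, eps=0.1, rng=np.random):
    """Policy improvement: mostly greedy w.r.t. q_hat, explore with prob eps."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: w @ x(s, a))

# Usage with hypothetical one-hot features over 2 states x 2 actions:
nS, nA = 2, 2
x = lambda s, a: np.eye(nS * nA)[s * nA + a]
w = sarsa_update(np.zeros(nS * nA), 0, 1, 1.0, 1, 0, x, done=False)
a = epsilon_greedy(w, 0, [0, 1], x, eps=0.0)  # greedy pick
```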
4. Average Reward: Continuing Tasks
When computing the cumulative return earlier, a discount factor $\gamma$ was introduced, for two main reasons: first, to prevent the cumulative return from diverging; second, to reflect the assumption that recent rewards matter more. However, in continuing tasks (no start state and no terminal state), the latter assumption becomes problematic: in a steady equilibrium state, such as a system swinging around a balance point, discounting throws away information about future states. The average-reward formulation is an alternative that likewise avoids a divergent cumulative return.
First define the average reward under the policy $\pi$:

$$r(\pi) \doteq \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h} \mathbb{E}\big[R_t \mid S_0, A_{0:t-1} \sim \pi\big] = \sum_s \mu_\pi(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$

Furthermore, the differential cumulative return is defined, in which each reward is measured relative to this average:

$$G_t \doteq (R_{t+1} - r(\pi)) + (R_{t+2} - r(\pi)) + (R_{t+3} - r(\pi)) + \cdots$$
Its TD(n) form can be defined as (with $\bar R$ the current estimate of $r(\pi)$):

$$G_{t:t+n} \doteq (R_{t+1} - \bar R) + \cdots + (R_{t+n} - \bar R) + \hat q(S_{t+n}, A_{t+n}, \mathbf{w}_{t+n-1})$$

The average-reward estimate $\bar R$ can be iterated from the TD error $\delta_t$ as follows:

$$\delta_t \doteq G_{t:t+n} - \hat q(S_t, A_t, \mathbf{w}), \qquad \bar R \leftarrow \bar R + \beta\, \delta_t$$
At this point, the on-policy Sarsa algorithm based on TD(n) performs, at each step, the average-reward update above together with the semi-gradient weight update:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha\, \delta_t\, \nabla \hat q(S_t, A_t, \mathbf{w})$$
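Putting the pieces together, one step of differential semi-gradient Sarsa might look like the sketch below, shown in its one-step form for brevity (the TD(n) version accumulates n differential rewards in the target); the feature map and the step sizes $\alpha$, $\beta$ are illustrative assumptions:

```python
import numpy as np

def differential_sarsa_update(w, r_bar, s, a, r, s2, a2, x,
                              alpha=0.1, beta=0.01):
    """One-step differential semi-gradient Sarsa for continuing tasks.

    The TD error subtracts the running average-reward estimate r_bar
    instead of discounting future values. Returns (w, r_bar) updated.
    """
    # Differential TD error: reward relative to the average, plus the
    # bootstrapped difference of action values (no gamma).
    delta = r - r_bar + w @ x(s2, a2) - w @ x(s, a)
    r_bar = r_bar + beta * delta          # track the average reward
    w = w + alpha * delta * x(s, a)       # semi-gradient weight update
    return w, r_bar

# Usage with hypothetical one-hot features over 2 states x 2 actions:
x = lambda s, a: np.eye(4)[s * 2 + a]
w, r_bar = differential_sarsa_update(np.zeros(4), 0.0, 0, 0, 1.0, 1, 1, x)
```

Both $\mathbf{w}$ and $\bar R$ are learned online from the same TD error, which is why no terminal state or discount factor is needed.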