[Learning] Q-learning, Q-learning for continuous actions, conjectures about deep learning


1. Q-learning

In a value-based approach, the critic does not directly decide which action to take. Given an actor π, it evaluates how good that actor is. State-value function Vπ(s): the cumulative reward the agent is expected (predicted) to obtain after visiting state s, when it follows actor π.
The critic is tied to a particular actor: it predicts how well that actor will do.

Evaluate the state-value function Vπ(s)

MC

Monte Carlo (MC): the game has to be played to the end before the cumulative reward is known and the network parameters can be updated.

TD

MC has a lot of randomness, so its estimates have large variance; TD has smaller variance, but the value estimate V it bootstraps from is not necessarily accurate.
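A minimal sketch of the two targets for Vπ, assuming a PyTorch value network `V` and a discount factor γ (the names here are illustrative, not from the original post): MC regresses Vπ(s_t) toward the observed return G_t, while TD regresses it toward r_t + γ·Vπ(s_{t+1}).

```python
import torch

def mc_targets(rewards, gamma=0.99):
    """Monte Carlo targets: full discounted returns G_t,
    only computable after the episode has finished."""
    G, targets = 0.0, []
    for r in reversed(rewards):          # accumulate from the last step backwards
        G = r + gamma * G
        targets.append(G)
    return list(reversed(targets))

def td_target(V, r_t, s_next, gamma=0.99):
    """Temporal-difference target: bootstrap from the current estimate V(s_{t+1});
    can be computed after every single step."""
    with torch.no_grad():                # the bootstrap term is treated as a constant
        return r_t + gamma * V(s_next)
```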
The state-action-value function Qπ(s, a): the cumulative reward the agent expects to obtain after taking action a in state s, when it then follows actor π.
When the actions cannot be enumerated, the left architecture (the network takes both s and a as input and outputs a scalar) is used; the right architecture (the network takes s and outputs one Q value per action) only works for discrete actions.
Repeating this loop (estimate Qπ for the current actor π, then derive a better actor π' from it) eventually yields a good Q function and a good policy.
π' takes, in every state, the action that maximizes Q: π'(s) = arg maxₐ Qπ(s, a). This works when a is discrete, but is not suitable when a is continuous.
In fact Q appears in its own regression target, so if everything were updated at once the target would keep changing (we would forever be chasing a moving Q'). We therefore fix a target network Q' and update only the other network Q; after Q has been updated many times, we copy it into Q' once.
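A minimal sketch of the fixed-target trick, assuming PyTorch and hypothetical names (`q_net`, `target_net`, and an `optimizer` defined elsewhere): only Q receives gradients, and its weights are copied into the frozen Q' every `copy_every` updates.

```python
import copy
import torch
import torch.nn.functional as F

target_net = copy.deepcopy(q_net)          # Q': a frozen copy of Q
copy_every = 500                           # Q updates between copies into Q'

def dqn_update(step, batch, gamma=0.99):
    s, a, r, s_next = batch
    with torch.no_grad():                  # target is computed with the fixed network Q'
        target = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, target)        # regress Q(s, a) toward the fixed target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % copy_every == 0:             # periodically refresh Q' from Q
        target_net.load_state_dict(q_net.state_dict())
```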
An action that has never been tried keeps its initial Q estimate. Once a single action has been tried and, say, its Q becomes 1 while the others stay 0, greedy selection will keep choosing it and the other actions will never be explored.
Epsilon greedy: with probability 1 − ε the action with the largest Q is taken, and with probability ε a random action is taken. ε is generally a small value; it is relatively large at the start of training and decays later. This adds noise in the action space.
Boltzmann Exploration: exponentiate the Q values and normalize (divide by the sum of the exponentiated Q values) to obtain a probability distribution over actions, then sample from it. At the start of training the Q values are close together, so the resulting distribution may be fairly uniform.
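A sketch of both exploration rules, under the assumption that `q_values` is a 1-D tensor holding Q(s, a) for every discrete action (illustrative code, not from the post):

```python
import torch

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if torch.rand(1).item() < epsilon:
        return torch.randint(len(q_values), (1,)).item()
    return int(q_values.argmax())

def boltzmann(q_values, temperature=1.0):
    """Sample an action from softmax(Q / T); larger Q means higher probability."""
    probs = torch.softmax(q_values / temperature, dim=0)
    return int(torch.multinomial(probs, 1))
```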

All collected transitions are put into a replay buffer; each entry stores (s, a, r, s'). The buffer may contain data generated by different (older) policies, and the oldest data is discarded when it is full.
Benefits: a sampled batch contains diverse data, which makes training more stable. Although some data was generated by older policies, this is not a problem in practice; it makes the method similar to an off-policy approach, and far less time is spent collecting data.
Correction to the figure above: the batch is sampled from the replay buffer.
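A minimal replay-buffer sketch (illustrative names; the post gives no code): transitions (s, a, r, s') are appended, the oldest are dropped once the buffer is full, and training batches are sampled uniformly.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.data = deque(maxlen=capacity)   # oldest transitions are discarded automatically

    def push(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size):
        batch = random.sample(self.data, batch_size)   # uniform sampling
        return list(zip(*batch))                       # (states, actions, rewards, next_states)
```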

double DQN

DQN tends to overestimate Q values: the target takes the maximum over all actions, so whichever action happens to be overestimated is exactly the one selected into the target.
Double DQN uses one network to choose the action and a different one to evaluate it. If Q overestimates an action and therefore selects it, Q' will still assign it an appropriate value.
What if Q' overestimates an action? That does not matter, because Q will not choose that action.
In practice, the network Q whose parameters are being updated selects the action, and the target network Q' with fixed parameters computes its value.
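A sketch of the Double DQN target, reusing the hypothetical `q_net` / `target_net` names from the earlier sketch: the online network picks the arg-max action, the target network supplies its value.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, gamma=0.99):
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # Q (updated) selects the action
        value = target_net(s_next).gather(1, best_a).squeeze(1)   # Q' (fixed) evaluates it
    return r + gamma * value
```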

dueling DQN

The only change is the structure of the network: Q(s, a) is decomposed into a state value plus an advantage, Q(s, a) = V(s) + A(s, a).

When V(s) is updated, the Q values of all actions in that state change at once, so learning is efficient: even actions that were not sampled get updated. (To keep the network from learning everything through A and ignoring V, A is constrained, e.g. forced to sum to zero over the actions.)
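A sketch of the dueling architecture (hypothetical class name `DuelingQNet`, PyTorch): the network splits into a scalar V(s) stream and an A(s, a) stream, and the mean advantage is subtracted so the decomposition stays identifiable.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # scalar V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a) for every action

    def forward(self, s):
        h = self.features(s)
        v = self.value(h)
        a = self.advantage(h)
        # subtracting the mean advantage prevents the network from ignoring V
        return v + a - a.mean(dim=1, keepdim=True)
```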

prioritized replay

Not all data in the buffer is equally important; some of it has not been learned well yet, so it should not be sampled uniformly. Data that had a larger TD error in previous training is given a higher probability of being sampled.
This changes not only the distribution of the sampled data but also the training process (the update has to be adjusted accordingly).
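A sketch of priority-proportional sampling, assuming each stored transition keeps its last TD error (illustrative only; a full implementation would also apply importance-sampling weights):

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-5):
    """Sample buffer indices with probability proportional to |TD error|^alpha."""
    priorities = (np.abs(td_errors) + eps) ** alpha   # eps keeps every item sampleable
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```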

multi-step

The balance between MC and TD
Combining the two methods: the agent samples N real steps before bootstrapping with the estimated value, so the estimated part has less influence. Because N reward terms are summed, the variance becomes larger; N can be tuned to balance the two.
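A sketch of the N-step target, reusing the hypothetical `target_net` from earlier: N real rewards are accumulated before bootstrapping from Q at step t+N.

```python
import torch

def n_step_target(rewards, s_after_n, target_net, gamma=0.99):
    """rewards = [r_t, ..., r_{t+N-1}]; bootstrap from the state N steps later."""
    g = 0.0
    for k, r in enumerate(rewards):           # sum of N real (sampled) rewards
        g += (gamma ** k) * r
    with torch.no_grad():                     # estimated tail, weighted by gamma^N
        g += (gamma ** len(rewards)) * target_net(s_after_n).max(dim=1).values
    return g
```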

noisy net

Noisy Net adds noise in the parameter space: noise is added to the parameters of the Q network to obtain a noisy network Q̃.
Noise is added to every parameter. It is injected at the start of each episode, before any action is taken, and the resulting Q̃ (with its noise fixed) is the network used during that episode.
Within the same episode the parameters of Q̃ stay fixed; only at the next episode are the parameters updated and the noise resampled.
Noise on actions: given the same state, the agent may take different actions (as with epsilon greedy, the randomness lives in the action choice). No real policy works like this; we would prefer the same state to lead to the same action.
Noise on parameters: given the same (or similar) state, the agent takes the same action → state-dependent exploration, i.e. exploring in a consistent way. Within one episode the network parameters do not change, so the same state always produces the same output.
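A minimal sketch of episode-level parameter noise, assuming a PyTorch `q_net` (illustrative; real Noisy Nets learn the noise scale rather than using a fixed sigma):

```python
import copy
import torch

def make_noisy_copy(q_net, sigma=0.05):
    """Return Q~: a copy of Q with Gaussian noise added to every parameter.
    Call once at the start of each episode and keep it fixed until the episode ends."""
    noisy_net = copy.deepcopy(q_net)
    with torch.no_grad():
        for p in noisy_net.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy_net
```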

distributional

The state-action-value function Qπ(s, a) is only an expectation: the expected cumulative reward obtained after seeing observation s and taking a, when following actor π.
Different distributions can have the same expectation, so keeping only the expectation loses some information.
Distributional Q-learning instead outputs, for each action, a distribution over the return (e.g. 5 bins per action, with the probabilities of each action's bins summing to 1). Between actions with the same expectation we can then choose the less risky one.
Because the bins cover only a finite range, extreme values are discarded, so the return may be underestimated in some cases.
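A sketch of a categorical (distributional) Q head, with hypothetical names: for each action the network outputs a softmax over a fixed set of return bins, and the ordinary scalar Q value is recovered as the expectation over that support.

```python
import torch
import torch.nn as nn

class CategoricalQHead(nn.Module):
    def __init__(self, hidden, n_actions, n_bins=5, v_min=-10.0, v_max=10.0):
        super().__init__()
        self.n_actions, self.n_bins = n_actions, n_bins
        self.out = nn.Linear(hidden, n_actions * n_bins)
        # fixed support of possible returns; anything outside [v_min, v_max] is lost
        self.register_buffer("support", torch.linspace(v_min, v_max, n_bins))

    def forward(self, h):
        logits = self.out(h).view(-1, self.n_actions, self.n_bins)
        probs = torch.softmax(logits, dim=-1)        # each action's bins sum to 1
        q_values = (probs * self.support).sum(-1)    # expectation recovers Q(s, a)
        return probs, q_values
```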

rainbow

Mix all methods
After removing each method in turn:

Double DQN was added to avoid overestimating rewards

2. Q-learning for continuous actions

The Q function is relatively easy to estimate, and once it has been learned a good policy can be derived from it. The problem: continuous actions are not easy to handle.
Previously the actions were discrete, so all of them could be enumerated to find the best a.
Method 1: sample N actions and take the one with the largest Q. But not all actions can be sampled, so the result may not be accurate.
Method 2: treat a as the variable and solve the maximization by gradient ascent. Problems: it may not reach the global maximum, the computation is heavy, and the action has to be updated iteratively every time one is chosen.
Method 3: design the Q-network so that the maximization is easy. The network still takes s and a as input and outputs a value. It first takes s and produces three things: a vector μ(s), a matrix Σ(s), and a scalar V(s); the continuous action vector a is then combined with them as Q(s, a) = −(a − μ(s))ᵀ Σ(s) (a − μ(s)) + V(s). Since Σ(s) is positive definite, the quadratic term is never positive, so Q is maximized exactly when a = μ(s) (μ(s) plays the role of a Gaussian mean and Σ(s) of a positive-definite covariance). The network does not output Σ(s) directly; it outputs a matrix that is multiplied by its own transpose, which guarantees positive definiteness. A sketch of this architecture follows below.
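A sketch of such a network (class name and layer sizes are illustrative): it outputs μ(s), a lower-triangular matrix L(s) used to build Σ(s) = L Lᵀ, and V(s), so arg maxₐ Q(s, a) = μ(s) in closed form.

```python
import torch
import torch.nn as nn

class ContinuousQNet(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)                        # mu(s): the maximizing action
        self.l_entries = nn.Linear(hidden, action_dim * action_dim)   # entries of L(s)
        self.v = nn.Linear(hidden, 1)                                  # V(s)
        self.action_dim = action_dim

    def forward(self, s, a):
        h = self.body(s)
        mu = self.mu(h)
        L = self.l_entries(h).view(-1, self.action_dim, self.action_dim).tril()
        sigma = L @ L.transpose(1, 2)          # Sigma = L L^T, never negative definite
        diff = (a - mu).unsqueeze(-1)
        # Q(s, a) = -(a - mu)^T Sigma (a - mu) + V(s); maximized exactly at a = mu(s)
        quad = (diff.transpose(1, 2) @ sigma @ diff).squeeze(-1)
        return -quad + self.v(h)
```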

3. Conjectures about deep learning

Almost all local minima have very similar losses to the global optimum, so finding a local minimum is sufficient.
When we hit a critical point, it could be a saddle point or a local minimum.
If the Hessian at the critical point has some positive and some negative eigenvalues, it is a saddle point; if all eigenvalues are positive, it is a local minimum.
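A small sketch of this eigenvalue test on a toy two-parameter loss (illustrative; `torch.autograd.functional.hessian` is used to get the Hessian at the critical point):

```python
import torch
from torch.autograd.functional import hessian

def loss(w):
    # f(w1, w2) = w1^2 - w2^2 has a critical point at the origin
    return w[0] ** 2 - w[1] ** 2

w0 = torch.zeros(2)                        # the critical point
H = hessian(loss, w0)
eigvals = torch.linalg.eigvalsh(H)         # eigenvalues of the (symmetric) Hessian
if (eigvals > 0).all():
    print("local minimum", eigvals)
elif (eigvals < 0).all():
    print("local maximum", eigvals)
else:
    print("saddle point", eigvals)         # mixed signs -> saddle point
```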
When the error E is relatively small, the eigenvalues are likely to all be positive, so the critical point is likely a local minimum. Saddle points tend to appear where the loss is high, while local minima tend to appear where the loss is low.
When the critical point is more like a local minimum, the training error is closer to 0.
Under seven assumptions, deep learning behaves the same as the spin glass model.
The larger the model, the more the losses of its local minima concentrate around a low value.
If the size of the network is large enough, we can find the global optimal solution by gradient descent, independent of initialization.


Origin blog.csdn.net/Raphael9900/article/details/128588521