[Deep learning] Reinforcement learning

The pictures in this article come from Mofan Python's videos. The videos are really good, you can finish them in a day, and anyone interested can follow the link below.

https://space.bilibili.com/243821484/#/channel/detail?cid=26359


Come on, let's talk about reinforcement learning together~

Reinforcement learning, to put it simply, rewards what is done right and punishes what is done wrong. It is as simple and blunt as that. At its core, the reward and punishment signal can also be pushed back through the model with gradient descent.

My biggest impression is that, while searching for the optimal solution, feedback arrives either step by step (you take one step and are told whether it was right, then another step) or only after the whole walk is finished, and then you update. These algorithms keep a table that tells you how good each choice of direction is; the predicted values come from previous experience, and the table itself is updated according to the quality of each step or of the final result.

Let's take a look at a few classic reinforcement learning algorithms together~

1. Q-learning:

A Q-table records the score of each selected path (state-action pair).


To make the formula in the picture clearer, I will run through the calculation once with concrete numbers.

Q reality = r + gamma * max Q(s2, a') = 0 + 0.9 * 2 = 1.8

Q estimate = Q(s1, a2) = 1

New Q = Q estimate + alpha * (Q reality - Q estimate) = 1 + alpha * (1.8 - 1)

Then we can change the value for s1 and a2 in the figure from 1 to 1 + alpha * 0.8, and continue calculating...

Note that Q-learning always takes the largest value among the next state's actions to compute Q reality; even if, say, a1 is the action actually chosen at s2, my Q reality is still 1.8.
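To make the update concrete, here is a minimal Python sketch that reproduces the numbers above. The states s1, s2, the actions a1, a2, and the follow-up value 2 simply mirror the figure's Q-table, and alpha and gamma are assumed values.

```python
alpha, gamma = 0.1, 0.9            # learning rate and discount factor (assumed values)

# Stand-in Q-table matching the numbers above: Q(s1, a2) = 1, max_a Q(s2, a) = 2.
Q = {
    ("s1", "a1"): 0.0, ("s1", "a2"): 1.0,
    ("s2", "a1"): 2.0, ("s2", "a2"): 0.5,
}

def q_learning_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    q_reality = r + gamma * max(Q[(s_next, a_next)] for a_next in ("a1", "a2"))
    q_estimate = Q[(s, a)]
    Q[(s, a)] = q_estimate + alpha * (q_reality - q_estimate)
    return q_reality, q_estimate

q_reality, q_estimate = q_learning_update("s1", "a2", r=0.0, s_next="s2")
print(q_reality, q_estimate, Q[("s1", "a2")])   # 1.8  1.0  1 + alpha * 0.8
```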

2. This naturally leads to the SARSA algorithm. SARSA updates with the path it actually chooses to walk.

Reflected in the update process: Q-learning "remembers the treats but not the beatings"; because it always updates with the best follow-up value, a path keeps looking valuable once it has been selected. SARSA, on the other hand, is punished for every wrong choice it actually makes, so it becomes afraid to go that way again; even if another branch of that road might turn out to be right, once the road has been punished too heavily, the rest of it becomes very hard to reach.


Of course, to avoid this kind of "one bad experience blocks the road for good" problem, a roughly 10% random action selection (epsilon-greedy) is added to the algorithm, so that even a direction that currently looks bad may still be tried again. A sketch of both ideas follows.
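Here is a rough sketch of how SARSA and the 10% random exploration look in code. The `Q` dictionary is the same kind of table as above, and the exact epsilon, alpha, and gamma values are assumptions for illustration.

```python
import random

epsilon, alpha, gamma = 0.1, 0.1, 0.9   # ~10% random exploration, as mentioned above

def epsilon_greedy(Q, s, actions):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """SARSA uses the action actually chosen at s_next, not the max over actions."""
    q_reality = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (q_reality - Q[(s, a)])
```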


3. Combining Q-learning with a neural network gives us DQN (Deep Q-Network)

Once we combine it with a neural network, the parameters we want to record can no longer fit in a table. DQN uses two networks to record the results: one is the "reality" (target) network and the other is the "estimate" (evaluation) network. The two networks are updated asynchronously; the reality network is the slowly updated one, essentially storing the parameters from a few steps ago, which helps break the correlation between experiences.

(1) Network input: state (s) and action (a); output: value

(2) Network input: state (s); output: each action (a) and its corresponding value

DQN also has a memory bank (experience replay), which records previous experiences for learning and breaks the correlation between them.
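Below is a condensed PyTorch-style sketch of the two-network plus memory-bank idea, using variant (2) above (the network takes the state and outputs one value per action). The layer sizes, memory capacity, batch size, and the (s, a, r, s_next) tuple format are assumptions for illustration, not something fixed by the article.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class QNet(nn.Module):
    """Variant (2): input the state, output one value per action."""
    def __init__(self, n_states, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, 64), nn.ReLU(), nn.Linear(64, n_actions))
    def forward(self, s):
        return self.net(s)

n_states, n_actions, gamma = 4, 2, 0.9
eval_net = QNet(n_states, n_actions)              # "estimate" network, updated every step
target_net = QNet(n_states, n_actions)            # "reality" network, updated slowly
target_net.load_state_dict(eval_net.state_dict()) # sync again every N steps in practice
memory = deque(maxlen=2000)                       # memory bank of (s, a, r, s_next) tuples
optimizer = torch.optim.Adam(eval_net.parameters(), lr=1e-3)

def learn(batch_size=32):
    """Sample a random batch from memory to break experience correlation, then update."""
    batch = random.sample(memory, batch_size)
    s, a, r, s_next = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q_estimate = eval_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    q_reality = r + gamma * target_net(s_next).max(dim=1).values.detach()
    loss = nn.functional.mse_loss(q_estimate, q_reality)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```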



4. Policy gradient

The input is the state and the output is a probability for each action. Built on a neural network, it makes certain actions easier to select by raising their probabilities, so as to maximize the probability of good moves. The final result is not one fixed path but a path sampled by probability, and the network can only be updated after a round (episode) has ended.
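A minimal REINFORCE-style sketch of this idea, assuming a small network and a made-up discount factor; the key point is that `update_episode` can only be called once the whole round's rewards are known.

```python
import torch
import torch.nn as nn

n_states, n_actions = 4, 2
policy = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def select_action(state):
    """Output a probability per action and sample one: the path is chosen by probability."""
    probs = torch.softmax(policy(state), dim=-1)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)

def update_episode(log_probs, rewards, gamma=0.99):
    """Only callable after the round ends: weight each log-prob by the return that follows it."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()   # maximize the probability of good moves
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```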


5. Actor-critic

Actor: a policy gradient network that makes the action

Critic: a Q-learning-style network that predicts the value of that action

Combining the two enables single-step updates, which is more efficient than the plain policy gradient, which only finds out whether each step was good or bad at the end of a round.
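A rough sketch of the single-step update under those two roles. Here the critic scores the move with a TD error on a state value (a common actor-critic variant; the article describes the critic as Q-learning-style), and all network shapes and learning rates are assumed.

```python
import torch
import torch.nn as nn

n_states, n_actions, gamma = 4, 2, 0.9
actor = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions))   # chooses actions
critic = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, 1))          # scores states
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def step_update(s, a, r, s_next):
    """Single-step update: no need to wait for the round to finish."""
    td_error = r + gamma * critic(s_next).detach() - critic(s)    # critic's judgement of the move
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    log_prob = torch.log_softmax(actor(s), dim=-1)[a]
    actor_loss = -(log_prob * td_error.detach()).mean()           # actor follows the critic's score
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```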


6. DDPG

Two actor networks and two critic networks, which exchange information with each other.
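A small sketch of the "two actors, two critics" structure, assuming the common interpretation that both the actor and the critic have a live copy plus a slowly updated target copy; `tau` and the network shapes are made up for illustration.

```python
import copy
import torch
import torch.nn as nn

n_states, n_actions, tau = 3, 1, 0.01
actor = nn.Sequential(nn.Linear(n_states, 32), nn.ReLU(), nn.Linear(32, n_actions), nn.Tanh())
critic = nn.Sequential(nn.Linear(n_states + n_actions, 32), nn.ReLU(), nn.Linear(32, 1))  # scores (s, a)
actor_target = copy.deepcopy(actor)      # second actor: slowly updated copy
critic_target = copy.deepcopy(critic)    # second critic: slowly updated copy

def soft_update(target, source):
    """The exchange of information: target parameters creep toward the live networks."""
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.copy_((1 - tau) * t.data + tau * s.data)

soft_update(actor_target, actor)
soft_update(critic_target, critic)
```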



Origin blog.csdn.net/Sun7_She/article/details/80735704