DQN: Intuitive Understanding Version

Reinforcement Learning and Neural Networks

The reinforcement learning methods we talked about before are relatively traditional. Now that machine learning shows up in so many parts of daily life, the various methods are also converging, merging, and upgrading. The method we discuss today is one such combination: it joins neural networks with Q-learning and is called Deep Q-Network (DQN). Why was this new structure proposed? Because traditional, table-based reinforcement learning runs into the following bottleneck.

The role of neural networks

In the original, tabular reinforcement learning, we used a table to store every state, the corresponding actions, and the Q value of each action.

The problem today is that environments are too complex: there can be more states than there are stars in the sky (think of Go or StarCraft). If we stored them all in a table, our computer would hardly have enough memory, and searching such a huge table for the current state every time would also be slow.

In machine learning, however, there is one tool that handles exactly this kind of problem well, and that is the neural network.

  • We can take the state and action as the input of the neural network and obtain the **Q value** of that action after the network's analysis. This way we no longer need to record Q values in a table; the network generates them directly.
  • Another form is to input only the state and output the values of all actions, then follow the Q-learning principle and directly pick the action with the maximum value as the next action. (Whether the network can output values for an indeterminate number of actions, how to judge which value is largest, or whether it should output action + Q value pairs is answered below; see also the sketch after this list.)
  • We can picture the neural network as receiving outside information, much like collecting input from the eyes, nose, and ears, then outputting the value of each action after "brain" processing, and finally choosing an action the reinforcement-learning way.
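
As a rough illustration of the second form, here is a minimal PyTorch sketch; the class name, layer sizes, and the 4-dimensional toy state are purely illustrative assumptions:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q value per action (the second form above)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per possible action
        )

    def forward(self, state):
        return self.net(state)              # shape: (batch, n_actions)

# Picking the greedy action, as Q-learning does:
q_net = QNetwork(state_dim=4, n_actions=2)  # toy sizes
state = torch.rand(1, 4)                    # a made-up state vector
action = q_net(state).argmax(dim=1)         # index of the largest Q value
```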

Updating the neural network

Next, we analyze things using the second form of the network (state in, values of all actions out).

We know that the neural network needs to be trained to predict accurate values. How is the neural network trained in reinforcement learning?

  • First, we need the correct Q values for a1 and a2; this "correct" value plays the role of the Q reality (the target) from the earlier Q-learning.
  • We also need a Q estimate in order to update the neural network.
  • So the new parameters of the neural network are the old parameters plus the learning rate alpha times the difference between the Q reality and the Q estimate, as written out below.
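
In the usual notation (this is the standard form of the rule the last bullet is paraphrasing, with θ the network parameters, α the learning rate, γ the discount factor, and s' the next state):

$$Q_{\text{reality}} = r + \gamma \max_{a'} Q(s', a'; \theta), \qquad \theta \leftarrow \theta + \alpha\,\big(Q_{\text{reality}} - Q(s, a; \theta)\big)\,\nabla_\theta Q(s, a; \theta)$$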

We use the network to predict the values of Q(s2, a1) and Q(s2, a2); these are the Q estimates. We then choose the action with the larger estimated value and take it in exchange for a reward from the environment. The Q reality also contains Q estimates from the network's analysis, but those are estimates for the next state s'. Finally, the network parameters are updated with the rule just described. This alone, however, is not the root reason DQN can play video games. Two other ingredients support DQN and make it extremely powerful: Experience Replay and Fixed Q-targets.
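
Put together, one update step along these lines might look like the following sketch (the function name, gamma, and the choice of a mean-squared-error loss are illustrative assumptions; `action` is expected to be a batch of integer action indices):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, state, action, reward, next_state, gamma=0.99):
    """One gradient step pulling Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
    q_estimate = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)    # Q estimate of the taken action
    with torch.no_grad():                                                  # the target is treated as a fixed label
        q_reality = reward + gamma * q_net(next_state).max(dim=1).values   # "Q reality"
    loss = F.mse_loss(q_estimate, q_reality)    # difference between Q reality and Q estimate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```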

Two powerful tools of DQN

To put it simply, DQN has a memory bank for learning from previous experiences. Q-learning is an off-policy learning method: it can learn from what it is experiencing right now, from what it experienced in the past, and even from other agents' experiences. So every time DQN updates, we can randomly sample some previous experiences from this memory bank to learn from.
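
Such a memory bank is typically just a fixed-size buffer that transitions are pushed into and sampled from at random; a minimal sketch, with the capacity and field names chosen arbitrarily:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores past transitions and hands back random mini-batches for learning."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # the oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # random sampling breaks up correlations
```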

Random sampling disrupts the correlation between experiences and makes the neural-network updates more efficient. Fixed Q-targets is another mechanism for disrupting correlation. With fixed Q-targets, DQN uses two neural networks with the same structure but different parameters: the network that predicts the Q estimate has the latest parameters, while the network that predicts the Q reality uses parameters from quite a while ago. With these two improvements, DQN is able to surpass humans in some games.
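
In code, "two networks with the same structure but different parameters" usually means keeping a frozen copy of the Q network, computing the Q reality with that copy instead of the constantly-updated network, and only refreshing the copy now and then; a minimal sketch, with the stand-in network and the refresh interval as assumptions:

```python
import copy
import torch
import torch.nn as nn

# A stand-in Q network (in practice this would be the QNetwork from the earlier sketch).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)          # same structure, a frozen snapshot of the parameters

def q_reality(reward, next_state, gamma=0.99):
    """Compute the target with the older target_net, not the constantly-updated q_net."""
    with torch.no_grad():
        return reward + gamma * target_net(next_state).max(dim=1).values

# Every so often (e.g. every few thousand updates), refresh the frozen copy:
target_net.load_state_dict(q_net.state_dict())
```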

Origin blog.csdn.net/weixin_43466027/article/details/116034934