A plain-language explanation of the DQN (Deep Q-Learning) reinforcement learning algorithm (tic-tac-toe and Gomoku examples)

Introduction

This article publishes the source code of a DQN-based self-playing algorithm for a tic-tac-toe (Jiugongge) game and a Gomoku game, and explains the ideas behind it.

Source address: https://gitee.com/lizhigong/DQN-9pointgame

While learning the DQN algorithm recently I took a lot of detours and stepped into a lot of pits. I am sorting things out here, partly to keep a record of my own learning process, and partly to share the lessons while they are still fresh.

In the code, the tic-tac-toe game based on a fully connected network (ANN) has already been trained.

There is also an 8×8 Gomoku game based on a CNN. You can try training it yourself; the results are quite good.

1. Introduction to Q-Learning

The idea behind Q-Learning is not very complicated, and many articles explain it in detail. Here is just a simple example rather than a full explanation.

For example, suppose the task is to choose the shortest way home. The agent may start in any of the boxes below, and the possible routes are shown in the figure.
[Figure: map of boxes and the routes connecting them to home]

So how to use Q-Learning to solve the problem of route selection?

1. Assign a number (a value) to every box.

2. When choosing the next step, move to the adjacent box with the highest value; following the values leads home along the shortest route (a small code sketch of this follows the value figure below).

The value figures are as follows:
[Figure: the same map with a value number written in each box]
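To make point 2 concrete, here is a minimal sketch of greedy route selection. The values and adjacency below are made up for illustration, not taken from the figure:

```python
# Minimal sketch of greedy route selection (values/adjacency are made up).
values = {"A": 30, "B": 50, "C": 70, "HOME": 100}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "HOME"], "HOME": []}

def next_box(current):
    # Step to the adjacent box with the highest value number.
    return max(neighbors[current], key=lambda box: values[box])

box = "A"
while box != "HOME":
    box = next_box(box)
    print("step to", box)   # A -> B -> C -> HOME
```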

So here comes the question: I filled in those numbers by hand, so how does a machine-learning method determine them?

1. Initialization (all boxes are 0)

2. Set the reward value (100 points for reaching home)

3. Pick an arbitrary box and start walking. At every step, look at the highest value among the boxes connected to the one you are in, multiply it by a coefficient, and write the result into your current box. (In the example figure there are only a few boxes, so a fixed 10 is subtracted instead, which keeps zeros from appearing; with many boxes, subtraction would leave lots of boxes at 0, and such boxes give no guidance for choosing a route, which is why a multiplicative coefficient is used in general.) Iterate over and over until the numbers stop changing. A minimal sketch of this iteration is shown below.
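Here is a minimal sketch of step 3, using a made-up one-dimensional corridor of boxes and a multiplicative coefficient of 0.9; the map and numbers are illustrative only:

```python
# Sketch of the value-propagation iteration (toy 1-D corridor, not the figure's map).
GAMMA = 0.9          # the coefficient mentioned in step 3
REWARD_HOME = 100    # reward for reaching home

boxes = ["A", "B", "C", "D", "HOME"]
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "HOME"], "HOME": []}
value = {b: 0.0 for b in boxes}        # step 1: initialize all boxes to 0
value["HOME"] = REWARD_HOME            # step 2: set the reward value at home

for _ in range(50):                    # iterate until the numbers settle
    for b in boxes:
        if b == "HOME":
            continue
        # take the best adjacent value and shrink it by the coefficient
        value[b] = GAMMA * max(value[n] for n in neighbors[b])

print({b: round(v, 1) for b, v in value.items()})
# e.g. {'A': 65.6, 'B': 72.9, 'C': 81.0, 'D': 90.0, 'HOME': 100}
```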

The Q-Learning formula then comes out naturally:

Q(S, A) = R + γ · max Q(S′, A′)

Each box here represents a state. Q(S, A) is the value of the target box, also called the transition value of the action that moves you to that position. This sounds convoluted, so beginners can simply think of it as the value of the box (the expected reward, the likelihood of obtaining the reward, and so on; the names vary but the meaning is the same). R is the reward value (100 points for reaching home). γ is the coefficient mentioned above; without it, every box would eventually become 100 and there would again be no way to choose a route. max Q(S′, A′) is the maximum value reachable from the target position on the next step, in other words the largest transition value available from that position. I hope this description is easy to follow.

The states and the transitions between them can be laid out in a value-transition table (the Q table). Iteratively improving the values in this table is the process called Q-Learning.
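As a sketch, such a Q table can be nothing more than a dictionary keyed by (state, action), updated with the simplified rule above. The names here are hypothetical, not from the repository:

```python
GAMMA = 0.9
Q = {}   # the Q table: (state, action) -> value

def q(state, action):
    return Q.get((state, action), 0.0)

def update(state, action, reward, next_state, next_actions):
    # Q(S, A) = R + gamma * max over A' of Q(S', A')
    best_next = max((q(next_state, a) for a in next_actions), default=0.0)
    Q[(state, action)] = reward + GAMMA * best_next
```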

2. Introduction to DQN

DQN, also called Deep Q-Learning, simply puts a "Deep" in front of Q-Learning. Q-Learning has a weakness: when there are too many states, the table becomes impossible to build. On a Gomoku board, for example, each position has three states (empty, black, and white), so a 10×10 board has 3^100 possible states. There is no way to build a Q table of that size to store the state-transition values.

DQN instead builds a neural network whose input is the current state and whose output is the state-transition values, or equivalently the Q values of the current state. Through many iterations of training, the network's output approaches the true Q values (approaches rather than equals: it is a neural network, after all, whose parameter count and storage footprint are far smaller than the full Q table; if it could match the table exactly, there would be little point in using it).
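A minimal sketch of such a network, assuming PyTorch (the repository may use a different framework); the class name and layer sizes are mine, and I assume the 3×3 board is fed as two flattened planes, one per player, matching the advice in section 4:

```python
import torch
import torch.nn as nn

# Hypothetical ANN for the 3x3 (Jiugongge) board: state in, one Q value per cell out.
class QNet(nn.Module):
    def __init__(self, board_cells=9, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(board_cells * 2, hidden),  # 2 planes: my pieces / opponent's pieces
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, board_cells),      # Q value for playing at each cell
        )

    def forward(self, state):                    # state: (batch, 18)
        return self.net(state)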

The training loss of the network is then the square of the difference between the predicted Q value and (reward + coefficient × max of the "real" Q values of the next step). The predicted Q value is the output of one forward pass of the network, and the "real" Q value is also a Q value predicted by a network. Why? Because every training step changes the network's output, and if the training target kept drifting, the network could never converge. So a second network with identical architecture is built just to generate the "real" Q values. This target network is not trained; every certain number of iterations it simply copies the parameters of the prediction network. It is like a mediocre teacher teaching a student: once the student has learned, the student becomes the teacher and teaches a new student, and each generation surpasses the last.
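A sketch of that loss and the periodic parameter copy, again assuming PyTorch; the names `policy_net`, `target_net`, and `dqn_loss` are mine, not the repository's:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.9

def dqn_loss(policy_net, target_net, state, action, reward, next_state, done):
    # state/next_state: (batch, features); action: LongTensor of chosen cells;
    # reward/done: float tensors of shape (batch,).
    q_pred = policy_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # The "real" (target) Q value comes from the frozen teacher network.
        q_next = target_net(next_state).max(dim=1).values
        q_target = reward + GAMMA * q_next * (1 - done)
    return F.mse_loss(q_pred, q_target)   # square of the difference

# Every N training iterations, the student becomes the teacher:
#   target_net.load_state_dict(policy_net.state_dict())
```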

The code in this article uses a different approach: it saves the predicted Q values during play, and after a game ends it uses them to train on each recorded step, so a single network is enough. It is like a diligent student who keeps reviewing and summarizing after every game and gradually gets stronger.
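A hedged sketch of that single-network variant: during a game each move's state, action, reward, and the predicted max Q of the following position are recorded, and the network is trained on the whole record once the game ends. Function and variable names are illustrative, not the repository's:

```python
import torch
import torch.nn.functional as F

history = []   # filled during a game with (state, action, reward, max_next_q)

def remember(state, action, reward, max_next_q):
    history.append((state, action, reward, max_next_q))

def review(policy_net, optimizer, gamma=0.9):
    # "Review" the finished game: train on every recorded move with one network.
    for state, action, reward, max_next_q in history:
        q_pred = policy_net(state.unsqueeze(0))[0, action]
        q_target = torch.tensor(reward + gamma * max_next_q)
        loss = F.mse_loss(q_pred, q_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    history.clear()
```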

3. The adversarial (self-play) algorithm

The Q-Learning algorithm introduced above solves a single-agent problem: how one agent obtains the maximum return at the minimum cost. Learning a two-player game is different. There are two agents, and the current state plus the current action can lead to many possible next states, because we do not know how the opponent will play. So what is fixed, given the current state and action? The state the opponent faces. Can I then predict the maximum Q value the opponent can achieve on their next move? And what is the relationship between the opponent's Q value and mine? In a zero-sum game, the opponent's advantage is my disadvantage and their disadvantage is my advantage, so I can multiply the opponent's Q value by a negative coefficient and use it to train my current Q value. That is the whole idea.
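In code, the only change from the single-agent target is the sign on the discounted term. This is a sketch; the exact coefficient is a design choice:

```python
GAMMA = 0.9

def adversarial_target(reward, opponent_max_next_q, game_over):
    # Single-agent target:  reward + GAMMA * max_next_q
    # Zero-sum self-play:   the next position belongs to the opponent, so their
    #                       best outcome counts against me (negative coefficient).
    if game_over:
        return reward
    return reward - GAMMA * opponent_max_next_q
```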

The training process is to first play a game against yourself, recording each move and the maximum Q value predicted at each step. After the game ends, the network "reviews" the whole game and is trained on the recorded moves and Q values.
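Putting the previous sketches together, one self-play episode might look like this; `env` is a hypothetical game wrapper and every name here is illustrative:

```python
def play_one_game(policy_net, env):
    # env: hypothetical wrapper with reset() -> state and step(a) -> (state, reward, done).
    records = []                                   # (state, action, reward, opponent_max_q)
    state = env.reset()
    done = False
    while not done:
        q_values = policy_net(state.unsqueeze(0))[0]
        action = int(q_values.argmax())            # exploration is added in section 4
        next_state, reward, done = env.step(action)
        # The next position is evaluated from the opponent's side of the board.
        opponent_max_q = 0.0 if done else float(policy_net(next_state.unsqueeze(0)).max())
        records.append((state, action, reward, opponent_max_q))
        state = next_state
    return records                                 # fed to the "review" training afterwards
```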

4. Things to pay attention to during training

Normally we would always choose the action with the largest Q value. That is fine for playing, but here we are training the network, and if we always pick the greedy move we easily fall into a rut: the winning side keeps beating the losing side with the same or similar lines, the loss drops quickly, yet the network still cannot play correctly, or it only masters one narrow style of play and has no answer to anything outside that routine. So we add randomness: some moves are taken greedily and some at random, but the maximum Q value is computed and saved at every step for the review training. Done this way, a tic-tac-toe agent that plays correctly is trained very quickly.
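A sketch of that random/greedy mix (epsilon-greedy in standard terminology); the EPSILON value is chosen for illustration, and the max Q value is recorded even when the move itself is random:

```python
import random

EPSILON = 0.2   # fraction of moves taken at random (illustrative value)

def choose_action(q_values, legal_actions):
    # q_values: Q value for every cell; legal_actions: indices of empty cells.
    best = max(legal_actions, key=lambda a: q_values[a])
    max_q = q_values[best]                 # saved for review training either way
    if random.random() < EPSILON:
        return random.choice(legal_actions), max_q
    return best, max_q
```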

It is best to place different pieces on different input channels. I found that if you encode the background as 0, white as 1, and black as 2 on a single board plane, the network fails to converge.
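A sketch of that channel encoding, using NumPy (assumed; the repository may encode the board differently): one plane for the current player's stones and one for the opponent's, instead of 0/1/2 on a single plane.

```python
import numpy as np

def encode_board(board, me, opponent):
    # board: 2-D array with 0 = empty, 1 = white, 2 = black.
    # Returns a (2, H, W) array: channel 0 = my stones, channel 1 = opponent's stones.
    return np.stack([(board == me).astype(np.float32),
                     (board == opponent).astype(np.float32)])

# Example for the 3x3 board:
board = np.array([[0, 1, 0],
                  [2, 1, 0],
                  [0, 0, 2]])
print(encode_board(board, me=1, opponent=2).shape)   # (2, 3, 3)
```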


Original post: blog.csdn.net/u014541881/article/details/128620775