Deep reinforcement learning - AlphaGo example explanation (5)

Now let's analyze the example of AlphaGo and see how deep reinforcement learning is used to play the game of Go

The main design ideas of AlphaGo:

The first part is training, which is done in 3 steps:

1. Behavior cloning: This is a kind of imitation learning. AlphaGo imitates human players and learns a policy network from 160,000 recorded human games. Behavior cloning is supervised learning, in fact multi-class classification, not reinforcement learning. AlphaGo uses behavior cloning to obtain an initial policy network.

2. Use reinforcement learning to further train the policy network, specifically with the policy gradient algorithm. AlphaGo lets the policy network play against itself and trains it with the win/loss result of each game. Reinforcement learning makes the policy network stronger.

3. Train a value network. AlphaGo does not use the actor-critic algorithm, which trains the value network and the policy network at the same time. AlphaGo trains the policy network first, and then uses the policy network to train the value network.

When AlphaGo played against Lee Sedol, it did not pick moves with the policy network alone; it used Monte Carlo tree search. During the search, the value network and the policy network guide the search and prune unnecessary actions.

 The architecture of the policy network & how to train the policy network:

The meaning of 17: the current positions of the black stones are represented with 1 matrix, and the black-stone positions at the previous 7 steps with another 7 matrices, so 8 matrices are needed for black. Similarly, 8 matrices are needed for white. The first 16 matrices describe the positions of the black and white stones. If the 17th matrix is all 1s, it is black's turn to play; if it is all 0s, it is white's turn.

The number 8 is a hyperparameter, chosen by experiment.

At this time, the state can be represented by this 19*19*17 tensor
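To make this concrete, here is a minimal sketch of how such a 19*19*17 state tensor could be assembled. The function name build_state_tensor and the NumPy representation are my own illustration, not AlphaGo's actual code.

```python
import numpy as np

def build_state_tensor(black_history, white_history, black_to_play):
    """Assemble the 19x19x17 state tensor described above (illustrative sketch).

    black_history / white_history: lists of the 8 most recent 19x19 binary
    matrices for each color (index 0 = current position, 1..7 = previous steps).
    black_to_play: True if it is black's turn to move.
    """
    planes = []
    planes.extend(black_history)          # 8 planes for black stones
    planes.extend(white_history)          # 8 planes for white stones
    # 17th plane: all ones if black is to play, all zeros otherwise
    planes.append(np.ones((19, 19)) if black_to_play else np.zeros((19, 19)))
    return np.stack(planes, axis=-1)      # shape (19, 19, 17)
```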

How to design the policy network?

Use a tensor to represent the state of the board, and use the tensor as the input of the policy network

Finally, use 1 or more fully connected layers to output a 361-dimensional vector

The activation function of the output layer must use softmax, because the output is a probability distribution

Go has at most 361 actions, so the output of the neural network is a 361-dimensional vector. Each element of the output vector corresponds to a board position where a stone can be placed, that is, an action, and the value of each element is the probability of that action.
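As an illustration of the architecture just described, here is a hedged PyTorch sketch of a policy network: a small convolutional trunk, a fully connected layer, and a softmax over the 361 actions. The layer sizes (64 channels, two conv layers) are placeholder choices, not the real AlphaGo architecture.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy policy network: conv trunk + fully connected head + softmax over 361 moves."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(17, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 19 * 19, 361),     # one logit per board position
        )

    def forward(self, state):                  # state: (batch, 17, 19, 19)
        logits = self.fc(self.conv(state))
        return torch.softmax(logits, dim=-1)   # probability distribution over 361 actions
```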

Since it takes too long to train a neural network from scratch, first learn directly from the recorded human games.

Behavior cloning only needs the policy network to imitate human actions. It does not require rewards, and it is not reinforcement learning. The difference between imitation learning and reinforcement learning is whether there is a reward.

Use a tensor to represent the position on the board:

The human action at*=281 is one-hot encoded into a 361-dimensional vector: all elements are 0 except the 281st, which is 1. This one-hot encoding is recorded as the vector yt.

Use cross-entropy to measure the difference between the human player's action yt and the policy network's prediction pt, and take it as the loss function.

If you think about it carefully, behavior cloning is multi-class classification. There are 361 positions on the board, so there are 361 classes. The output of the policy network is the probability of each class, and the human player's action, one of the 361, is treated as the ground-truth label. This problem is exactly the same as image classification: image classification has classes such as car, cat, and dog, while here the classes are the 361 positions; the target in image classification is the cat/dog label, while here the target is the position where the human player plays. So behavior cloning is multi-class classification with 361 classes.
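Under that multi-class view, one behavior-cloning update could look like the following sketch, assuming the PolicyNet sketched above; behavior_cloning_step is an illustrative name. The human move index plays the role of the one-hot target yt, and the loss is the cross-entropy described above.

```python
import torch

def behavior_cloning_step(policy_net, optimizer, states, human_moves):
    """One supervised update: make the policy imitate recorded human moves.

    states:      (batch, 17, 19, 19) float tensor of board states
    human_moves: (batch,) long tensor of board positions in [0, 360]
                 (the one-hot target y_t, stored as a class index)
    """
    probs = policy_net(states)                          # (batch, 361)
    # cross-entropy between the one-hot human action and the prediction
    loss = -torch.log(probs.gather(1, human_moves.unsqueeze(1)) + 1e-8).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```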

What is the biggest flaw of behavior cloning?

"The current state st does not appear in the training data (the policy network has not seen the current state st)

After reinforcement learning, even if the current state on the board is very strange, the policy network can cope well

Specifically, how to use reinforcement learning to train the policy network?

》AlphaGo lets two policy networks play against each other; one is called the Player and the other the Opponent.

Each time a game of Go finishes, the win/loss result is used as the reward, and the Player's parameters are updated with this reward. After the Player makes a move, the Opponent responds, which is equivalent to a random state transition. The Opponent is also controlled by a policy network, but the Opponent's parameters do not need to be learned.

So how do you define rewards?

All earlier rewards are 0; only the final reward is either -1 or +1.

How to understand intuitively?

》If the agent wins, every move is regarded as a good move; if the agent loses, every move is regarded as a bad move. Within one game we cannot tell which move was good and which was bad, so we treat every move equally and let the final result speak for itself, giving all actions the same reward.

After a game finishes, we know the value of ut, which is either +1 or -1. We also know the policy network Π's parameters θ = θt, so we can calculate the approximate policy gradient.

The ut here is the return: the rewards rt from step t to the end of the game added up. Since all intermediate rewards are 0, ut equals the final ±1 result.
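Putting this together, a REINFORCE-style update for one self-play game might look like the following sketch. It assumes the policy network returns action probabilities as above; policy_gradient_update and the trajectory format are my own illustration, not AlphaGo's exact procedure.

```python
import torch

def policy_gradient_update(policy_net, optimizer, trajectory, final_reward):
    """REINFORCE-style update after one self-play game.

    trajectory:   list of (state, action) pairs taken by the Player
                  (state: (17, 19, 19) tensor, action: int in [0, 360])
    final_reward: +1 if the Player won, -1 if it lost; every step of the
                  game receives this same return u_t.
    """
    loss = 0.0
    for state, action in trajectory:
        probs = policy_net(state.unsqueeze(0))           # (1, 361)
        log_prob = torch.log(probs[0, action] + 1e-8)
        # gradient ascent on u_t * log pi(a_t | s_t; theta)
        loss = loss - final_reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```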

There is still a small problem: the policy network may make mistakes that lose the game, that is, playing with the policy network alone is not stable enough. A better way than using the policy network directly is Monte Carlo tree search.

To do Monte Carlo tree search, a value network is also needed. The value network here is different from the earlier ones: it is an approximation to the state-value function V rather than to Q.

The latest AlphaGo Zero lets the two neural networks share the convolutional layers. This is because both networks take the 19*19*17 state tensor as input, and the lower convolutional layers extract features from that input which are useful for both networks, so it is reasonable to let them share the convolutional layers.

The output of the value network is a scalar: the score of the current state s, indicating the current chance of winning.
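A hedged sketch of such a shared-trunk design: one convolutional trunk feeding a policy head (361 probabilities) and a value head (a scalar in [-1, 1]). The exact layer sizes are placeholders, not the real AlphaGo Zero architecture.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared convolutional trunk with a policy head and a value head."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(17, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.policy_head = nn.Linear(64 * 19 * 19, 361)    # move probabilities
        self.value_head = nn.Sequential(                    # scalar win estimate
            nn.Linear(64 * 19 * 19, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),                     # output in [-1, 1]
        )

    def forward(self, state):                # state: (batch, 17, 19, 19)
        features = self.trunk(state)
        policy = torch.softmax(self.policy_head(features), dim=-1)
        value = self.value_head(features).squeeze(-1)
        return policy, value
```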

The policy network and the value network are trained separately, not at the same time: AlphaGo first trains the policy network Π and then trains the value network V with the help of Π, whereas the actor-critic algorithm trains the two networks simultaneously.

Training the value network is a regression problem: the observed return ut is the target, and the prediction of the value network is v(st, w).

So where does the policy network assist the value network?

When training the value network V, the first step is to use the policy network for self-play.
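After self-play produces game outcomes, the regression described above could be implemented roughly as follows. Here value_net is assumed to map a batch of states to a batch of scalar predictions v(st, w), for example the value head of the network sketched earlier; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def value_regression_step(value_net, optimizer, states, outcomes):
    """Regress the value network toward observed self-play outcomes.

    states:   (batch, 17, 19, 19) states visited during self-play games
    outcomes: (batch,) returns u_t (+1 win / -1 loss) from those games
    """
    predictions = value_net(states)            # v(s_t; w), shape (batch,)
    loss = F.mse_loss(predictions, outcomes)   # regression target is u_t
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```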

To recap:

1. First, use imitation learning to initially train the policy network on recorded human games

2. Use the policy gradient algorithm to further train the policy network

3. After finishing training the policy network, train a value network V separately

This completes the training of AlphaGo

So in actual play, does AlphaGo use the policy network or the value network?

"Neither, AlphaGo used Monte Carlo tree search in actual combat. Monte Carlo tree search does not require training and can be used directly to play chess with others. The two neural networks learned before are to help Monte Carlo tree search

When humans play Go, they look ahead several moves to see which line is more likely to win. This is why the AI should also look ahead rather than just use the policy function to pick one action. For example, an action may feel satisfying right now, but later you may fail the exam, and in the end the loss outweighs the gain: although the action is optimal at the moment, it may not be in the long run.

Main idea of ​​the search:

1. To choose an action a to explore, it should of course be chosen with different probabilities according to how good the action is. There are many feasible actions and it is impossible to enumerate them all, so bad actions must be excluded and only good actions searched. The policy network is used to rule out the bad actions (those with a relatively low probability value).

2. Let the policy network play a self-play game from there to the end, to see whether this playout wins or loses

3. Then score a according to two factors: the win/loss result and the value function

4. Repeat this process many times, so each action accumulates many scores

5. See which action has the highest total score; this score reflects the quality of the action, and AlphaGo executes the action with the highest total score

Monte Carlo tree search specifically does this:

Scores consist of two parts:

1. Q(a), the score computed by the search, called the action value. In the Go example, Q(a) is a table that records the scores of the 361 actions.

2. The other part is the score that the policy network Π gives to a, divided by (1 + N(a)), where N(a) is the number of times action a has been selected. The better the action, the higher the score the policy network gives to a, so this term is larger; but if action a has already been explored many times, the denominator (1 + N(a)) grows and reduces a's score, which prevents exploring the same action too many times. η is a hyperparameter that needs to be tuned manually (the full selection score is written out below).
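Putting the two parts together, the selection rule described above can be written as the following formula (a reconstruction from the text; η is the hyperparameter just mentioned):

$$
\mathrm{score}(a) \;=\; Q(a) \;+\; \eta \cdot \frac{\pi(a \mid s_t;\, \theta)}{1 + N(a)},
\qquad
a_t \;=\; \operatorname*{argmax}_{a} \ \mathrm{score}(a).
$$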

Each simulation proceeds in the following steps:

1. Selection: at the beginning all Q(a) = 0, so at first the policy function Π decides which action to explore. After many searches, N(a) becomes large and makes the second term very small, so the policy function Π no longer matters and which action to explore is determined entirely by Q(a).

2. Expansion: the opponent here is equivalent to the environment, and its policy plays the role of the state-transition function. The opponent's action produces a new state st+1. Although we do not know what the opponent is thinking, i.e., we do not know the true state-transition function p, we can use Π to approximate it.

3. Evaluation

Starting from state st+1, let the policy network play the rest of the game against itself: both sides are controlled by the policy network and place stones alternately until the winner is determined. This gives a reward rt, +1 for a win and -1 for a loss, and this reward rt is used to evaluate how good the state st+1 is.

Besides using the reward to evaluate st+1, AlphaGo also evaluates it with the value network V, which was trained earlier: the state st+1 is fed directly into V.

4. Backup: since the simulation is repeated many times, each state accumulates many records, and each action at has many such child nodes, so at corresponds to many records. All the records under at are averaged as at's new value Q(at).

What is this Q used for? The first step of Monte Carlo tree search is to select the best action to search, and the Q value is used in making that selection: Q(a) is the average of all the recorded values under a.

Suppose thousands of simulations have been done; by then it is obvious which action is better, and AlphaGo can make the real decision.

The larger an action a's Q value and Π score, the larger N(a) becomes, so N(a) reflects how good the action is. AlphaGo's decision is then very simple: select the action with the largest N value and execute it.

Before making each real move, AlphaGo performs thousands of simulations, each repeating the above four steps. After thousands of simulations, AlphaGo has the Q score and the count N for every action; it chooses the action with the largest N, executes it, and thus makes one real move. So one real move costs thousands of simulations.
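The four simulation steps and the final N-based decision can be summarized in the following simplified, single-level sketch. Here policy, value, and rollout are assumed helper functions (the policy network, the value network, and a policy-network self-play playout), combining the rollout result and the value network by a simple average is one possible choice, and the sketch ignores the tree depth of the real search.

```python
import numpy as np

def mcts_decision(policy, value, rollout, state, n_simulations=1600, eta=1.0):
    """Simplified, single-level sketch of the search described above.

    policy(state)     -> length-361 array of move probabilities (pi)
    value(state)      -> scalar estimate v(s) in [-1, 1]
    rollout(state, a) -> (next_state, outcome), where outcome is the +1/-1
                         result of a policy-network self-play playout after a
    """
    Q = np.zeros(361)          # average recorded value of each action
    N = np.zeros(361)          # how many times each action was explored
    prior = policy(state)      # policy network scores for the 361 actions

    for _ in range(n_simulations):
        # 1. Selection: pick the action with the highest score
        score = Q + eta * prior / (1.0 + N)
        a = int(np.argmax(score))
        # 2. Expansion: the (simulated) opponent responds, giving s_{t+1}
        next_state, outcome = rollout(state, a)
        # 3. Evaluation: combine the playout result and the value network
        record = 0.5 * (outcome + value(next_state))
        # 4. Backup: average all records under action a into Q(a)
        N[a] += 1
        Q[a] += (record - Q[a]) / N[a]

    # The real move is the action that was explored the most
    return int(np.argmax(N))
```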

When Lee Sedol made his move and it was AlphaGo's turn again, AlphaGo would run a Monte Carlo tree search again, re-initializing Q and N to 0 and then doing thousands of simulations.

Brief summary:

Training a value network is doing regression

While it is possible to play Go with the policy network alone, a better approach is Monte Carlo tree search

The old version imitates human players; the new version imitates Monte Carlo tree search

If it is in a virtual environment, behavior-cloning may be harmful, but in the physical world, it is still necessary because it can minimize losses

How does the new version of AlphaGo train the policy network?

1. Observing state st

2. Let the policy network make a prediction and output the probability of each action; record the policy network's output as a vector p, which is a 362-dimensional vector

3. Do Monte Carlo tree search with many simulations to get the number of times N(a) each action is selected; normalize these 361 numbers N(a) into probability values, denoted n

4. We hope the decision p made by the policy network is close to the decision n made by the search, that is, we minimize a loss L that measures the difference between n and p (a training-step sketch follows this list)

5. Use gradient descent to reduce Loss
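One update of this kind could look like the sketch below, again assuming the toy policy network from earlier (here with a 361-dimensional output for simplicity) and an illustrative function name. The loss is a cross-entropy-style measure of how far the prediction p is from the search distribution n.

```python
import torch

def imitate_search_step(policy_net, optimizer, state, visit_counts):
    """One update that pushes the policy p toward the search distribution n.

    state:        (1, 17, 19, 19) tensor for the observed position s_t
    visit_counts: (361,) tensor of N(a) from the Monte Carlo tree search
    """
    n = visit_counts / visit_counts.sum()      # normalize visits into probabilities
    p = policy_net(state)[0]                   # policy prediction, shape (361,)
    # cross-entropy-style loss L = -sum_a n(a) * log p(a)
    loss = -(n * torch.log(p + 1e-8)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```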

