How does AlphaGo, the Go AI developed by DeepMind, play Go?

In 2016, when DeepMind's Go program AlphaGo played its 37th move in the second game against Lee Sedol, the entire Go world was stunned. Go commentator Michael Redmond, a professional player with nearly a thousand top-level games behind him, was shocked during the live broadcast. He even took the stone off the demonstration board and studied the surrounding position, as if to confirm that AlphaGo had not simply played the wrong point. The next day, Redmond told the American Go E-Journal: "I still don't understand the idea behind this move." Lee Sedol, a master who had dominated the Go world for a decade, spent 12 minutes studying the move before making his response. Figure 13-1 shows this legendary move.

 

Figure 13-1 The legendary move AlphaGo played in the second game against Lee Sedol. This move shocked many professional Go players

This move completely violated traditional Go theory. A shoulder hit (a move played diagonally next to an opponent's stone) induces the opponent to extend along the side and build a solid wall. Players usually think of this as an even exchange: White takes territory along the side, while Black gains influence toward the center of the board. But this white stone was four lines away from the edge. If Black allowed White to build a solid wall that far from the edge, White would take far too much territory. (We owe an apology to any Go masters reading this; the description here is greatly simplified.) A shoulder hit on the fifth line looked a bit amateurish, at least it did to Go players before "Professor AlphaGo" went on to defeat the legend four games to one. After this shoulder hit, AlphaGo played many more unexpected moves. A year later, everyone from top professionals to amateur club players was trying to imitate the moves AlphaGo had used.

The second part of the book Deep Learning and the Game of Go introduces the machine learning and deep learning techniques behind AlphaGo, including tree search, neural networks, deep-learning bots, and reinforcement learning, as well as three more advanced reinforcement learning techniques: policy gradients, value methods, and actor-critic methods. The third part brings together everything prepared in the first two parts and guides readers through implementing their own AlphaGo, as well as its improved successor, AlphaGo Zero. After finishing the book, readers will have a comprehensive understanding of deep learning and of the technical details of AlphaGo, which lays a solid foundation for further study of AI theory and for broader AI applications.

To answer this question, the editor has selected part of the book's chapter on AlphaGo, the chapter that brings all of these pieces together.

We will go through all of the components that make up AlphaGo and understand how it works. AlphaGo ingeniously combines supervised deep learning on professional game records (what we covered in Chapters 5 to 8) with deep reinforcement learning on self-play data (introduced in Chapters 9 to 12), and then creatively uses these two deep networks to improve tree search. Readers may be surprised to find that we already know all of AlphaGo's building blocks. More precisely, we will walk through the following steps of the AlphaGo system in detail.

  • First, train two deep convolutional neural networks (the policy networks) for move prediction. Of these two architectures, one is deeper and produces more accurate predictions, while the other is shallower and can be evaluated faster. We call them the strong policy network and the fast policy network, respectively.
  • The strong policy network and the fast policy network use a more sophisticated board encoder with 48 feature planes. Their architectures are also deeper than those of the networks we saw in Chapters 6 and 7, but apart from that they should look familiar. Section 13.1 introduces AlphaGo's policy network architectures.
  • In Section 13.2, after the first training step of the policy networks is complete, we use the strong policy network as the starting point for self-play. If you apply a lot of computing power to this step, the bot improves dramatically.
  • In Section 13.3, we use this self-play-improved strong policy network to generate the data for a value network. This completes the network training stage; no further deep learning is needed afterward.
  • To actually play a game of Go, tree search serves as the basis of the playing strategy. But unlike the plain Monte Carlo rollouts of Chapter 4, here the fast policy network guides the rollouts. In addition, the output of the value network is used to balance the rollout results within the tree search algorithm (see the sketch after this list). We introduce this innovative technique in Section 13.4.
  • The whole process, from training the policy networks, through self-play, to playing with a search tree that surpasses human players, requires enormous computing resources and time. Section 13.5 offers a few notes on how AlphaGo achieves its strength and on what to expect realistically when running your own experiments.
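
As a preview of how the value network's output balances the rollout results, the original AlphaGo paper evaluates a leaf position in the search tree by mixing the value network's prediction with the outcome of a fast-policy rollout, weighted by a mixing parameter (0.5 in the paper). The following is only a minimal Python sketch of that idea; the helpers value_network_predict and rollout_outcome are hypothetical stand-ins, not the book's actual API, and the real bookkeeping in Section 13.4 is more involved.

LAMBDA = 0.5    # mixing weight between the rollout result and the value network output

def evaluate_leaf(game_state, value_network_predict, rollout_outcome):
    # value_network_predict(game_state) -> float in [-1, 1], the learned evaluation
    # rollout_outcome(game_state) -> +1 if the current player wins the fast-policy
    #                                rollout from this position, -1 otherwise
    value_estimate = value_network_predict(game_state)
    rollout_result = rollout_outcome(game_state)
    return (1 - LAMBDA) * value_estimate + LAMBDA * rollout_result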

Figure 13-2 summarizes the entire process we just listed. In this chapter, we will discuss each part of the diagram in depth and provide more details in each section.

 

Figure 13-2 How the three neural networks behind the AlphaGo AI are trained. First, starting from a collection of human game records, two neural networks are trained to predict the next move: one network is smaller and faster, the other is larger and more accurate. The larger network is then further improved through self-play. Self-play also provides the data for training a value network. Finally, AlphaGo uses all three networks together in a tree search algorithm to achieve very strong play

13.1 Training deep neural networks for AlphaGo

As introduced above, AlphaGo uses three neural networks: two policy networks and one value network. Although that may sound like a lot, in this section we will see that these networks and their input features are conceptually very close. The most surprising thing about the deep learning techniques used by AlphaGo is how familiar they are: this book has already covered them at length in Chapters 5 to 12. Before going into the details of how these neural networks are built and trained, let us first discuss the role each of them plays in the AlphaGo system.

  • Fast policy network. This move-prediction network is similar in scale to the networks trained in Chapters 7 and 8. Its purpose is not to be the most accurate move predictor, but to make predictions very quickly while remaining reasonably accurate. The tree-search rollouts introduced in Section 13.4 use this network, and as we learned in Chapter 4, rollouts must be generated quickly for them to be of any use in tree search. We will not discuss this network in much depth, focusing instead on the following two networks.
  • Strong policy network. This move-prediction network is optimized for accuracy, not speed. It is a convolutional network whose architecture is much deeper than that of the fast policy network, and it is roughly twice as accurate at predicting moves. Like the fast policy network, it is first trained on human game records, just as in Chapter 7. After that training, it serves as the starting point for self-play, which uses the reinforcement learning techniques introduced in Chapters 9 and 10 to improve it. This process makes the strong policy network even stronger.
  • Value network. Self-play with the strong policy network produces a new data set that can be used to train a value network. Specifically, the outcomes of these games are used, together with the techniques introduced in Chapters 11 and 12, to learn a value function. This value network plays a key role in Section 13.4.

13.1.1 AlphaGo's network architecture

Now that we have a basic understanding of the role these three deep neural networks play in AlphaGo, the next step is to show how to build them with the Keras library in Python. Before diving into the code, let's outline the architecture of these networks, as follows. If you need to review the terminology of convolutional networks, please revisit Chapter 7.

  • The strong policy network is a 13-layer convolutional network. All 13 layers produce 19×19 filters (feature maps); that is, the original board size is preserved throughout the network. As in Chapter 7, this requires padding the inputs to the network. The first convolutional layer has a kernel size of 5, and all subsequent layers have a kernel size of 3. The last layer uses a softmax activation function and has a single output filter; the first 12 layers use ReLU activation functions and have 192 output filters each.
  • The value network is a 16-layer convolutional network. Its first 12 layers are identical to those of the strong policy network. The 13th layer is an additional convolutional layer with the same structure as layers 2 to 12. The 14th layer is a convolutional layer with a kernel size of 1 and a single output filter. The network ends with two dense layers: one with 256 outputs and a ReLU activation function, the other with a single output and a tanh activation function.

As you can see, AlphaGo's policy and value networks are deep convolutional neural networks of the kind introduced in Chapter 6. The two networks are so similar that we can define them with a single Python function. Before doing that, let's look at a Keras convenience that significantly shortens the network definition. As mentioned in Chapter 7, we can use Keras's ZeroPadding2D utility layer to pad the input images. That works fine, but moving this functionality into the Conv2D layers saves a lot of boilerplate when defining the model. In both the value network and the policy network, the input to each convolutional layer is padded so that the output filters have the same size as the input (19×19). For example, with our earlier approach, the first layer takes the 19×19 input, and the second layer is a convolution with a kernel size of 5 whose output should again be a 19×19 filter, so the first layer has to pad the input up to a 23×23 image. Now we can simply tell the convolutional layer to preserve the input size by passing padding='same' when defining it; the layer then handles the padding itself. With this shortcut we can conveniently define the 11 layers shared by AlphaGo's policy and value networks, as shown in the code listing 13-1. You can find this definition in the file alphago.py in the dlgo.networks module of the book's GitHub repository.

Code Listing 13-1 Initialize the neural network for AlphaGo's policy network and value network

from keras.models import Sequential
from keras.layers.core import Dense, Flatten
from keras.layers.convolutional import Conv2D

def alphago_model(input_shape, is_policy_net=False,    ⇽---  This Boolean flag selects whether to build the policy network or the value network
                  num_filters=192,    ⇽---  All layers except the last convolutional layer use the same number of filters
                  first_kernel_size=5,
                  other_kernel_size=3):    ⇽---  The first layer has a kernel size of 5; all other layers use 3

    model = Sequential()
    model.add(
        Conv2D(num_filters, first_kernel_size, input_shape=input_shape,
               padding='same',
               data_format='channels_first', activation='relu'))

    for i in range(2, 12):    ⇽---  The first 12 layers of AlphaGo's policy and value networks are identical
        model.add(
            Conv2D(num_filters, other_kernel_size, padding='same',
                   data_format='channels_first', activation='relu'))

Note that we have not hard-coded the input shape of the first layer; it is left as a function argument because the shape differs slightly between the policy network and the value network. We will see this difference in Section 13.1.2, which introduces AlphaGo's board encoder. Continuing the definition of model, one final convolutional layer completes the strong policy network, as shown in the code listing 13-2.

Code Listing 13-2 Create AlphaGo's strong policy network in Keras

    if is_policy_net:
        model.add(
            Conv2D(filters=1, kernel_size=1, padding='same',
                   data_format='channels_first', activation='softmax'))
        model.add(Flatten())
        return model

As you can see, a Flatten layer is added at the end to flatten the preceding prediction output, keeping it consistent with the models defined in Chapters 5 to 8.

If you want to return AlphaGo's value network instead, you add two more Conv2D layers, one Flatten layer, and two Dense layers, as shown in the code listing 13-3.

Code Listing 13-3 Building AlphaGo's value network in Keras

    else:
        model.add(
            Conv2D(num_filters, other_kernel_size, padding='same',
                   data_format='channels_first', activation='relu'))
        model.add(
            Conv2D(filters=1, kernel_size=1, padding='same',
                   data_format='channels_first', activation='relu'))
        model.add(Flatten())
        model.add(Dense(256, activation='relu'))
        model.add(Dense(1, activation='tanh'))
        return model
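
With the function above you can also instantiate the value network directly and inspect it. This is just a quick sketch for orientation; it assumes the 49-plane board encoder described in Section 13.1.2 and the channels_first data format used in the listings.

value_input_shape = (49, 19, 19)    # 49 encoder planes, 19×19 board
alphago_value = alphago_model(value_input_shape, is_policy_net=False)
alphago_value.summary()    # ends with Dense(1, activation='tanh'): a score in [-1, 1]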

We do not discuss the architecture of the fast policy network in detail here. Its input features and network architecture involve more technical detail, but they do not add much to our understanding of the AlphaGo system. If you want to run your own experiments, you can simply use one of the networks already defined in the dlgo.networks module, such as small, medium, or large. The main point of the fast policy network is to have a network smaller than the strong policy network that can be evaluated quickly. Next, we dive into the details of the training process.
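
If you do want a concrete stand-in for the fast policy network, the sketch below builds a small move-prediction model in the style of Chapter 7. It assumes that dlgo.networks.small exposes a layers(input_shape) helper returning a list of Keras layers, as used there; if your copy of the repository differs, adjust accordingly.

from keras.models import Sequential
from keras.layers.core import Dense
from dlgo.networks import small    # assumed: exposes layers(input_shape) as in Chapter 7

def fast_policy_model(input_shape, num_classes=19 * 19):
    # Stack the small network's layers, then add a softmax move-prediction head.
    model = Sequential()
    for layer in small.layers(input_shape):
        model.add(layer)
    model.add(Dense(num_classes, activation='softmax'))
    return model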

13.1.2 The AlphaGo board encoder

Now that we have covered all the networks AlphaGo uses, let's discuss how AlphaGo encodes the board data. In Chapters 6 and 7 we implemented several board encoders, including oneplane, sevenplane, and simple, all stored in the dlgo.encoders module. The feature planes used by AlphaGo are more complex than these, but they are a natural continuation of the encoders we already know.

The board encoder used by AlphaGo's policy networks has 48 feature planes; the value network adds one more plane. These 48 planes encode 11 concepts, some of which we have already seen and others of which are new; we will discuss them one by one. In general, compared with our previous encoders, AlphaGo makes more use of Go-specific tactical knowledge. The most typical example is that the feature set includes the concept of a ladder, covering both ladder capture and ladder escape (see Figure 13-3).

 

 

Figure 13-3 AlphaGo directly encodes many Go tactical concepts into its feature planes, including ladders. In the first example, White has only one liberty left, which means Black could capture on the next turn. White can extend to gain a liberty, but Black can keep chasing and reduce White back to one liberty. This continues until the chase reaches the edge of the board, where the white stones are captured anyway. In the second case, if there is already a white stone along the path of the ladder, White may escape capture. AlphaGo has a feature plane dedicated to indicating whether a ladder works

All of the board encoders we built earlier use a technique called binary features, and AlphaGo adopts it as well. For example, to capture the concept of liberties (the empty points adjacent to a stone), we did not use a single feature plane holding the number of liberties of each stone on the board; instead, we used three binary planes to indicate whether a stone has 1, 2, or at least 3 liberties. The same approach appears in AlphaGo, except that it uses 8 feature planes for such binary counts. For liberties, this means 8 planes indicating whether each stone has 1, 2, 3, 4, 5, 6, 7, or at least 8 liberties.
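
To make the binary-count idea concrete, here is a minimal NumPy sketch (not the book's encoder implementation) of how the eight liberty planes could be filled. It assumes a board object in the style of the dlgo goboard module, where board.get_go_string(point) returns the connected string of stones at a point (or None for an empty point) and the string exposes a num_liberties attribute.

import numpy as np

def liberty_planes(board, board_size=19, max_liberties=8):
    # Plane k is set to 1 wherever a stone's string has k+1 liberties;
    # the last plane covers "max_liberties or more".
    planes = np.zeros((max_liberties, board_size, board_size))
    for row in range(board_size):
        for col in range(board_size):
            go_string = board.get_go_string((row + 1, col + 1))    # assumed accessor
            if go_string is None:    # empty point: all planes stay 0
                continue
            liberties = min(go_string.num_liberties, max_liberties)
            planes[liberties - 1, row, col] = 1
    return planes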

One difference between AlphaGo and the encoders introduced in Chapters 6 to 8 is that AlphaGo encodes the stone colors explicitly in separate feature planes. Looking back at the sevenplane encoder in Chapter 7, its liberty planes were split into planes for black stones and planes for white stones. AlphaGo instead uses just one set of liberty features, and all features are expressed relative to the player whose turn it is. For example, the "capture size" feature set (which records how many stones a move would capture) counts only captures made by the current player, regardless of whether that player is Black or White.

Table 13-1 summarizes all the feature planes used by AlphaGo. The first 48 planes are used by the policy networks; the last one is used only by the value network.

Table 13-1 Feature plane used by AlphaGo (omitted)

The implementation of these features can be found in the file alphago.py in the dlgo.encoders module of the book's GitHub repository. Although none of the feature sets is particularly difficult to implement, they are not very interesting compared with the other parts of AlphaGo we will cover. The trickiest parts are implementing the ladder plane and encoding how many turns ago each move was played, which requires a small modification to the Go board definition. Readers interested in these implementations can refer to the code on GitHub.

Let's look at how AlphaGoEncoder is initialized before applying it to the training of the deep neural networks. It takes a Go board size parameter and a Boolean parameter use_player_plane (indicating whether to include the 49th plane). Listing 13-4 shows its signature and initialization.

Code Listing 13-4 Signature and initialization of the AlphaGo board encoder

from dlgo.encoders.base import Encoder    # base class for all board encoders in dlgo

class AlphaGoEncoder(Encoder):
    def __init__(self, board_size=(19, 19), use_player_plane=False):
        self.board_width, self.board_height = board_size
        self.use_player_plane = use_player_plane
        self.num_planes = 48 + use_player_plane
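
As a quick usage check (a sketch, not one of the book's listings), you can instantiate the encoder for the policy network and for the value network and compare the number of planes:

policy_encoder = AlphaGoEncoder(board_size=(19, 19))                          # 48 planes
value_encoder = AlphaGoEncoder(board_size=(19, 19), use_player_plane=True)    # 49 planes
print(policy_encoder.num_planes, value_encoder.num_planes)    # prints: 48 49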

13.1.3 Training AlphaGo-style policy networks

With the network architecture and input features in place, we can start training the policy networks for AlphaGo. The first step is exactly the same as the process in Chapter 7: specify a board encoder and an agent, load the game record data, and use that data to train the agent. Figure 13-4 shows this process. Although we are using more complex features and networks, the process itself is exactly the same.

 

 

Figure 13-4 The supervised training process of AlphaGo's policy network is exactly the same as the one introduced in Chapters 6 and 7. We replay human game records to regenerate a series of game states. Each game state is encoded as a tensor (the figure shows a tensor with only two planes, whereas AlphaGo actually uses 48). The training target is a vector of the same size as the board, with a 1 at the point where the move was actually played

To initialize and train AlphaGo's strong policy network, first initialize an AlphaGoEncoder and then create two Go data generators, one for training and one for testing, as shown in the code listing 13-5. This step is the same as in Chapter 7. The code for this step can be found in the examples/alphago/alphago_policy_sl.py file on GitHub.

Code Listing 13-5 Load data for the first training of AlphaGo's policy network

from dlgo.data.parallel_processor import GoDataProcessor
from dlgo.encoders.alphago import AlphaGoEncoder
from dlgo.agent.predict import DeepLearningAgent
from dlgo.networks.alphago import alphago_model

from keras.callbacks import ModelCheckpoint
import h5py

rows, cols = 19, 19
num_classes = rows * cols
num_games = 10000

encoder = AlphaGoEncoder()
processor = GoDataProcessor(encoder=encoder.name())
generator = processor.load_go_data('train', num_games, use_generator=True)
test_generator = processor.load_go_data('test', num_games, use_generator=True)

Next, we can use the alphago_model function defined earlier in this section to build the AlphaGo policy network, and compile the Keras model with the categorical cross-entropy loss function and stochastic gradient descent, as shown in the code listing 13-6. We call this model alphago_sl_policy to indicate that it is a policy network trained with supervised learning (sl is short for supervised learning).

Code Listing 13-6 Use Keras to create an AlphaGo policy network

input_shape = (encoder.num_planes, rows, cols)
alphago_sl_policy = alphago_model(input_shape, is_policy_net=True)

alphago_sl_policy.compile('sgd', 'categorical_crossentropy', metrics=['accuracy'])

Now only the last step of this first training stage remains. As in Chapter 7, call fit_generator on this policy network with the training and test generators. Apart from the larger network and the more complex encoder, everything is exactly the same as in Chapters 6 to 8.

After training has finished, we can create a DeepLearningAgent from the model and the encoder and store it (as shown in the code listing 13-7) for use in the two training phases discussed later.

Code Listing 13-7 Training a policy network and persistent storage

epochs = 200
batch_size = 128
alphago_sl_policy.fit_generator(
    generator=generator.generate(batch_size, num_classes),
    epochs=epochs,
    steps_per_epoch=generator.get_num_samples() / batch_size,
    validation_data=test_generator.generate(batch_size, num_classes),
    validation_steps=test_generator.get_num_samples() / batch_size,
    callbacks=[ModelCheckpoint('alphago_sl_policy_{epoch}.h5')]
)

alphago_sl_agent = DeepLearningAgent(alphago_sl_policy, encoder)

with h5py.File('alphago_sl_policy.h5', 'w') as sl_agent_out:
    alphago_sl_agent.serialize(sl_agent_out)

For the sake of brevity, in this chapter we do not train the strong policy network and the fast policy network separately, as described in the AlphaGo paper. Instead of training a separate smaller, faster policy network, we simply use alphago_sl_agent as the fast policy network as well. The next section introduces how to use this agent as the starting point for reinforcement learning, in order to produce an even stronger policy network.
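
To pick the stored agent back up for that reinforcement learning phase, it can be deserialized from the HDF5 file again. The following is a minimal sketch; it assumes a load_prediction_agent helper in dlgo.agent.predict that mirrors the serialize() call above (check the book's GitHub repository for the exact loading API).

import h5py
from dlgo.agent.predict import load_prediction_agent    # assumed counterpart to serialize()

with h5py.File('alphago_sl_policy.h5', 'r') as sl_agent_in:
    alphago_sl_agent = load_prediction_agent(sl_agent_in)    # reload as the starting point for self-play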

If you want to keep exploring after reading this, you can read Deep Learning and the Game of Go.

 

1. This book is a practical, hands-on introduction to artificial intelligence. It turns AlphaGo, one of the most exciting milestones in the field of artificial intelligence, into an excellent introductory course;
2. All code is implemented in Python with the Keras deep learning framework;
3. The content is comprehensive and carefully structured, covering essentially all of the theory behind AlphaGo;
4. Complete companion source code is provided.

Go, an ancient strategy game, is a particularly apt case study for AI research. In 2016, a system based on deep learning defeated the world champion of Go and shocked the entire Go world. Soon after, an upgraded version of this system, AlphaGo Zero, used deep reinforcement learning to master Go on its own and easily defeated its original version. By reading this book, readers can learn the deep learning techniques behind these systems and build their own Go bot!

This book teaches deep learning techniques by walking readers through building a Go bot. As the book progresses, readers adopt more sophisticated training methods and strategies using the Python deep learning library Keras. Readers can watch their bot master the game of Go, and find out how to apply the deep learning techniques they have learned to a wide range of other scenarios.
