Gobang algorithm design based on reinforcement learning: a complete Python implementation

Table of Contents

1 Curriculum design purpose

2 Design tasks and requirements

3 Design principles

3.1 Reinforcement learning

3.2 Monte Carlo tree search

4 Model introduction

4.1 Simulation

4.2 Runner

4.3 Neural Network

5 Simulation process and results

Reference


Download address for directly runnable source code: https://download.csdn.net/download/weixin_43442778/15877803

Video demo link: https://live.csdn.net/v/157249

1 Curriculum design purpose

Because of the epidemic, I faced a long stretch of boring home isolation, so why not play a game against myself: "Let's play a game of Gomoku!" Through the design of a Gobang (Gomoku) algorithm, this course design deepens the understanding and application of the reinforcement learning concepts in machine learning.

2 Design tasks and requirements

Google's artificial intelligence company DeepMind released a paper describing how the team used the machine learning system behind AlphaGo to build a new project, AlphaZero. AlphaZero uses an AI technique called reinforcement learning: starting only from the basic rules, with no human game experience, it trains from scratch and sweeps board-game AIs. The tasks of this course design are as follows:

1. Provide a video of "playing against your own program", and add a unique label to your chessboard as a demonstration that the program is your own (anti-plagiarism), for example, chess pieces with your own design.

2. Fill out the course design report according to the provided template.

Figure 2.1: Gobang Book

3 Design principles

3.1 Reinforcement learning

Reinforcement learning here consists of two main parts: the environment and the policy. The environment is made up of three elements (state, action, and reward). In plain terms, the environment is a black-box function whose input is an action and whose output is the current state together with the reward for the previous action. Taking Go as an example, the positions of the stones on the current board are the state; if we choose to play a move, the state of the board changes (one more stone), and how good or bad that move turns out to be is our reward. The policy is abstracted here as a function that takes a state as input and outputs an action. The policy is closer to the human thinking process: a chess player (the policy) observes the state of the board and makes a move (an action). Reinforcement learning can therefore be understood as searching for a function from states to actions that maximizes the reward fed back by the environment.
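As a minimal sketch of this environment/policy split for Gomoku: the names GomokuEnv and random_policy are illustrative, and the simple "win gives +1" reward is an assumption for the example, not the project's actual reward design.

```python
import numpy as np

class GomokuEnv:
    """Black-box environment: holds the board state, accepts an action,
    and returns the new state plus the reward of that action."""
    def __init__(self, size=15):
        self.size = size
        self.board = np.zeros((size, size), dtype=int)  # 0 empty, 1 / -1 players
        self.current_player = 1

    def legal_actions(self):
        # every empty intersection is a legal move
        return [(r, c) for r in range(self.size) for c in range(self.size)
                if self.board[r, c] == 0]

    def step(self, action):
        """Place a stone; reward is +1 if this move wins, otherwise 0."""
        r, c = action
        self.board[r, c] = self.current_player
        reward = 1.0 if self._wins(r, c) else 0.0
        self.current_player = -self.current_player
        return self.board.copy(), reward

    def _wins(self, r, c):
        # check five in a row through the last move, in all four directions
        player = self.board[r, c]
        for dr, dc in ((1, 0), (0, 1), (1, 1), (1, -1)):
            count = 1
            for sign in (1, -1):
                rr, cc = r + sign * dr, c + sign * dc
                while 0 <= rr < self.size and 0 <= cc < self.size and self.board[rr, cc] == player:
                    count += 1
                    rr, cc = rr + sign * dr, cc + sign * dc
            if count >= 5:
                return True
        return False

def random_policy(env):
    """A policy is just a function from state to action."""
    actions = env.legal_actions()
    return actions[np.random.randint(len(actions))]
```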

3.2 Monte Carlo tree search

The following introduces Monte Carlo Tree Search (MCTS) combined with reinforcement learning, which differs from ordinary Monte Carlo tree search. MCTS is essentially a tree of interconnected nodes, where each node here represents a board state. Suppose the board size is 15*15 and the initial board state (nothing on the board) is the root node; the root then theoretically has 225 child nodes, representing the board states after the first player plays at each of the 225 positions. Each of those 225 child nodes can in turn have up to 224 child nodes of its own, and so on. Each node stores its visit count (a counter) and a Q value for each of its child nodes (each child node corresponds to an action, i.e. a move, relative to its parent; the Q value can be briefly understood as a score for how good that action is).
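A minimal sketch of such a node (the names are illustrative; the project's own class may differ): each node stores its visit count N, its running value estimate Q, a prior P from the policy network, and a dictionary of children keyed by action.

```python
class TreeNode:
    def __init__(self, parent=None, prior=1.0):
        self.parent = parent
        self.children = {}   # action (board position) -> TreeNode
        self.N = 0           # visit count
        self.Q = 0.0         # mean of the values backed up through this node
        self.P = prior       # prior probability from the policy network

    def expand(self, action_priors):
        """Create one child per legal action with its prior probability."""
        for action, prob in action_priors:
            if action not in self.children:
                self.children[action] = TreeNode(parent=self, prior=prob)

    def update(self, leaf_value):
        """Running average of the values backed up through this node."""
        self.N += 1
        self.Q += (leaf_value - self.Q) / self.N
```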

4 Model introduction

First of all, during training, every time the model makes a move there are two processes: one is simulation, and the other is play-out. Simulation can be understood as the prediction (look-ahead) performed before we officially move, and the actual move (play-out) is then made according to the result of that prediction.

4.1 Simulation

Unlike traditional Monte Carlo tree search, the simulation in AlphaZero is guided by the output of the neural network. The traditional MCTS algorithm is shown in the figure below (image from Wikipedia).

Figure 3.1: Traditional MCTS algorithm

The simulation process is usually carried out many times; let us first walk through a single pass. For traditional Monte Carlo tree search, the first step is to select an "optimal" action ("optimal" is in quotes because it is not truly the best action, but a measure combining the current estimate of the action with a confidence term; this is discussed later with the UCB formula). We keep selecting the "optimal" action until we reach a node and pick an action that has never been visited before, so that no node exists for that action yet. At this point the selection part in the figure is complete. We then enter the expansion part, where we create a new node; note that only one node is created per simulation. Next comes the simulation part of MCTS (note that this "simulation" is not the same as the simulation mentioned above; keep the two apart). Traditional MCTS runs a random policy here, which on a board amounts to playing random moves until the game ends. When the random play finishes, we reach the backpropagation part: the result of the random play (visit counts, wins and losses, and so on) is propagated to every node visited in this simulation, that is, we recursively update the parent nodes of all the child nodes on the path. (For a node, executing an action advances to the node below it; the node below is its child node, and relative to the child node it is the parent node. The node (12/21) in the figure is the parent of nodes (7/10), (5/8), and (0/3), and these three nodes are child nodes of (12/21).)
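To connect the four phases in the figure with code, here is a compact sketch of one traditional MCTS iteration. It reuses the TreeNode sketch above and assumes a hypothetical game interface (clone(), legal_moves(), play(move), is_over(), player_to_move(), and result(player) returning +1 / 0 / -1 for that player); it is an illustration, not the project's exact code.

```python
import math
import random

def ucb1(child, c=1.4):
    """Classic UCB1 score: current value estimate plus an exploration bonus."""
    if child.N == 0:
        return float("inf")  # unvisited children are tried first
    return child.Q + c * math.sqrt(math.log(child.parent.N + 1) / child.N)

def untried_moves(node, sim):
    return [a for a in sim.legal_moves() if a not in node.children]

def mcts_iteration(root, game):
    node, sim = root, game.clone()

    # 1. Selection: while the node is fully expanded, descend by UCB1.
    while not sim.is_over() and not untried_moves(node, sim):
        action, node = max(node.children.items(), key=lambda kv: ucb1(kv[1]))
        sim.play(action)

    # 2. Expansion: create exactly one new node for an untried move.
    if not sim.is_over():
        mover = sim.player_to_move()
        action = random.choice(untried_moves(node, sim))
        child = TreeNode(parent=node)
        child.player_just_moved = mover      # remember whose move led here
        node.children[action] = child
        sim.play(action)
        node = child

    # 3. Rollout: play uniformly random moves until the game ends.
    while not sim.is_over():
        sim.play(random.choice(sim.legal_moves()))

    # 4. Backpropagation: push the result up through every node on the path,
    #    scoring each node from the perspective of the player who moved into it.
    while node is not None:
        mover = getattr(node, "player_just_moved", None)
        if mover is not None:
            node.update(sim.result(mover))   # +1 win, 0 draw, -1 loss for `mover`
        else:
            node.N += 1                      # root node: just count the visit
        node = node.parent
```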

The MCTS we use here differs from the traditional version. First, in the selection phase we select nodes under the guidance of the neural network: each time, we select the child node with the largest upper confidence bound (UCB) value.

The UCB formula used for selection is:

$$U(s,a) = Q(s,a) + c \cdot P(s,a) \cdot \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$$

Here UCB is a formula that balances the estimated value of an action against confidence (or uncertainty). In plain terms, both actions that have not been explored yet and actions that have been explored many times with high feedback receive a large UCB value. The Q value is the value learned in the Monte Carlo tree; it is updated as the average of the values backed up in previous searches. The P value is the prior probability computed by the neural network, N is the number of times the action has been explored, and c is a hyperparameter used to adjust the model's balance of exploration. The characteristic of this formula is that it assigns a high value both to actions that are repeatedly explored with good rewards and to actions that lack exploration, which balances exploration against exploitation well.
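As a small illustration, the score can be computed from the N, Q, P fields of the TreeNode sketch above; parent.N stands in for the total visit count of the parent, and the default value of c_puct is illustrative, not the project's setting.

```python
import math

def puct_score(parent, child, c_puct=5.0):
    """Upper confidence bound: Q(s,a) + c * P(s,a) * sqrt(N(s)) / (1 + N(s,a))."""
    return child.Q + c_puct * child.P * math.sqrt(parent.N) / (1 + child.N)

def select_child(node, c_puct=5.0):
    """Pick the (action, child) pair with the largest upper confidence bound."""
    return max(node.children.items(), key=lambda kv: puct_score(node, kv[1], c_puct))
```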

The second difference is in the backpropagation stage of MCTS: if the game at the current node has no winner or loser yet, the node is scored by the neural network, and the value fed back is this network value instead of the result of a random rollout. What exactly this value is will be explained in the neural network section.
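A short sketch of this modified backpropagation, again on top of the TreeNode above; policy_value_fn (returning move priors and a value), sim.terminal_value(), and sim.board_state() are hypothetical helpers, not names from the project.

```python
def evaluate_and_backup(node, sim, policy_value_fn):
    """Score a leaf with the network (or the real result) and back it up."""
    if sim.is_over():
        # terminal leaf: use the actual game result, from the viewpoint of
        # the player to move at the leaf (hypothetical helper)
        leaf_value = sim.terminal_value()
    else:
        # non-terminal leaf: ask the network for move priors and a value estimate
        action_priors, leaf_value = policy_value_fn(sim.board_state())
        node.expand(action_priors)

    # back the value up the tree, flipping the sign at each level
    # because the two players alternate (zero-sum game)
    while node is not None:
        node.update(-leaf_value)
        leaf_value = -leaf_value
        node = node.parent
```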

4.2 Runner

When the simulation process is over, we enter the play-out process. In this step, we use the ratio between the visit counts of the different actions as our actual probability distribution, and we choose our move according to that distribution (a little noise can be added to the probabilities for some extra exploration). This probability distribution needs to be saved, since it is used later when training the neural network. We follow the cycle "simulate, move, simulate, move" until the game ends. At the end of the game, we save the game record, which can then be used for neural network training.
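A minimal sketch of this step, using the root node from the MCTS sketches above; the temperature, noise weight, and Dirichlet alpha are illustrative values, not the project's settings.

```python
import numpy as np

def action_probabilities(root, temperature=1.0):
    """Turn the children's visit counts into a probability distribution."""
    actions = list(root.children.keys())
    visits = np.array([root.children[a].N for a in actions], dtype=np.float64)
    probs = visits ** (1.0 / temperature)
    probs /= probs.sum()
    return actions, probs

def choose_move(root, temperature=1.0, noise_eps=0.25, dirichlet_alpha=0.3):
    actions, probs = action_probabilities(root, temperature)
    # during self-play, mix in a little Dirichlet noise for extra exploration
    noisy = (1 - noise_eps) * probs + noise_eps * np.random.dirichlet(
        dirichlet_alpha * np.ones(len(actions)))
    move = actions[np.random.choice(len(actions), p=noisy)]
    return move, probs   # the clean distribution is what gets stored for training
```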

4.3 Neural Network

The input of the neural network is the current board. The original AlphaGo Zero input is a large 17 x 19 x 19 tensor: 8 planes for the history of one's own stones, 8 planes for the history of the opponent's stones, and one plane indicating the current player. We simplify this here: we keep the current-player plane and compress our own history and the opponent's history into one channel each, so our input is a 3 x board_size x board_size tensor. The output has two heads: the value of the current board state, and the probability of a move at each position of the current board.
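To make the input/output description concrete, here is a minimal sketch of such a two-headed network, assuming PyTorch (the post does not fix the framework, and the trunk and layer sizes here are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    def __init__(self, board_size=15):
        super().__init__()
        # shared convolutional trunk over the 3-channel board encoding
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # policy head: one probability per board position
        self.policy_conv = nn.Conv2d(64, 4, kernel_size=1)
        self.policy_fc = nn.Linear(4 * board_size * board_size, board_size * board_size)
        # value head: a single scalar in [-1, 1]
        self.value_conv = nn.Conv2d(64, 2, kernel_size=1)
        self.value_fc1 = nn.Linear(2 * board_size * board_size, 64)
        self.value_fc2 = nn.Linear(64, 1)

    def forward(self, x):                      # x: (batch, 3, H, W)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        p = F.relu(self.policy_conv(x)).flatten(1)
        log_probs = F.log_softmax(self.policy_fc(p), dim=1)  # log move probabilities
        v = F.relu(self.value_conv(x)).flatten(1)
        v = torch.tanh(self.value_fc2(F.relu(self.value_fc1(v))))  # scalar value
        return log_probs, v
```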

The loss function used for the update is defined as:

$$l = (z - v)^2 - \pi^\top \log \mathbf{p} + c \lVert \theta \rVert^2$$

Here v is the value output and z is the actual result when the game is over: for the current player, z is +1 for a win and -1 for a loss, so the first term is a regression loss, and after training on enough data the neural network learns a macroscopic estimate of the probability of winning or losing. The second term is the cross-entropy between the move probabilities output by the neural network and the probability distribution explored by MCTS, and the third term is an L2 regularization term.

Here we choose plain SGD for optimization. Although Adam has advantages such as faster convergence, its final convergence quality is not necessarily better than SGD's.
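A hedged sketch of one training step under these choices, assuming PyTorch as above; the learning rate, momentum, and weight decay are illustrative values.

```python
import torch

def train_step(net, optimizer, states, mcts_probs, winners_z):
    """states: (B, 3, H, W); mcts_probs: (B, H*W); winners_z: (B,) in {+1, -1}."""
    optimizer.zero_grad()
    log_probs, values = net(states)
    value_loss = torch.mean((winners_z - values.view(-1)) ** 2)          # (z - v)^2
    policy_loss = -torch.mean(torch.sum(mcts_probs * log_probs, dim=1))  # cross-entropy
    loss = value_loss + policy_loss       # the L2 term comes from weight_decay below
    loss.backward()
    optimizer.step()
    return loss.item()

# plain SGD, with the L2 regularization expressed as weight_decay
# optimizer = torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)
```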

 

5 Simulation process and results

Figure 5.1 Start interface

 

Figure 5.2 Final result

Reference

  1. https://zhuanlan.zhihu.com/p/59567014
