Monte Carlo Tree Search (MCTS) in AlphaGo Zero

1. Description

        Monte Carlo Tree Search (MCTS) is a search algorithm mainly used for decision problems. Its core idea is to estimate the winning rate of each node by simulating the game process, and then select the best move accordingly.

        Specifically, the Monte Carlo tree search algorithm includes the following steps:

  1. Build a tree whose root node represents the current state and whose other nodes represent the positions reachable through feasible moves.

  2. Starting from the root node, simulate the game from each child node (for Go, for example, the rest of the game can be played out with random moves), and record the number of wins and visits of each child node.

  3. Compute the winning rate of each child node and select the child with the highest rate as the next decision.

  4. Repeat the above steps until the search budget or the time limit is reached.
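        As a rough illustration only, the loop above can be sketched in Python as follows. Node (with its children, visits, wins, state, parent fields and expand() method) and simulate_random_game are hypothetical helpers introduced just for this sketch, not part of any particular library.

    import math

    def ucb1(parent_visits, child):
        # win rate plus an exploration term that shrinks as the child is visited more
        if child.visits == 0:
            return float("inf")
        return (child.wins / child.visits
                + math.sqrt(2.0 * math.log(parent_visits) / child.visits))

    def mcts(root, n_iterations=1000):
        for _ in range(n_iterations):
            # 1. selection: walk down the tree following the UCB1 scores
            node = root
            while node.children:
                node = max(node.children, key=lambda c: ucb1(node.visits, c))
            # 2. expansion: add child nodes for the feasible moves of this position
            node.expand()
            # 3. simulation: play the rest of the game with random moves
            result = simulate_random_game(node.state)
            # 4. backpropagation: update win and visit counts along the path
            while node is not None:
                node.visits += 1
                node.wins += result
                node = node.parent
        # pick the child of the root with the best win rate
        return max(root.children, key=lambda c: c.wins / max(c.visits, 1))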

        The Monte Carlo tree search algorithm is widely used in artificial intelligence, especially for board games such as Go.

        In the game of Go, AlphaGo Zero uses MCTS to build a local policy from which it samples the next move.

        MCTS searches possible moves and records the results in a search tree. The tree and its statistics grow as more searches are performed. To make a move, AlphaGo Zero performs 1,600 searches, then builds a local policy from the tree, and finally samples the next move from this policy.

2. Basic knowledge of MCTS

        In this example, the current board position is  s₃ .

        In MCTS, nodes represent board positions and edges represent movements.

        For a given position, a deep network f is used to compute:

  • the policy p (a probability distribution scoring each possible move), and
  • the value function v (an estimate of how likely the current player is to win from that position).

        In the example below, a deep network f, composed of convolutional layers, is applied to s₃ to compute the policy p and the value function v.
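        As an illustration only, a heavily simplified network of this kind could look as follows in PyTorch. The board size, the single input plane and the layer sizes are assumptions made for this sketch; the actual AlphaGo Zero network uses many more input planes and deep residual blocks.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PolicyValueNet(nn.Module):
        # Simplified stand-in for f: a few conv layers feeding a policy head and a value head.
        def __init__(self, board_size=19, channels=64):
            super().__init__()
            self.conv1 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            flat = channels * board_size * board_size
            # policy head: one score per board point (the pass move is ignored here)
            self.policy_head = nn.Linear(flat, board_size * board_size)
            # value head: a single scalar in [-1, 1]
            self.value_head = nn.Linear(flat, 1)

        def forward(self, board):                      # board: (batch, 1, 19, 19)
            x = F.relu(self.conv1(board))
            x = F.relu(self.conv2(x))
            x = x.flatten(start_dim=1)
            p = F.softmax(self.policy_head(x), dim=1)  # probability distribution over moves
            v = torch.tanh(self.value_head(x))         # estimated chance of winning
            return p, v

    # usage sketch: p, v = PolicyValueNet()(torch.zeros(1, 1, 19, 19))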

        We can expand the search tree by simulating moves. This adds the corresponding move as an edge and the new board position as a node to the search tree.

        Let us introduce another term, the action-value function Q. It measures the value of a move from a given state. In (a) below, it plays the move marked in red, and f predicts a 0.9 chance of winning, so Q = 0.9. In (b), it plays one more move along the same branch and ends up with winning odds of 0.6. Move a₃ has now been visited twice, so its visit count is N = 2. The Q value is simply the average of the results so far: W = 0.9 + 0.6 = 1.5, Q = W/2 = 0.75. In (c), it explores another path. Now a₃ has been visited 3 times and Q = (0.9 + 0.6 + 0.3)/3 = 0.6.
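        The running average in this example can be checked with a few lines:

    # visit 1: f predicts 0.9            -> W = 0.9, N = 1, Q = 0.90
    # visit 2: a deeper search gives 0.6 -> W = 1.5, N = 2, Q = 0.75
    # visit 3: another path gives 0.3    -> W = 1.8, N = 3, Q = 0.60
    W, N = 0.0, 0
    for v in (0.9, 0.6, 0.3):
        W += v
        N += 1
        Q = W / N
        print(N, round(Q, 2))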

        In MCTS, we incrementally build a search tree with s₃ as the root. We add nodes one at a time, and after 1,600 searches we use the tree to build a local policy π for the next move. But Go has a huge search space, so we need to prioritize which node to add and search first.

        There are two factors to consider: exploitation and exploration.

  • Exploitation: perform more searches in branches that already look promising (i.e. high Q value).
  • Exploration: perform searches in branches we do not know much about yet (i.e. low visit count N).

        Mathematically, it chooses the move a based on:

        a = argmax_a ( Q(s, a) + u(s, a) )

        Q controls exploitation; u, the exploration bonus, controls exploration.

        For state-action pairs that are visited frequently or that the policy considers unlikely, there is little incentive to explore them further. Starting at the root, MCTS uses this tree policy to choose which path to search next.

        In the beginning, MCTS focuses more on exploration, but as the number of iterations grows, most of the searches become exploitation and Q becomes more and more accurate.

        AlphaGo Zero iterates the above steps 1,600 times to expand the tree.

        Surprisingly, it does not use Q to construct the local policy π. Instead, π₃ is derived from the visit count N.

After the initial iterations, moves with higher Q values will be visited more frequently. AlphaGo Zero computes the policy from visit counts because they are less prone to outliers:

        π(a | s₃) = N(s₃, a)^(1/τ) / Σ_b N(s₃, b)^(1/τ)

τ is the temperature controlling the level of exploration. When τ = 1, moves are selected with probability proportional to their visit counts. When τ → 0, only the move with the highest count is picked. So τ = 1 allows exploration, while τ → 0 does not.
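A short sketch of this visit-count policy with temperature τ (the function name visit_count_policy is only for illustration):

    import numpy as np

    def visit_count_policy(counts, tau=1.0):
        # counts: visit count N(s, a) for each child move of the root
        counts = np.asarray(counts, dtype=np.float64)
        if tau < 1e-3:                     # tau -> 0: pick only the most-visited move
            pi = np.zeros_like(counts)
            pi[np.argmax(counts)] = 1.0
            return pi
        scaled = counts ** (1.0 / tau)     # N(a)^(1/tau)
        return scaled / scaled.sum()

    print(visit_count_policy([10, 30, 60], tau=1.0))   # proportional to the counts
    print(visit_count_policy([10, 30, 60], tau=0.0))   # one-hot on the most-visited move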

Through these searched board positions, MCTS computes a more accurate policy π for deciding the next move.

        MCTS improves policy evaluation, and the new evaluations are used to improve the policy (policy improvement). The improved policy is then used to evaluate positions again. These repeated rounds of policy evaluation and policy improvement are called policy iteration in RL. After many games of self-play, both policy evaluation and policy improvement are refined to the point where they can beat human masters. See the AlphaGo Zero paper for details, in particular how f is trained.

3. Principle of MCTS

        Before going into detail, here are the 4 main steps in MCTS. We start at a board position s.

  • Step (a) Select the path (sequence of moves) to search further. Starting from the root node, it repeatedly picks the move with the highest combined score of the exploitation term Q (Q = W/N) and an exploration bonus that decreases with the visit count N. The search continues until it reaches an edge that has no node attached yet.

  • Step (b) Expand the search tree by adding the state (node) reached by the last move on the path. For every move available from the new node we add a new edge, and for each edge we record the prior p computed by f.
  • Step (c) Back up the result along the current path, updating W and N on every edge.

After repeating steps (a) to (c) 1,600 times, we create a new policy π₃ from the visit counts N. We sample from this policy to determine the next move from s₃.

Next, we will describe each step in detail.
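To make the descriptions concrete, the sketches below use a minimal node/edge structure; the class names and fields are illustrative, chosen to mirror the statistics (P, N, W, Q) described in the text.

    class Edge:
        # statistics the search keeps for every move (s, a)
        def __init__(self, prior):
            self.P = prior       # prior probability of the move, from the network f
            self.N = 0           # visit count
            self.W = 0.0         # total value backed up through this edge
            self.Q = 0.0         # mean value W / N
            self.child = None    # Node reached by this move, once expanded

    class Node:
        # a board position; edges maps each legal move to its Edge statistics
        def __init__(self, state):
            self.state = state
            self.edges = {}      # move -> Edge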

Selection

The first step is to select a path from the tree for further searching. Suppose our search tree looks like this.

It starts at the root, and we choose the move according to the following equation:

        a = argmax_a ( Q(s, a) + u(s, a) ),   where  u(s, a) = c · P(s, a) · √(Σ_b N(s, b)) / (1 + N(s, a))

        Here u controls exploration. It depends on the visit counts, on P(s, a), the policy from f giving how likely we are to choose a in state s, and on c, a hyperparameter controlling the level of exploration. Intuitively, the less frequently we have explored an edge, the less information we have about it, and thus the higher the bonus for exploring it. Q controls exploitation and is computed as the mean action value (Q = W/N).
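Using the Edge/Node sketch above, the selection rule could be written like this (c_puct plays the role of the constant c):

    import math

    def select_move(node, c_puct=1.0):
        # total number of visits over all edges of this node
        total_n = sum(e.N for e in node.edges.values())
        best_move, best_score = None, -float("inf")
        for move, e in node.edges.items():
            u = c_puct * e.P * math.sqrt(total_n) / (1 + e.N)   # exploration bonus
            score = e.Q + u                                     # exploitation + exploration
            if score > best_score:
                best_move, best_score = move, score
        return best_move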

Expansion and Evaluation

        Once the selected path is determined, a leaf node is added to the search tree for the last move on the path. The policy p and the value v for the added node are computed using the deep network f.

        Then, for each possible move from the new node, we add a new edge (s, a). We initialize the visit count N, the total value W and the mean value Q to 0 for each edge, and record the corresponding v and p.
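        A sketch of this expansion step, assuming a hypothetical legal_moves(state) helper and a network f(state) that returns the prior p (a move-to-probability mapping) together with the value v:

    def expand(leaf_edge, state, f):
        # evaluate the new position with the network and attach its node to the tree
        p, v = f(state)
        node = Node(state)
        for move in legal_moves(state):              # hypothetical rules helper
            node.edges[move] = Edge(prior=p[move])   # N = W = Q = 0, P = p(move)
        leaf_edge.child = node
        return v                                     # v is backed up along the selected path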

Backup

Once p and v are computed for the leaf node, we back up v to update the action value Q of every edge (s, a) on the chosen path.
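A sketch of the backup over the selected path (the list of edges from the root down to the new leaf). Flipping the sign of v at each level is one common convention for alternating players and is an assumption of this sketch:

    def backup(path, v):
        # path: edges traversed from the root down to the newly expanded leaf
        for edge in reversed(path):
            edge.N += 1
            edge.W += v
            edge.Q = edge.W / edge.N
            v = -v   # a good outcome for one player is a bad outcome for the opponent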

Play

        To decide the next move, AlphaGo Zero creates a new local policy π based on the visit count of each child of s₃. It then samples from this policy to take the next step.
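        As a final sketch, sampling the actual move from π, reusing the visit_count_policy function from earlier:

    import numpy as np

    def play(root, tau=1.0):
        moves = list(root.edges.keys())
        counts = [root.edges[m].N for m in moves]
        pi = visit_count_policy(counts, tau)          # local policy built from visit counts
        move = moves[np.random.choice(len(moves), p=pi)]
        new_root = root.edges[move].child             # keep this subtree for the next search
        return move, new_root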

        The node reached by the selected move a becomes the new root of the search tree; all of its children are kept while the rest of the tree is discarded. MCTS is then repeated for the next move.

4. Postscript

        Regarding the Monte Carlo algorithm, we will publish more articles for further discussion.
