Monte Carlo Tree Search (MCTS)

1. Introduction

  • Monte Carlo tree search (MCTS) is a general name for a class of tree search algorithms. It is a heuristic search algorithm used in decision-making processes, and it is particularly effective in games with large search spaces.
  • At a high level, the goal of Monte Carlo tree search is to choose the best next move from a given game state.
  • Common applications include AlphaGo and other Go and chess AI programs.
  • Algorithm process:
    • Selection
      • Starting from the root, repeatedly select the child node with the largest UCB value.
    • Expansion
      • Create one or more child nodes.
    • Simulation (Rollout)
      • Play the game out from a node with a random strategy; this is also called a playout or rollout.
    • Backpropagation
      • Use the result of the random playout to update the nodes on the path back to the root.
  • [Figure: flow chart of the four steps above]

2. Selection

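  • Calculate the UCB (upper confidence bound) value of each child node, then select the child with the largest UCB value and continue the iteration from it.
  • The UCB value of a child node S_i is calculated as:
    • $UCB(S_i) = \bar{V}_i + c \sqrt{\frac{\ln N}{N_i}}$
    • Here $\bar{V}_i$ is the average value of S_i, N_i is the number of visits to S_i, N is the number of visits to its parent node, and c is a constant that weights exploration (for example, c = 2).
    • [Note] Average value = value / number of explorations, that is, $\bar{V}_i = V_i / N_i$.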
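
As a minimal sketch, the UCB score can be written as a small Python function. The function name `ucb`, its parameter names, and the default `c = 2.0` are assumptions for illustration, not part of the original article. An unvisited node is given an infinite score so that it is always tried first.

```python
import math

def ucb(total_value: float, visits: int, parent_visits: int, c: float = 2.0) -> float:
    """UCB score of a child node (c = 2.0 is an assumed exploration constant)."""
    if visits == 0:
        # An unvisited child always wins the selection step.
        return math.inf
    average = total_value / visits  # average value = value / number of visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return average + exploration
```

For example, `ucb(20, 2, 4)` is about 11.67 and `ucb(10, 1, 4)` is about 12.35, so with equal averages the less-visited child scores higher.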

3. Expansion

  • If the selected node is a leaf node that has already been visited, all of its possible actions are added to the tree as child nodes.
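
The expansion step could look like the following sketch. The `Node` class and the `legal_actions` and `apply_action` callbacks are hypothetical names introduced here; a real game would supply its own state type and move generator.

```python
class Node:
    """One tree node: V (total value) and N (visit count) start at 0."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.value = 0.0
        self.visits = 0

def expand(node, legal_actions, apply_action):
    """Add one child per legal action of node.state and return the children."""
    for action in legal_actions(node.state):
        child_state = apply_action(node.state, action)
        node.children.append(Node(child_state, parent=node))
    return node.children

# Toy usage (hypothetical game): the state is a counter, the actions add 1 or 2.
root = Node(0)
children = expand(root, lambda s: [1, 2], lambda s, a: s + a)
```

Every new child starts with zero visits, so its UCB value is infinite until it has been rolled out once.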

4. Simulation

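  • Run a simulation (rollout) from a leaf node that has not been visited yet.
  • Starting from the selected node, make random decisions until the game ends; the outcome is the value of the simulation.
  • Pseudocode:
    • Rollout(S): if S is a terminal state, return value(S); otherwise pick a random available action A, set S = simulate(A, S), and repeat.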
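
A runnable version of the random rollout, as a minimal sketch in Python. The callbacks `is_terminal`, `legal_actions`, `apply_action` and `value`, as well as the toy counting game, are assumptions made for illustration only.

```python
import random

def rollout(state, is_terminal, legal_actions, apply_action, value, rng=random):
    """Play random actions from `state` until the game ends; return the final value."""
    while not is_terminal(state):
        action = rng.choice(legal_actions(state))
        state = apply_action(state, action)
    return value(state)

# Toy game (hypothetical): add 1 or 2 to a counter; the game ends at 10 or more.
# Landing exactly on 10 is worth 1, overshooting is worth 0.
result = rollout(
    state=0,
    is_terminal=lambda s: s >= 10,
    legal_actions=lambda s: [1, 2],
    apply_action=lambda s, a: s + a,
    value=lambda s: 1 if s == 10 else 0,
    rng=random.Random(0),  # seeded so the run is repeatable
)
```

Passing a seeded `random.Random` instance keeps the sketch reproducible; in a real search the default module-level generator is usually enough.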

5. Backpropagation

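  • After the rollout value is obtained, it is propagated back up the tree: every node on the path from the simulated node to the root adds the value to its total and increases its visit count by 1.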
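
Backpropagation amounts to walking the parent links up to the root. The sketch below assumes a hypothetical `Node` class with `parent`, `value` and `visits` fields; it adds the same value to every node on the path, which matches a single-agent setting.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.value = 0.0   # V: sum of rollout values seen through this node
        self.visits = 0    # N: number of visits

def backpropagate(node, result):
    """Add `result` to every node from `node` up to the root."""
    while node is not None:
        node.visits += 1
        node.value += result
        node = node.parent

# Usage: a rollout worth 10 from the root, then a rollout worth 20 from a child.
s0 = Node()
s1 = Node(parent=s0)
backpropagate(s0, 10)
backpropagate(s1, 20)
```

After these two calls the root holds V = 30, N = 2 and the child holds V = 20, N = 1.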

6. Termination conditions

  • The search stops when either of the following is reached:
    • A given time limit
    • A given fixed number of iterations
  • After the iterations are complete, select the child of the root with the largest value as the decision.
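
Both stopping conditions can be combined in one loop, as in the sketch below. The names `run_with_budget`, `one_iteration` and `best_action` are hypothetical, and reading "largest value" as the highest average value is an assumption made here.

```python
import time

def run_with_budget(one_iteration, max_iterations=None, time_limit_s=None):
    """Call one_iteration() until the iteration count or the time limit is reached."""
    if max_iterations is None and time_limit_s is None:
        raise ValueError("need at least one stopping condition")
    deadline = None if time_limit_s is None else time.monotonic() + time_limit_s
    done = 0
    while True:
        if max_iterations is not None and done >= max_iterations:
            break
        if deadline is not None and time.monotonic() >= deadline:
            break
        one_iteration()
        done += 1
    return done

def best_action(children):
    """children: list of (action, total_value, visits) tuples for the root's children.

    Picks the highest average value among visited children (assumed criterion);
    raises ValueError if no child has been visited.
    """
    visited = [c for c in children if c[2] > 0]
    return max(visited, key=lambda c: c[1] / c[2])[0]
```

For instance, `best_action([("first", 20, 2), ("second", 25, 2)])` returns `"second"`.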

7. Simulation process

  • We start with a root node S_0, which has two parameters: V_0 (total value) and N_0 (number of visits), both initially 0.
  • First iteration. S_0 is a leaf node and has not been visited, so we perform a rollout from it. Assume the rollout returns a value of 10; backpropagating it gives V_0 = 10 and N_0 = 1.
  • Second iteration. S_0 is still a leaf node, but it has already been visited, so we enumerate all possible actions of the current node and add them to the tree. Suppose S_0 has two actions, leading to S_1 and S_2.
  • Because the visit counts and values of S_1 and S_2 are both 0, their UCB values are both infinite, so either can be chosen. We select S_1 first. S_1 is a leaf node that has not been visited, so we perform a rollout from it.
  • Assume the rollout returns 20; backpropagation then updates the parameters of S_1 and S_0 (V_1 = 20, N_1 = 1; V_0 = 30, N_0 = 2).
  • [Note] Each iteration needs to start from the root node.
  • Third iteration. S_0 is no longer a leaf node, so we select the child with the largest UCB value. S_2 has 0 visits, so its UCB value is infinite and S_2 is selected. S_2 is a leaf node that has not been visited, so we perform a rollout from it. Assume the rollout returns 10; backpropagation then updates the parameters of S_2 and S_0 (V_2 = 10, N_2 = 1; V_0 = 40, N_0 = 3).
  • Fourth iteration. S_0 is not a leaf node and both of its children have been visited, so we calculate the UCB values of S_1 and S_2.
  • Both children have been visited once, so their exploration terms are equal, and S_1 has the larger average value (20 versus 10). The UCB value of S_1 is therefore larger, and S_1 is selected. S_1 is a leaf node, but it has already been visited, so all possible actions of the current node are enumerated and added to the tree. Suppose S_1 has two actions, leading to S_3 and S_4.
  • Fifth iteration. Because the visit counts of S_3 and S_4 are both 0, their UCB values are both infinite, so we again take the first new node, S_3, as the current node and perform a rollout from it. Assume the rollout returns 0; backpropagation then updates the parameters of S_3, S_1 and S_0 (V_3 = 0, N_3 = 1; V_1 = 20, N_1 = 2; V_0 = 40, N_0 = 4).
  • Sixth iteration. We again start from S_0, which is not a leaf node, and choose between S_1 and S_2 according to the UCB formula. Note that the first term of the UCB formula is the average value: S_1 has been visited twice, so its average value is 20/2 = 10. With N = N_0 = 4, and assuming an exploration constant of c = 2, the UCB values of S_1 and S_2 are:
    • $UCB(S_1) = \frac{20}{2} + 2\sqrt{\frac{\ln 4}{2}} \approx 11.67$
    • $UCB(S_2) = \frac{10}{1} + 2\sqrt{\frac{\ln 4}{1}} \approx 12.35$
  • So S_2 is selected next. S_2 is a leaf node that has already been visited, so all of its actions are enumerated and added to the tree. Assume S_2 has two actions, leading to S_5 and S_6.
  • Seventh iteration. S_5 is selected for the rollout, for the same reason as S_3 earlier: both new nodes have infinite UCB values, so the first one is taken. Assume the rollout returns 15; backpropagation then updates the parameters of S_5, S_2 and S_0 (V_5 = 15, N_5 = 1; V_2 = 25, N_2 = 2; V_0 = 55, N_0 = 5).
  • If we stop iterating now, we choose between S_1 and S_2 based on the results. S_2 has the greater value (V_2 = 25 versus V_1 = 20), so choosing S_2, that is, taking the second action, is currently the best decision in this tree.
  • [Note] One point about the UCB formula is worth noting.
    • The larger the average value $\bar{V}_i$ of a node, the larger its UCB value, and the more likely this path is to be chosen. A higher average value means we are more willing to explore the node. But what if the formula used only the average value? For example, suppose the UCB formula were changed to:
    • $UCB(S_i) = \bar{V}_i$
    • This would not work: nodes that have never been explored would never be explored, which is why the formula has a second term. In particular, when N_i is equal to 0 the UCB value is infinite, so a node that has not been visited is guaranteed to be explored. As N and N_i change, the UCB value changes accordingly. In short, the UCB formula not only ensures that every branch keeps a chance of being visited again, but also steers the search toward the paths with greater value, so that we can play the whole game better.
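
The walk-through in this section can be replayed in code. The sketch below is an illustration, not the article's implementation: it assumes every node has exactly two actions, uses c = 2 as the exploration constant, and replaces random rollouts with the scripted values 10, 20, 10, 0, 15 from the text. It loops once per rollout, so the seven iterations described above correspond to five passes (the text counts the expansions of S_1 and S_2 as iterations of their own).

```python
import math

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children = []
        self.value = 0.0   # V: total value
        self.visits = 0    # N: number of visits

def ucb(node, c=2.0):
    if node.visits == 0:
        return math.inf
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return node.value / node.visits + explore

def replay(rollout_values, c=2.0):
    """Run one MCTS pass per scripted rollout value; return all nodes by name."""
    root = Node("S_0")
    nodes = {"S_0": root}
    for result in rollout_values:
        node = root
        # Selection: descend by largest UCB (ties go to the first child).
        while node.children:
            node = max(node.children, key=lambda ch: ucb(ch, c))
        # Expansion: a visited leaf gets two children (assumed branching factor).
        if node.visits > 0:
            for _ in range(2):
                child = Node(f"S_{len(nodes)}", parent=node)
                node.children.append(child)
                nodes[child.name] = child
            node = node.children[0]
        # Simulation is scripted here: `result` stands in for a random rollout.
        # Backpropagation: update every node up to the root.
        while node is not None:
            node.visits += 1
            node.value += result
            node = node.parent
    return nodes

tree = replay([10, 20, 10, 0, 15])
```

After the replay, `tree["S_1"]` holds V = 20, N = 2 and `tree["S_2"]` holds V = 25, N = 2, which agrees with the conclusion that S_2 is the better choice at this point.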

Origin blog.csdn.net/weixin_45100742/article/details/134434858