Reinforcement Learning: An Introduction study notes (5)

Introduction to Reinforcement Learning

1.5 Extended example: tic-tac-toe

https://blog.csdn.net/thousandsofwind/article/details/79745086

(Note to readers: I have tried many times, but there is no way to post the full text here.)

To illustrate the general idea of reinforcement learning and to contrast it with other approaches, we next consider a single example in more detail.

Consider the familiar children's game of tic-tac-toe. Two players take turns on a three-by-three board; one plays Xs and the other plays Os. A player wins by placing three marks in a row, horizontally, vertically, or diagonally; if the board fills up with neither player getting three in a row, the game is a draw. Let us assume that we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. Moreover, let us consider draws and losses to be equally bad for us. How can we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?
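As a concrete starting point, here is a minimal sketch (my own, not code from the book) of how the board and the win/draw test might be represented in Python; the board is assumed to be a tuple of nine cells holding 'X', 'O', or ' ', and names like winner and is_draw are illustrative only.

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if that player has three marks in a line, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def is_draw(board):
    """A full board with no winner is a draw."""
    return winner(board) is None and ' ' not in board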

Although this is a simple problem, it cannot readily be solved in a satisfactory way by classical techniques. For example, the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent. For instance, a minimax player would never reach a game state from which it could lose, even if in fact it always won from that state because of incorrect play by the opponent. Classical optimization methods for sequential decision problems, such as dynamic programming, can compute an optimal solution for any opponent, but they require as input a complete specification of that opponent, including the probabilities with which the opponent makes each move in every board state. Let us assume that this prior information is not available for the problem, as it is not for the vast majority of problems of practical interest. On the other hand, such information can be estimated from experience, in this case by playing many games against the opponent. About the best one can do is first learn a model of the opponent's behavior, up to some level of confidence, and then apply dynamic programming to compute an approximately optimal solution given that model. In the end, this is not so different from some of the reinforcement learning methods we examine later in this book.

An evolutionary method applied to this problem would directly search the space of possible policies for one with a high probability of winning against the opponent. Here, a policy is a rule that tells the player what move to make in every state of the game, that is, for every possible configuration of Xs and Os on the three-by-three board. For each policy considered, an estimate of its winning probability would be obtained by playing some number of games against the opponent. This evaluation would then direct which policy or policies are considered next. A typical evolutionary method would hill-climb in policy space, successively generating and evaluating policies in an attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm could be used to maintain and evaluate a population of policies. In fact, hundreds of different optimization methods could be applied.
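The key point is that an evolutionary method only ever evaluates whole policies by the outcomes of complete games. The following rough sketch of the hill-climbing variant is my own illustration, not code from the book: play_game(policy) is an assumed helper that returns 1 for a win and 0 otherwise, and mutate is an assumed helper that makes a small random change to a policy.

def estimate_win_rate(policy, play_game, n_games=100):
    """Score a policy by the fraction of complete games it wins against the opponent."""
    return sum(play_game(policy) for _ in range(n_games)) / n_games

def hill_climb(policy, mutate, play_game, n_steps=50):
    """Keep a perturbed policy only when its estimated win rate improves."""
    best = policy
    best_score = estimate_win_rate(best, play_game)
    for _ in range(n_steps):
        candidate = mutate(best)                         # small random change to the policy
        score = estimate_win_rate(candidate, play_game)  # evaluated only via whole games
        if score > best_score:
            best, best_score = candidate, score
    return best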

Here, in contrast, is how the tic-tac-toe problem would be approached with a method making use of a value function. First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function. State A is considered better than state B, or has higher value, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row, or that are filled up, the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning.
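A minimal sketch of this value table, under my own representation assumptions (the board as a tuple of nine cells, reusing the winner and is_draw helpers sketched earlier; this is not the book's code):

def initial_value(board):
    """Initial estimate of the probability that we (playing X) win from this state."""
    if winner(board) == 'X':
        return 1.0                      # we have already won
    if winner(board) == 'O' or is_draw(board):
        return 0.0                      # we have lost or drawn; we can no longer win
    return 0.5                          # otherwise, guess an even chance of winning

values = {}                             # the learned value function: state -> estimated win probability

def V(board):
    """Look up the value of a state, creating its initial estimate on first use."""
    if board not in values:
        values[board] = initial_value(board)
    return values[board]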

We then play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves (one for each blank space on the board) and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with the greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see. A sequence of moves made and considered during a game can be diagrammed as in Figure 1.1.
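One plausible way to code this move selection, as an illustrative sketch only (the text does not specify the exploration scheme; epsilon, place, and choose_move are my own names, and V is the lookup from the sketch above):

import random

def place(board, i, mark='X'):
    """Return a new board tuple with `mark` placed in empty cell i."""
    return board[:i] + (mark,) + board[i + 1:]

def choose_move(board, epsilon=0.1):
    """Return (cell_index, is_exploratory) for our player, X."""
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    if random.random() < epsilon:
        return random.choice(moves), True               # occasional exploratory move
    best = max(moves, key=lambda i: V(place(board, i)))
    return best, False                                  # greedy move: highest-valued resulting state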

Figure 1.1: A sequence of tic-tac-toe moves. The solid lines represent the moves taken during a game; the dashed lines represent moves that we (our reinforcement learning player) considered but did not make. Our second move was an exploratory move, meaning that it was taken even though another sibling move, the one leading to e*, was ranked higher. Exploratory moves do not result in any learning, but each of our other moves does, causing updates as suggested by the curved arrows, in which estimated values are moved up the tree from later states to earlier states, as detailed in the text.

While we are playing, we keep changing the values of the states we find ourselves in during the game, so that these values become more accurate estimates of the probability of winning. To do this, after each greedy move we "back up" the value of the state after the move to the state before the move. More precisely, the current value of the earlier state is updated to be closer to the value of the later state. This can be done by moving the earlier state's value a fraction of the way toward the value of the later state. If we let s denote the state before the greedy move, and s′ the state after the move, then the update to the estimated value of s, denoted V(s), can be written as

V(s) ← V(s) + α[V(s′) − V(s)],

where α is a small positive fraction called the step-size parameter, which influences the rate of learning.
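In code, assuming the values table and V lookup sketched above, the backup after each greedy move might look like this (again an illustrative sketch, not the book's implementation):

ALPHA = 0.1    # step-size parameter: a small positive fraction controlling the rate of learning

def backup(prev_board, next_board):
    """Move V(s) a fraction of the way toward V(s'): a temporal-difference-style update."""
    values[prev_board] = V(prev_board) + ALPHA * (V(next_board) - V(prev_board))

Calling backup(s, s′) after every greedy move, over many games, gradually shifts each state's estimated value toward the values of the states that actually follow it.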
