RL (Chapter 1): Tic-Tac-Toe

This post is a set of reinforcement learning notes, based mainly on the following material:

An Extended Example: Tic-Tac-Toe

Because a skilled player can play so as never to lose, let us assume that we are playing against an imperfect player. For the moment, in fact, let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent’s play and learn to maximize its chances of winning?

Although this is a simple problem, it cannot readily be solved in a satisfactory way through classical techniques. For example, the classical “minimax” solution is not correct here because it assumes a particular way of playing by the opponent.


Here is how the tic-tac-toe problem would be approached with a method making use of a value function.

First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function.

Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1. Similarly, for all states with three Os in a row, or that are filled up, the correct probability is 0. We set the initial values of all the other states to 0.5.


We play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see.
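
A minimal sketch of this epsilon-greedy selection rule (the names values, candidate_moves, and choose_move are illustrative and do not appear in the implementation later in this post):

import random

# Minimal sketch of epsilon-greedy move selection over a table of state values.
# `values` maps a state to its current estimated probability of winning;
# `candidate_moves` maps each legal move to the state that move would produce.
def choose_move(values, candidate_moves, epsilon=0.1):
    if random.random() < epsilon:
        # exploratory move: pick any legal move at random
        return random.choice(list(candidate_moves))
    # greedy move: pick the move whose resulting state has the highest value
    return max(candidate_moves, key=lambda move: values[candidate_moves[move]])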

[Figure 1.1: a sequence of tic-tac-toe moves; the arrows indicate the value backups described in the text.]

While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make them more accurate estimates of the probabilities of winning. To do this, we "back up" the value of the state after each greedy move to the state before the move, as suggested by the arrows in Figure 1.1. More precisely, the current value of the earlier state is adjusted to be closer to the value of the later state. This can be done by moving the earlier state's value a fraction of the way toward the value of the later state. If we let $s$ denote the state before the greedy move, and $s'$ the state after the move, then the update to the estimated value of $s$, denoted $V(s)$, can be written as

$$V(s) \leftarrow V(s) + \alpha \left[ V(s') - V(s) \right]$$
where $\alpha$ is a small positive fraction called the step-size parameter, which influences the rate of learning.

This update rule is an example of a temporal-difference learning method, so called because its changes are based on a difference between estimates at two different times.
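
In code, this backup is a one-line update; the following is a minimal sketch with illustrative names (values, td_backup), not the implementation given later in this post:

# Minimal sketch of the temporal-difference backup V(s) <- V(s) + alpha * (V(s') - V(s)).
# `values` maps each state to its current estimated probability of winning.
def td_backup(values, s, s_next, alpha=0.1):
    values[s] += alpha * (values[s_next] - values[s])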

The method described above performs quite well on this task. For example, if the step-size parameter is reduced properly over time, the states' values converge, for any fixed opponent, to the true probabilities of winning from each state given optimal play by our player. Furthermore, the moves then taken (except on exploratory moves) are in fact the optimal moves against the opponent. In other words, the method converges to an optimal policy for playing the game. If the step-size parameter is not reduced all the way to zero over time, then this player also plays well against opponents that slowly change their way of playing.
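
As an illustration only (the book does not prescribe a particular schedule), a step size that decays with the number of updates satisfies the usual conditions for convergence, while a small constant step size keeps the player adaptive to a changing opponent:

# Illustrative decaying step-size schedule: alpha_t = alpha_0 / (1 + t).
def step_size(t, alpha0=0.5):
    return alpha0 / (1.0 + t)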

It is a striking feature of the reinforcement learning solution that it can achieve the effects of planning and lookahead without using a model of the opponent and without conducting an explicit search over possible sequences of future states and actions.

Code (Python)

The code is from GitHub; the explanatory comments are my own additions, so mistakes are hard to avoid and corrections are welcome!

#######################################################################
# Copyright (C)                                                       #
# 2016 - 2018 Shangtong Zhang([email protected])           #
# 2016 Jan Hakenberg([email protected])                         #
# 2016 Tian Jun([email protected])                                #
# 2016 Kenta Shimada([email protected])                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################
import numpy as np
import pickle

BOARD_ROWS = 3
BOARD_COLS = 3
BOARD_SIZE = BOARD_ROWS * BOARD_COLS

State

This class maintains the board state and determines whether the game has been won, lost, or drawn.

class State:
    def __init__(self):
        # the board is represented by an n * n array,
        # 1 represents a chessman of the player who moves first,
        # -1 represents a chessman of another player
        # 0 represents an empty position
        self.data = np.zeros((BOARD_ROWS, BOARD_COLS))  # the board
        self.winner = None    # winner of the current game
        self.hash_val = None  # hash value of the current board
        self.end = None       # whether the current game has ended

    # compute the hash value for one state; it's unique
    # maps each board configuration to an integer (a base-3 encoding of the nine cells)
    def hash(self):
        if self.hash_val is None:
            self.hash_val = 0
            for i in np.nditer(self.data):  # np.nditer iterates over every cell of the array
                self.hash_val = self.hash_val * 3 + i + 1
        return self.hash_val

    # check whether a player has won the game, or it's a tie
    # computed only once per state; the result is cached so later checks are simple lookups
    def is_end(self):
        if self.end is not None:
            return self.end
        results = []
        # check row
        for i in range(BOARD_ROWS):
            results.append(np.sum(self.data[i, :]))
        # check columns
        for i in range(BOARD_COLS):
            results.append(np.sum(self.data[:, i]))

        # check diagonals
        trace = 0
        reverse_trace = 0
        for i in range(BOARD_ROWS):
            trace += self.data[i, i]
            reverse_trace += self.data[i, BOARD_ROWS - 1 - i]
        results.append(trace)
        results.append(reverse_trace)

        for result in results:
            if result == 3:
                self.winner = 1
                self.end = True
                return self.end
            if result == -3:
                self.winner = -1
                self.end = True
                return self.end

        # whether it's a tie
        sum_values = np.sum(np.abs(self.data))
        if sum_values == BOARD_SIZE:
            self.winner = 0
            self.end = True
            return self.end

        # game is still going on
        self.end = False
        return self.end

    # @symbol: 1 or -1
    # put chessman symbol in position (i, j)
    def next_state(self, i, j, symbol):
        new_state = State()
        new_state.data = np.copy(self.data)
        new_state.data[i, j] = symbol
        return new_state

    # print the board
    def print_state(self):
        for i in range(BOARD_ROWS):
            print('-------------')
            out = '| '
            for j in range(BOARD_COLS):
                if self.data[i, j] == 1:
                    token = '*'
                elif self.data[i, j] == -1:
                    token = 'x'
                else:
                    token = '0'
                out += token + ' | '
            print(out)
        print('-------------')
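
A quick, hypothetical check of this class (assuming the imports above; the move sequence is not a legal game, it only exercises the methods):

# Hypothetical usage of State: fill the top row with 1s and inspect the result.
s = State()
for j in range(BOARD_COLS):
    s = s.next_state(0, j, 1)
print(s.is_end())  # True: three 1s in a row
print(s.winner)    # 1
print(s.hash())    # base-3 encoding of the nine cells, unique per board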

The following code enumerates every possible board position up front, computes the information for each one (whether the game has ended, who the winner is, and so on), and caches it in the dictionary all_states, so that the information for any position can later be retrieved directly by its hash value:

def get_all_states_impl(current_state, current_symbol, all_states):
    for i in range(BOARD_ROWS):
        for j in range(BOARD_COLS):
            if current_state.data[i][j] == 0:
                new_state = current_state.next_state(i, j, current_symbol)
                new_hash = new_state.hash()
                if new_hash not in all_states:
                    is_end = new_state.is_end()
                    all_states[new_hash] = (new_state, is_end)
                    if not is_end:
                        get_all_states_impl(new_state, -current_symbol, all_states)  # the opponent moves next


def get_all_states():
    current_symbol = 1  # 1 denotes the first player
    current_state = State()
    all_states = dict()
    all_states[current_state.hash()] = (current_state, current_state.is_end())
    get_all_states_impl(current_state, current_symbol, all_states)
    return all_states


# all possible board configurations
all_states = get_all_states()
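
As a quick, illustrative sanity check, the dictionary should contain every reachable position, including the empty board; the commonly cited count of legal tic-tac-toe positions is 5478:

# Illustrative check on the enumeration above.
print(len(all_states))  # expected to be 5478, counting the empty board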

HumanPlayer

# human interface
# input a number to put a chessman
# | q | w | e |
# | a | s | d |
# | z | x | c |
class HumanPlayer:
    def __init__(self, **kwargs):
        self.symbol = None  # 1 if moving first, -1 if moving second
        self.keys = ['q', 'w', 'e', 'a', 's', 'd', 'z', 'x', 'c']
        self.state = None

    def reset(self):
        pass

    def set_state(self, state):
        self.state = state

    def set_symbol(self, symbol):
        self.symbol = symbol

    def act(self):
        self.state.print_state()
        key = input("Input your position:")
        data = self.keys.index(key)
        i = data // BOARD_COLS
        j = data % BOARD_COLS
        return i, j, self.symbol

Player

# AI player
class Player:
    # @step_size: the step size to update estimations 
    # @epsilon: the probability to explore
    def __init__(self, step_size=0.1, epsilon=0.1):
        self.estimations = dict()   # value table: estimated winning probability for each state
        self.step_size = step_size  # step-size parameter controlling how far each backup moves the estimate
        self.epsilon = epsilon      # probability of making an exploratory move
        self.states = []            # states visited in the current game
        self.greedy = []            # greedy[i] is True if the move taken from states[i] was greedy
        self.symbol = 0             # 1 if moving first, -1 if moving second

    def reset(self):
        self.states = []
        self.greedy = []

    def set_state(self, state):
        self.states.append(state)
        self.greedy.append(True)

    def set_symbol(self, symbol):
        self.symbol = symbol
        # initialize the value table
        for hash_val in all_states:
            state, is_end = all_states[hash_val]
            if is_end:
                if state.winner == self.symbol:
                    self.estimations[hash_val] = 1.0
                elif state.winner == 0:
                    # we need to distinguish between a tie and a loss
                    self.estimations[hash_val] = 0.5
                else:
                    self.estimations[hash_val] = 0
            else:
                self.estimations[hash_val] = 0.5

    # update value estimation
    def backup(self):
        states = [state.hash() for state in self.states]
		
        # walk back through the states visited in this game; the factor self.greedy[i]
        # (True/False acts as 1/0) zeroes the update whenever the move taken from
        # states[i] was exploratory rather than greedy
        for i in reversed(range(len(states) - 1)):
            state = states[i]
            td_error = self.greedy[i] * (
                self.estimations[states[i + 1]] - self.estimations[state]
            )
            self.estimations[state] += self.step_size * td_error

    # choose an action based on the state
    def act(self):
        state = self.states[-1]  # the latest state of the current game
        next_states = []     # hash of the state that each legal move would produce
        next_positions = []  # every legal move position
        for i in range(BOARD_ROWS):
            for j in range(BOARD_COLS):
                if state.data[i, j] == 0:
                    next_positions.append([i, j])
                    next_states.append(state.next_state(
                        i, j, self.symbol).hash())

        if np.random.rand() < self.epsilon:  # exploratory move
            action = next_positions[np.random.randint(len(next_positions))]
            action.append(self.symbol)
            self.greedy[-1] = False  # mark this move as exploratory so backup() skips it
            return action

        # greedy move: choose the position whose resulting state has the highest estimated value
        values = []
        for hash_val, pos in zip(next_states, next_positions):
            values.append((self.estimations[hash_val], pos))
        # shuffle before sorting so that ties between equal-value moves are broken at random (Python's sort is stable)
        np.random.shuffle(values)
        values.sort(key=lambda x: x[0], reverse=True)
        action = values[0][1]  # position of the best move
        action.append(self.symbol)
        return action

    def save_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'wb') as f:
            pickle.dump(self.estimations, f)

    def load_policy(self):
        with open('policy_%s.bin' % ('first' if self.symbol == 1 else 'second'), 'rb') as f:
            self.estimations = pickle.load(f)

Judger

class Judger:
    # @player1: the player who will move first, its chessman will be 1
    # @player2: another player with a chessman -1
    def __init__(self, player1, player2):
        self.p1 = player1
        self.p2 = player2
        self.current_player = None
        self.p1_symbol = 1
        self.p2_symbol = -1
        self.p1.set_symbol(self.p1_symbol)  # assign symbols; for an AI player this also initializes its value table
        self.p2.set_symbol(self.p2_symbol)
        self.current_state = State()  # note: this attribute does not appear to be used

    def reset(self):
        self.p1.reset()
        self.p2.reset()

    def alternate(self):  # the two players take turns
        while True:
            yield self.p1
            yield self.p2

    # @print_state: if True, print each board during the game
    def play(self, print_state=False):
        alternator = self.alternate()  # generator alternating between the two players
        self.reset()  # clear the players' per-game records
        current_state = State()  # start from an empty board
        self.p1.set_state(current_state)
        self.p2.set_state(current_state)
        if print_state:
            current_state.print_state()
        while True:
            player = next(alternator)  # next player to move
            i, j, symbol = player.act()  # the player chooses a move
            next_state_hash = current_state.next_state(i, j, symbol).hash()  # hash of the board after the move
            current_state, is_end = all_states[next_state_hash]  # look up the cached state and end-of-game flag
            self.p1.set_state(current_state)  # let both players record the new state
            self.p2.set_state(current_state)
            if print_state:
                current_state.print_state()
            if is_end:
                return current_state.winner

Training and Playing

def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)  # training uses self-play
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        player1.backup()
        player2.backup()
        judger.reset()
    player1.save_policy()
    player2.save_policy()

# evaluate the trained AI (self-play without exploration)
def compete(turns):
    player1 = Player(epsilon=0)
    player2 = Player(epsilon=0)
    judger = Judger(player1, player2)
    player1.load_policy()
    player2.load_policy()
    player1_win = 0.0
    player2_win = 0.0
    for _ in range(turns):
        winner = judger.play()
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        judger.reset()
    print('%d turns, player 1 win %.02f, player 2 win %.02f' % (turns, player1_win / turns, player2_win / turns))


# The game is a zero sum game. If both players are playing with an optimal strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)  # no exploration when playing against a human
        judger = Judger(player1, player2)
        player2.load_policy()
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")


if __name__ == '__main__':
    train(int(1e5))
    compete(int(1e3))
    play()

Reposted from blog.csdn.net/weixin_42437114/article/details/109278051