Description of the Cliff-Walking problem

[Figure: the Cliff-Walking grid, with start S, goal G, and the cliff in gray]

Cliff walking: the agent walks from S to G; the gray cells are the cliff, which must not be entered.
In this model, falling off the cliff gives a reward of -100, reaching G gives a reward of 10, staying in place (bumping into a wall) gives -1, and reaching any other non-terminal cell gives 0 (this differs from the schematic in the figure, but the difference is minor). Using Sarsa, an on-policy method, and Q-learning, an off-policy method, the safe path and the optimal path are obtained after 20,000 training episodes. Finally, the final policy is read off from the Q values, reproducing the figure above.

Comparison of Sarsa and Q-Learning Algorithms

Sarsa Algorithm
[Figure: Sarsa algorithm]
Q-Learning Algorithm

[Figure: Q-Learning algorithm]

The first thing to introduce is the ε-greedy algorithm. ε is usually set to a small value between 0 and 1 (such as 0.2). At each step the computer generates a pseudo-random number: when it is less than ε, an action is chosen uniformly at random; otherwise, the action with the highest Q value is taken.
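The ε-greedy rule just described can be sketched as a small standalone function. This is only an illustration: the name `epsilon_greedy` and the use of `np.random.default_rng` are assumptions made here, not part of the code later in this post (which uses the global `np.random` functions inside `choose_action`).

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index from a 1-D array of Q values (hypothetical helper)."""
    if rng.random() < epsilon:
        # explore: every action is equally likely
        return int(rng.integers(len(q_values)))
    # exploit: pick a best action, breaking ties at random
    best = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best))


rng = np.random.default_rng(0)
q = np.array([0.0, 1.5, 1.5, -2.0])
greedy_pick = epsilon_greedy(q, epsilon=0.0, rng=rng)  # never explores
random_pick = epsilon_greedy(q, epsilon=1.0, rng=rng)  # always explores
```

With `epsilon=0.0` the function can only return one of the tied best actions (index 1 or 2 here); with `epsilon=1.0` any of the four actions may come back.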

With the two algorithms and ε-greedy introduced, the difference in a nutshell: Sarsa chooses the action a in the current state s with ε-greedy, and also chooses a' in the next state s' with ε-greedy. Q-Learning chooses a the same way as Sarsa, but for s' it directly uses the action with the largest Q value.
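To make the difference concrete, here is a minimal sketch of the two update targets. Both methods then apply the same step, Q(s, a) ← Q(s, a) + α (target − Q(s, a)). The function names and the tiny Q table are made up for this illustration.

```python
import numpy as np


def sarsa_target(qtable, reward, next_state, next_action, gamma):
    # on-policy: bootstrap from the action a' that the ε-greedy policy picked in s'
    return reward + gamma * qtable[next_state, next_action]


def q_learning_target(qtable, reward, next_state, gamma):
    # off-policy: bootstrap from the best action in s', whatever is executed next
    return reward + gamma * qtable[next_state].max()


# toy table: 2 states, 2 actions (illustrative values only)
qtable = np.array([[0.0, 0.0],
                   [1.0, 3.0]])

# suppose ε-greedy picked the exploratory action 0 in next state 1
sarsa_t = sarsa_target(qtable, reward=0.0, next_state=1, next_action=0, gamma=0.9)
q_t = q_learning_target(qtable, reward=0.0, next_state=1, gamma=0.9)
```

When the ε-greedy choice in s' happens to be exploratory, Sarsa's target reflects that worse action while Q-Learning's does not. That is why Sarsa tends to learn the safer path away from the cliff.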

Code

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches  # shape classes (Rectangle, etc.)

np.random.seed(2022)


class Agent():
    terminal_state = np.arange(36, 48)  # terminal states: the whole bottom row (start, cliff and goal)

    def __init__(self, board_rows, board_cols, actions_num, epsilon=0.2, gamma=0.9, alpha=0.1):
        self.board_rows = board_rows
        self.board_cols = board_cols
        self.states_num = board_rows * board_cols
        self.actions_num = actions_num
        self.epsilon = epsilon
        self.gamma = gamma
        self.alpha = alpha
        self.board = self.create_board()
        self.rewards = self.create_rewards()
        self.qtable = self.create_qtable()

    def create_board(self):  # build the board: 1 = goal, -1 = cliff, 0 = ordinary cell
        board = np.zeros((self.board_rows, self.board_cols))
        board[3][11] = 1
        board[3][1:11] = -1
        return board

    def create_rewards(self):  # build the reward table
        rewards = np.zeros((self.board_rows, self.board_cols))
        rewards[3][11] = 10
        rewards[3][1:11] = -100
        return rewards

    def create_qtable(self):  # build the Q table (states x actions), initialized to zero
        qtable = np.zeros((self.states_num, self.actions_num))
        return qtable

    def change_axis_to_state(self, axis):  # convert (row, col) coordinates to a state index
        return axis[0] * self.board_cols + axis[1]

    def change_state_to_axis(self, state):  # convert a state index to (row, col) coordinates
        return state // self.board_cols, state % self.board_cols

    def choose_action(self, state):  # choose an action; return it with the next state and reward
        if np.random.uniform(0, 1) <= self.epsilon:
            action = np.random.choice(self.actions_num)
        else:
            p = self.qtable[state, :]
            action = np.random.choice(np.where(p == p.max())[0])

        r, c = self.change_state_to_axis(state)
        new_r = r
        new_c = c

        flag = 0  # set to 1 when the move hits a wall and the state does not change

        if action == 0:  # up
            new_r = max(r - 1, 0)
            if new_r == r:
                flag = 1
        elif action == 1:  # down
            new_r = min(r + 1, self.board_rows - 1)
            if new_r == r:
                flag = 1
        elif action == 2:  # left
            new_c = max(c - 1, 0)
            if new_c == c:
                flag = 1
        elif action == 3:  # right
            new_c = min(c + 1, self.board_cols - 1)
            if new_c == c:
                flag = 1

        r = new_r
        c = new_c
        if flag:
            reward = -1 + self.rewards[r,c]
        else:
            reward = self.rewards[r, c]

        next_state = self.change_axis_to_state((r, c))
        return action, next_state, reward


    def learn(self, s, r, a, s_,sarsa_or_q):
        # s: state, a: action, r: immediate reward, s_: next state
        q_old = self.qtable[s, a]
        # row,col = self.change_state_to_axis(s_)
        done = False
        if s_ in self.terminal_state:
            q_new = r
            done = True
        else:
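            if sarsa_or_q == 0:
                # Sarsa: bootstrap from an ε-greedy action in s_
                # (sampled here for the update only; choose_action draws the executed action separately)
                if np.random.uniform(0, 1) <= self.epsilon:
                    s_a = np.random.choice(self.actions_num)
                    q_new = r + self.gamma * self.qtable[s_, s_a]
                else:
                    q_new = r + self.gamma * max(self.qtable[s_, :])
            else:
                # Q-learning: bootstrap from the best action in s_
                q_new = r + self.gamma * max(self.qtable[s_, :])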
                # print(q_new)
        self.qtable[s, a] += self.alpha * (q_new - q_old)
        return done


    def initilize(self):
        start_pos = (3, 0)  # start from the bottom-left corner
        self.cur_state = self.change_axis_to_state(start_pos)  # current state
        return self.cur_state


    def show(self,sarsa_or_q):
        fig_size = (12, 8)
        fig, ax0 = plt.subplots(1, 1, figsize=fig_size)
        a_shift = [(0, 0.3), (0, -.4),(-.3, 0),(0.4, 0)]
        ax0.axis('off')  # hide the axes
        # draw the grid lines
        for i in range(self.board_cols + 1):  # vertical lines, one per column boundary
            if i == 0 or i == self.board_cols:
                ax0.plot([i, i], [0, self.board_rows], color='black')
            else:
                ax0.plot([i, i], [0, self.board_rows], alpha=0.7,
                     color='grey', linestyle='dashed')

        for i in range(self.board_rows + 1):  # horizontal lines, one per row boundary
            if i == 0 or i == self.board_rows:
                ax0.plot([0, self.board_cols], [i, i], color='black')
            else:
                ax0.plot([0, self.board_cols], [i, i], alpha=0.7,
                         color='grey', linestyle='dashed')

        for i in range(self.board_rows):
            for j in range(self.board_cols):

                y = (self.board_rows - 1 - i)
                x = j

                if self.board[i, j] == -1:
                    rect = patches.Rectangle(
                        (x, y), 1, 1, edgecolor='none', facecolor='black', alpha=0.6)
                    ax0.add_patch(rect)
                elif self.board[i, j] == 1:
                    rect = patches.Rectangle(
                        (x, y), 1, 1, edgecolor='none', facecolor='red', alpha=0.6)
                    ax0.add_patch(rect)
                    ax0.text(x + 0.4, y + 0.5, "r = +10")

                else:
                    # qtable
                    s = self.change_axis_to_state((i, j))
                    qs = self.qtable[s, :]
                    for a in range(len(qs)):
                        dx, dy = a_shift[a]
                        c = 'k'
                        q = qs[a]
                        if q > 0:
                            c = 'r'
                        elif q < 0:
                            c = 'g'
                        ax0.text(x + dx + 0.3, y + dy + 0.5,
                                 "{:.1f}".format(qs[a]), c=c)

        if sarsa_or_q == 0:
            ax0.set_title("Sarsa")
        else:
            ax0.set_title("Q-learning")
        if sarsa_or_q == 0:
            plt.savefig("Sarsa")
        else:
            plt.savefig("Q-Learning")
        plt.show(block=False)
        plt.pause(5)
        plt.close()

Append the following driver code to run the program:

agent = Agent(4, 12, 4)
maxgen = 20000
gen = 1
sarsa_or_q = 0  # 0: Sarsa, 1: Q-learning
while gen < maxgen:
    current_state = agent.initilize()
    while True:
        action, next_state, reward = agent.choose_action(current_state)
        done = agent.learn(current_state, reward, action, next_state,sarsa_or_q)
        current_state = next_state
        if done:
            break

    gen += 1

agent.show(sarsa_or_q)
print(agent.qtable)

Set sarsa_or_q to 0 or 1 to view the results computed by Sarsa or Q-Learning, respectively.
The final converged policy can then be derived from the Q values.
[Figure: final policy derived from the Q values]
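Reading the policy off the Q values amounts to taking the best action in every state. A minimal sketch follows; the helper `greedy_policy` and the arrow symbols are assumptions made for illustration, using the same action order as the code above (0 up, 1 down, 2 left, 3 right).

```python
import numpy as np

# same action order as the Agent code: up, down, left, right
ARROWS = np.array(["^", "v", "<", ">"])


def greedy_policy(qtable, rows, cols):
    """Return a rows x cols grid of arrows for the greedy action in each state."""
    best = qtable.argmax(axis=1)  # best action index per state (first one on ties)
    return ARROWS[best].reshape(rows, cols)


# toy Q table for a 2 x 2 grid with 4 actions (illustrative values only)
demo_q = np.zeros((4, 4))
demo_q[0, 3] = 1.0  # state 0: right is best
demo_q[1, 1] = 1.0  # state 1: down is best
demo_q[2, 0] = 1.0  # state 2: up is best
demo_q[3, 2] = 1.0  # state 3: left is best

policy = greedy_policy(demo_q, 2, 2)
for row in policy:
    print(" ".join(row))
```

For the agent trained above, the call would be `greedy_policy(agent.qtable, 4, 12)`. Arrows shown for the cliff and goal cells carry no meaning, since the Q values of those states are never updated, and `argmax` returns the first action whenever several are tied.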

Areas for improvement

Convergence is too slow. The code above needed 20,000 episodes to converge, which is inconsistent with the course, where the result converges in about 100 episodes, so the efficiency of the algorithm still needs improvement. It is worth adding that convergence was not reached when the maximum number of episodes was set to about 100, so the simulation simply uses 20,000; it may well converge earlier than that.
What can be improved: building a model. The code above is model-free, and a model that guides the policy may give better results. Of course, it may also trap the agent in local exploration; this needs further study.
Connection to research: to combine this with my research direction, I need to learn how to handle multiple agents learning in the same environment at the same time.
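On the convergence point above, one way to check whether training converges before the 20,000-episode limit is to stop once the greedy policy has stayed the same for a number of consecutive episodes. The sketch below is a heuristic, not part of the original code: `train_until_stable`, `run_episode` and `patience` are hypothetical names, and because ε-greedy keeps exploring, a stable greedy policy does not prove that the Q values themselves have converged.

```python
import numpy as np


def train_until_stable(run_episode, qtable, max_episodes=20000, patience=200):
    """Run episodes until the greedy policy is unchanged for `patience` episodes.

    run_episode: hypothetical callable that plays one episode and updates
    `qtable` in place. Returns the number of episodes actually run.
    """
    last_policy = qtable.argmax(axis=1)
    stable = 0
    for episode in range(1, max_episodes + 1):
        run_episode()
        policy = qtable.argmax(axis=1)
        # count consecutive episodes with an identical greedy policy
        stable = stable + 1 if np.array_equal(policy, last_policy) else 0
        last_policy = policy
        if stable >= patience:
            return episode
    return max_episodes


# toy stand-in for training: the greedy action of state 0 flips for 5 episodes, then settles
demo_q = np.zeros((3, 2))
calls = {"n": 0}


def toy_episode():
    calls["n"] += 1
    if calls["n"] <= 5:
        demo_q[0, calls["n"] % 2] += calls["n"]


episodes_run = train_until_stable(toy_episode, demo_q, max_episodes=100, patience=3)
```

For the agent above, `run_episode` would wrap the inner `while True` loop of the driver code and `qtable` would be `agent.qtable`, which `learn` updates in place.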

References and closing remarks

The Cliff-Walking simulation is an example from the Reinforcement Learning Course by David Silver.
The example appears in the fifth lecture of the course.
This concludes my notes on the reinforcement learning course for now. Hooray!

Reinforcement learning: a comparative experiment of Sarsa and Q-Learning on Cliff-Walking