Reinforcement Learning - Understanding and Application: Solving Maze Problems

What is Reinforcement Learning?

Reinforcement Learning (RL) is a machine learning method in which an agent learns, through interaction with the environment, how to choose actions that maximize its cumulative reward.

Seven basic concepts

Reinforcement learning is mainly composed of Agent, Environment, State, Action, Reward, Policy, and Value.

In reinforcement learning, the agent learns through trial and error: it adjusts its behavior by observing the feedback (reward or punishment) from the environment, gradually improving its policy.

How do we make sense of all these concepts? On their own they are hard to grasp, so let's illustrate them with an example: a maze game.

The maze looks like the one in the picture: black cells are walls and cannot be entered; when the mouse tries to walk into a wall, it stays where it is. White cells are open space and can be walked on. Blue dots mark the cells that have already been visited. The start is in the upper-left corner and the exit is in the lower-right corner.

1. Agent

The red dot at the start represents the agent, which plays inside the maze environment:

The goal of reinforcement learning is to make this red dot intelligent. How intelligent? First, intelligent enough to find a path from the start to the exit; eventually, intelligent enough to find a suitable path to the exit from any starting point.

2. Environment

Here the environment is the maze itself. It contains the starting point, white cells that can be passed through, black cells that are obstacles, and a green dot marking the exit. The maze is 8 cells long and 8 cells wide. Together, these elements make up the environment for reinforcement learning.

3. State

This concept is more abstract for beginners. In the maze game, the state can be understood as the cell where the red dot is currently located.

In the 8×8 grid, the upper-left corner is the starting point, rows are numbered 0-7, and columns are numbered 0-7. Assuming the agent has reached the red point indicated by the arrow, its state can be abstracted as (7, 4).

4. Action

Actions are the moves an agent can perform in a particular state. They can be discrete (e.g., left/right) or continuous (e.g., the force or position applied to a robotic arm).

In the maze game, when the agent's state is (7, 4), it has only two possible actions, up and right, as shown by the red arrows in Figure 2; the action values are discrete.

5. Rewards

The reward is the feedback signal the environment gives in response to the agent's behavior. It evaluates whether the behavior was good or bad, and serves as the learning signal that guides the agent's decision-making.

In the maze game, suppose the agent's current state is (7, 4) and its previous state was (6, 4). At this point it has two possible actions: up or right.

If it moves up, it retraces its previous path, so we give it a penalty to discourage repeating steps; if it moves right instead, we give it a better reward than moving up, which makes the agent more inclined to choose right.

6. Policy

A policy defines how an agent chooses an action in a given state. This concept is also relatively abstract, so what does a policy actually look like?

Here is a commonly used policy: the ε-greedy policy.

When selecting an action, this policy chooses the current best action with probability 1-ε and a random action with probability ε. That means that even when the agent's state is (7, 4), it may still move up at the next step, although in the current situation that is not the wise choice. In other situations, however, randomly chosen actions can yield unexpectedly good results.
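
As a minimal sketch (the function name and the NumPy value-table layout are illustrative, not taken from this article), ε-greedy action selection looks like this:

import numpy as np

def epsilon_greedy_action(q_table, state, epsilon=0.1):
    """Choose a random action with probability epsilon, otherwise the currently best action."""
    num_actions = q_table.shape[-1]
    if np.random.uniform() < epsilon:
        return np.random.randint(num_actions)   # explore: pick an action at random
    return int(np.argmax(q_table[state]))       # exploit: pick the action with the highest estimated value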

The detailed algorithm will be mentioned in the next section (xxx).

7. Value function 

The value function evaluates a state or state-action pair: it represents the expected long-term cumulative reward the agent can obtain starting from that state or state-action pair.

Put more plainly, the value function takes a state of the agent and returns its expected cumulative reward. A deep learning model can be used to approximate the value function, for example by feeding the state into a neural network and having it output a value for each action.
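
As an illustrative sketch only (assuming PyTorch, which this article does not otherwise use; the network shape and names are made up for illustration), such a network could take the (row, column) state as input and output one value per action:

import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Maps a state vector such as (row, col) to one estimated value per action."""
    def __init__(self, state_dim=2, num_actions=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)

# estimated action values for state (7, 4)
print(ValueNetwork()(torch.tensor([[7.0, 4.0]])))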

The detailed algorithm will be mentioned in the next section (xxx).

Markov decision process

A Markov Decision Process (MDP) provides a mathematical framework for describing sequential decision-making problems and is one of the foundations of reinforcement learning.

It models a decision-making problem as a combination of states, actions, transition probabilities, and rewards, and finds the optimal policy by optimizing the cumulative reward.

The MDP contains the following elements:

  • State: The different states the system or environment may be in.
  • Action: The decisions or moves available in each state.
  • Transition probability: The probability distribution over next states after performing a given action in a given state.
  • Reward: The immediate reward obtained after performing an action in a given state.
  • Policy: The rule for selecting actions based on the current state.

Let's again work through these elements using the maze problem.

1. State

In this example, the state is the coordinate of the cell where the agent is located. States can be represented using coordinates (x, y), where x and y are the row and column indices of a cell in the maze.

The state can be written as a two-dimensional coordinate (x, y), where x is the row index and y is the column index. Assuming the maze has size N × M, the state set is:

S = \{ (x, y) \mid x \in [0, N),\ y \in [0, M) \}
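
In code, this state set is simply the grid coordinates; for the 8×8 maze of this article (N = M = 8) it can be enumerated as:

N, M = 8, 8
states = [(x, y) for x in range(N) for y in range(M)]  # the full state set: 64 cells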

2. Action

Actions are the moves an agent can take in a given state: up, down, left, or right. The symbols (u, d, l, r) can be used to denote these actions.

3. Transition Probability

The transition probability describes the probability distribution of the agent moving to the next state after performing an action in a certain state.

In the maze game, transitions are deterministic, because each action moves the agent to exactly one next state. For example, if the agent performs the up action in state (x, y), the next state is (x-1, y), with transition probability 1.

Since moving in the maze is deterministic, transition probabilities can be expressed as a function

T_{sas'} = P(S_{t+1} = s' \mid S_t = s, A_t = a) \in [0, 1]

where T_{sas'} is the probability of transitioning to state s' after performing action a in state s.

According to the maze rules, if the agent performs action a in state s = (x, y), the next state s' can be computed directly from a, for example:

  • if a = u, then s' = (x-1, y)
  • if a = d, then s' = (x+1, y)
  • if a = l, then s' = (x, y-1)
  • if a = r, then s' = (x, y+1)

In the boundary cases, if the agent tries to move to a position outside the maze or into a wall cell, the transition probability of that move is 0.
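
A minimal sketch of this deterministic transition rule (the function name and the walls set are illustrative, not from this article's code; staying in place on an illegal move matches the maze description above):

def deterministic_transition(state, action, walls, n=8, m=8):
    """Deterministic maze transition: illegal moves (outside the grid or into a wall) keep the agent in place."""
    x, y = state
    moves = {'u': (x - 1, y), 'd': (x + 1, y), 'l': (x, y - 1), 'r': (x, y + 1)}
    nx, ny = moves[action]
    if not (0 <= nx < n and 0 <= ny < m) or (nx, ny) in walls:
        return state          # the probability of moving to an illegal cell is 0
    return (nx, ny)           # otherwise the move happens with probability 1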

4. Reward

A reward is the immediate feedback an agent gets after performing an action.

In the maze game, the following reward mechanisms can be set:

  • A positive reward (e.g., +10) when the agent reaches the treasure (exit) cell.
  • A negative reward (e.g., -20) when the agent moves into a wall cell.
  • In all other cases, a small negative reward (e.g., -0.01) to encourage the agent to find the treasure as quickly as possible.

The reward function can be expressed as a function:

R_{sas'} = R(S_t = s, A_t = a, S_{t+1} = s')

where R_{sas'} denotes the immediate reward obtained when the agent performs action a in state s and transitions to state s'.

According to the setting of the maze, the following rewards are defined:

  • If s' is the treasure cell, then R_{sas'} = 10

  • If s' is a wall cell, then R_{sas'} = -20

  • Otherwise, R_{sas'} = -0.01
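
The same rules as a small sketch (the cell sets and function name are illustrative; in this maze the treasure sits at the exit (7, 7)):

def immediate_reward(next_state, walls, treasure=(7, 7)):
    """Immediate reward R_{sas'}: in this maze it depends only on the cell moved into."""
    if next_state == treasure:
        return 10      # reached the treasure / exit
    if next_state in walls:
        return -20     # moved into a wall cell
    return -0.01       # small step cost to encourage short paths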

Policy iteration

Policy iteration is a method for solving Markov decision processes (MDPs), and a common solution method in reinforcement learning.

Again taking the maze game as an example, the goal is to find the exit of the maze. Each time the agent reaches a position in the maze, it must choose an action (up, down, left, or right) based on its current state (its position).

We want to find an "optimal policy" that chooses the best action at each position so that the exit is reached as quickly as possible. The idea behind policy iteration is straightforward: find the optimal policy by repeatedly "improving the policy". Policy iteration therefore consists of two steps: policy evaluation and policy improvement.

Policy evaluation

Policy evaluation evaluates the current policy by computing the value of each state (the expected cumulative reward obtainable from that state). The state values are computed iteratively until the value function converges.

This may be hard to grasp in the abstract, so let's walk through the maze game:

Define the state space size as 64 and the action space size as 4; that is, on the 8×8 grid there are 4 possible actions: up, down, left, and right.

import numpy as np

num_states = 64
num_actions = 4

The policy is a two-dimensional array holding, for each state, the probability of each of the 4 actions. Initially it is uniform:

policy = np.ones((num_states, num_actions)) / num_actions

Policy iteration also maintains a value function: given a state, it returns that state's value. All values are initialized to 0.

values = np.zeros(num_states)

Define the reward matrix for the maze:

rewards = np.zeros((8, 8)) - 0.01
rewards[0, 2] = -20
rewards[0, 6] = -20
rewards[1, 1] = -20
rewards[1, 7] = -20
rewards[2, 5] = -20
rewards[3, 1] = -20
rewards[3, 4] = -20
rewards[3, 5] = -20
rewards[3, 7] = -20
rewards[4, 1] = -20
rewards[4, 4] = -20
rewards[5, 0] = -20
rewards[5, 2] = -20
rewards[5, 4] = -20
rewards[5, 6] = -20
rewards[5, 7] = -20
rewards[6, 4] = -20
rewards[7, 2] = -20
rewards[7, 7] = 10
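
The policy-evaluation code below calls a helper get_next_state(s, a) that is not shown in this section. A sketch consistent with the flattened 0-63 state indexing used here (the agent stays in place at the maze border) might be:

def get_next_state(s, a):
    """Next flattened state index for state s (0-63) and action a (0 = up, 1 = down, 2 = left, 3 = right)."""
    row, col = divmod(s, 8)
    if a == 0 and row > 0:        # up
        row -= 1
    elif a == 1 and row < 7:      # down
        row += 1
    elif a == 2 and col > 0:      # left
        col -= 1
    elif a == 3 and col < 7:      # right
        col += 1
    return row * 8 + col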

So the code for policy evaluation is:

def policy_evaluation():
    global values
    delta = 1e-6  # threshold for stopping the iteration
    max_iterations = 1000  # maximum number of iterations
    for _ in range(max_iterations):
        new_values = np.zeros(num_states)
        for s in range(num_states):
            value = 0
            for a in range(num_actions):
                next_state = get_next_state(s, a)  # index of the next state
                reward = rewards[next_state // 8, next_state % 8]  # reward of the cell the agent moves into
                value += policy[s][a] * (reward + values[next_state])  # Bellman equation: probability-weighted sum over the four actions
            new_values[s] = value
        if np.max(np.abs(new_values - values)) < delta:
            break
        values = new_values

The value update above is based on the Bellman equation, the fundamental equation of dynamic programming and reinforcement learning, proposed by Richard Bellman.

The Bellman equation expresses the relationship between the value of a state or state-action pair and the expected reward obtained by following a particular strategy.

The general form of the Bellman equation is as follows:

V(s) = \max_{a} \left\{ \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma V(s') \right] \right\}

Here V(s) is the value function of state s, i.e., the expected return obtained by following a given policy; \max_{a} means choosing the action a that maximizes the value; \sum_{s', r} sums over all possible next states s' and rewards r; p(s', r \mid s, a) is the probability of transitioning to state s' and receiving reward r after performing action a in state s; and \gamma is the discount factor that balances current and future rewards.
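
To make the formula concrete, here is a tiny worked backup for one maze state (the numbers are illustrative, not computed in this article). Because each action leads to exactly one next state, the sum over s' and r collapses to a single term per action:

gamma = 0.9
# illustrative (immediate reward r, current estimate V(s')) for the cells reached by up/down/left/right
candidates = {
    'u': (-0.01, 0.5),
    'd': (-20.0, 0.0),   # moving into a wall
    'l': (-0.01, 0.3),
    'r': (-0.01, 1.2),
}
v_s = max(r + gamma * v_next for r, v_next in candidates.values())
print(v_s)  # 1.07 -- the 'r' action maximizes the backup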

Policy improvement

policy is a two-dimensional array of shape [num_states, num_actions]. The policy improvement step repeatedly updates the best action for each state, i.e., it updates the values along the second (num_actions) dimension of the policy array.

Pseudocode for updating the policy array:

def policy_improvement():
    for s in range(num_states):
        q_values = np.zeros(num_actions)
        for a in range(num_actions):
            next_state = get_next_state(s, a)  # index of the next state
            q_values[a] = rewards[next_state // 8, next_state % 8] + values[next_state]  # reward of the next cell plus its value
        best_action = np.argmax(q_values)
        new_policy = np.zeros(num_actions)
        new_policy[best_action] = 1  # greedy update: put all probability on the best action
        policy[s] = new_policy

Combining the two steps gives the policy iteration algorithm.

def policy_iteration():
    max_iterations = 1000  # maximum number of iterations
    for _ in range(max_iterations):
        policy_evaluation()  # policy evaluation
        policy_improvement()  # policy improvement
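
As a usage sketch (assuming the get_next_state helper sketched earlier in this section), after running policy_iteration() the greedy route from the start can be read off the policy table:

policy_iteration()

# follow the learned policy greedily from the start (state 0) to the exit (state 63)
state, route = 0, [0]
while state != 63 and len(route) < 64:       # length guard in case the policy loops
    action = int(np.argmax(policy[state]))   # the improved policy puts all probability on one action
    state = get_next_state(state, action)
    route.append(state)
print([divmod(s, 8) for s in route])         # the route as (row, col) coordinates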

In summary, policy iteration is an algorithm that solves the Markov decision process by repeatedly evaluating and improving the policy. It finds the optimal policy by continuously optimizing the policy and value function, and helps us make the best decisions in problems such as maze games.

Value iteration

Value iteration is another solution method in reinforcement learning, used to find the optimal value function of a Markov decision process (MDP).

Value iteration can be summarized as follows:

  • Value iteration approaches the optimal value function by iteratively updating the value function to determine the optimal strategy.
  • The key to value iteration is to update the value function in each iteration.
  • For each state, the action that maximizes the value is chosen by considering all possible actions and the next state, and an updated value function is computed.
  • The value-function update formula is again the Bellman equation, the same one used for the value update in policy iteration.
  • Value iteration requires multiple iterations until the value function converges. At convergence, the value function no longer changes significantly.

For this reason, value iteration is a simpler iterative method than policy iteration.

def value_iteration(grid):
    # parameters
    gamma = 0.9  # discount factor
    epsilon = 1e-6  # convergence threshold
    # initialize the value function
    f_values = np.zeros(grid.shape)
    # action set: right, left, down, up
    actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]

    # run value iteration
    while True:
        delta = 0
        n, m = grid.shape
        for i in range(n):
            for j in range(m):
                if grid[i, j] == -5 or grid[i, j] == 10:  # skip wall cells (-5) and the exit cell (10)
                    continue
                # maximum value reachable from the current state
                max_value = -np.inf
                for x, y in actions:
                    ni, nj = i + x, j + y
                    # boundary check + wall check
                    if 0 <= ni < grid.shape[0] and 0 <= nj < grid.shape[1] and grid[ni, nj] != -5:
                        max_value = max(max_value, gamma * f_values[ni, nj])
                # update the value function
                new_value = grid[i, j] + max_value
                delta = max(delta, abs(new_value - f_values[i, j]))
                f_values[i, j] = new_value
        if delta < epsilon:
            break
    print(f"Optimal value function:\n{f_values}")
    return f_values
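
A usage sketch for the function above, building an 8×8 grid that follows its conventions (wall cells marked -5, the exit 10, a small negative step cost elsewhere; the wall layout here is illustrative only):

import numpy as np

grid = np.full((8, 8), -0.01)
grid[7, 7] = 10                                  # exit cell
for wall in [(0, 2), (1, 1), (5, 0), (7, 2)]:    # a few illustrative wall cells
    grid[wall] = -5

f_values = value_iteration(grid)                 # prints and returns the converged value table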

Maze game application

Policy value definition:

The policy value (P value) table stores an estimated value for each state-action pair. For a given state s and action a, the P value is an estimate of the long-run reward obtained by performing action a in state s.

The P values are updated iteratively, gradually approaching the optimal policy. The update rule is:

P(s,a) \leftarrow (1-\alpha)\, P(s,a) + \alpha \left( r + \gamma \max_{a'} P(s', a') \right)

Here P(s, a) is the value of performing action a in state s, \alpha is the learning rate (0 < \alpha \le 1), r is the immediate reward obtained after performing action a, \gamma is the discount factor (0 \le \gamma \le 1), s' is the next state reached, a' ranges over the actions available in the next state, and \max_{a'} P(s', a') is the largest value among all possible actions in state s'.

The update rule blends the current P value with the newly estimated value so that P gradually converges to the optimum. \alpha controls the weight given to the new estimate, and \gamma controls how much emphasis is placed on future returns.

By continuously executing the update rules, the reinforcement learning algorithm can gradually learn the optimal P value, and select the best action according to the P value to achieve the optimal strategy.

import numpy as np


def get_possible_actions(row_num, clo_num, row_n, col_n):
    target_actions = [0, 1, 2, 3]  # up, down, left, right
    if row_num == 0:  # cannot move up
        target_actions.remove(0)
    if clo_num == 0:  # cannot move left
        target_actions.remove(2)
    if row_num == row_n - 1:  # cannot move down
        target_actions.remove(1)
    if clo_num == col_n - 1:  # cannot move right
        target_actions.remove(3)

    return target_actions


def get_next_state(state, action):
    row_num, clo_num = state
    next_state = state
    if action == 0:  # up
        next_state = (row_num - 1, clo_num)
    elif action == 1:  # down
        next_state = (row_num + 1, clo_num)
    elif action == 2:  # left
        next_state = (row_num, clo_num - 1)
    elif action == 3:  # right
        next_state = (row_num, clo_num + 1)
    return next_state


def get_best_reward_route(grid, begin_cord, exit_coord, max_iterations):
    """
    Find the route with the best total reward.
    :param grid: reward grid
    :param begin_cord: start position
    :param exit_coord: exit position
    :param max_iterations: maximum number of iterations
    :return: best route and its maximum reward
    """
    action_n = 4
    row_n, col_n = grid.shape

    alpha = 0.1  # learning rate
    gamma = 0.9  # discount factor
    epsilon = 0.3  # ε for the ε-greedy policy

    # initialize the policy value (P) table
    policy = np.zeros((row_n, col_n, action_n))

    best_route = []
    max_route_reward = -np.inf
    for n_iter in range(max_iterations):
        # reset to the start position
        state = begin_cord
        route = [state]
        while state != exit_coord:  # stop when the exit is reached
            row_num, clo_num = state
            # get the set of feasible actions
            possible_actions = get_possible_actions(row_num, clo_num, row_n, col_n)
            # choose an action
            if np.random.uniform() < epsilon:
                action = np.random.choice(possible_actions)  # ε-greedy: explore with a random action
            else:
                action = possible_actions[np.argmax(policy[row_num, clo_num, possible_actions])]  # exploit: action with the largest P value
            # execute the action and compute the next state
            next_state = get_next_state(state, action)

            # immediate reward
            reward = grid[next_state]

            # update the policy P value
            policy[state][action] = (1 - alpha) * policy[state][action] + alpha * (reward + gamma * np.max(policy[next_state]))

            # update the state
            state = next_state
            route.append(state)

        route_reward = sum(grid[state] for state in route)
        if max_route_reward < route_reward:
            max_route_reward = route_reward
            best_route = route.copy()
            print(f"iteration: {n_iter}, max_route_reward: {max_route_reward}, best_route: {best_route}")

        route.clear()

    print('-' * 100)
    return best_route, max_route_reward


if __name__ == '__main__':
    # create the maze reward grid
    grid = np.zeros((8, 8)) - 0.001
    # start position
    begin_cord = (0, 0)
    # exit position
    exit_coord = (7, 7)

    # reaching the exit earns 10 points
    grid[exit_coord] = 10
    # moving into a wall cell costs 20 points
    grid[0, 2] = -20
    grid[0, 6] = -20
    grid[1, 1] = -20
    grid[1, 7] = -20
    grid[2, 5] = -20
    grid[3, 1] = -20
    grid[3, 4] = -20
    grid[3, 5] = -20
    grid[3, 7] = -20
    grid[4, 1] = -20
    grid[4, 4] = -20
    grid[5, 0] = -20
    grid[5, 2] = -20
    grid[5, 4] = -20
    grid[5, 6] = -20
    grid[5, 7] = -20
    grid[6, 4] = -20
    grid[7, 2] = -20
    print(grid)
    print('-' * 100)
    best_route, max_route_reward = get_best_reward_route(grid, begin_cord, exit_coord, max_iterations=200)
    print(f"best_route:{best_route}\nmax_route_reward:{max_route_reward}\n")

result:

best_route:[(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3), (3, 3), (4, 3), (5, 3), (6, 3), (7, 3), (7, 4), (7, 5), (7, 6), (7, 7)]
max_route_reward:9.986

Of course, the result is not unique: there are many paths with the same total reward.

Debugging tips:

If the search gets stuck in a local optimum, increase the ε value of the ε-greedy policy; here it was raised from 0.1 to 0.3.

If convergence is slow, increase the wall penalty and lower the reward of blank cells appropriately (it must stay below 0).


