A robot that automatically walks a maze, based on Python

Resource download address : https://download.csdn.net/download/sheziqiong/85631466

The robot automatically walks the maze

1 Topic Background

1.1 Experimental topic

This experiment requires using a basic search algorithm and the Deep Q-Learning algorithm to make the robot walk the maze automatically.

Figure 1 map (size10)

As shown in the figure above, the red ellipse in the upper left corner is both the starting point and the initial position of the robot, and the green square in the lower right corner is the exit.

The rules of the game are: start from the starting point, go through the intricate maze, and reach the target point (exit).

  • Actions that can be performed at any position include: go up 'u', go right 'r', go down 'd', go left 'l'.

  • Different actions yield different rewards depending on the situation. Specifically, the following cases are distinguished:

    • hitting a wall
    • reaching the exit
    • all other cases
  • You are required to implement the robot based on the basic search algorithm and the Deep Q-Learning algorithm, so that the robot can automatically find its way to the exit of the maze.

1.2 Experimental requirements

  • Use the Python language.
  • Use a basic search algorithm to make the robot solve the maze.
  • Use the Deep Q-Learning algorithm to make the robot solve the maze.
  • The algorithm part must be implemented by yourself; ready-made packages, tools, or interfaces may not be used.

1.3 Important Python packages used in the experiment

import os
import random
import numpy as np
import torch

2 Maze Introduction

A maze can be created randomly through the Maze class.
  1. Use Maze(maze_size=size) to randomly generate a size * size maze.
  2. Use the print() function to output the size of the maze and draw the maze map.
  3. The red circle is the initial position of the robot.
  4. The green square is the exit position of the maze.

Figure 2 gif map (size10)

The important member methods of the Maze class are as follows (a usage sketch follows the list):
  1. sense_robot(): Get the current position of the robot in the maze.

return: The current position of the robot in the maze.

  2. move_robot(direction): Move the robot in the input direction; an error message is returned if the direction is illegal.

direction: the direction of movement, such as "u"; legal values: ['u', 'r', 'd', 'l']

return: the reward value for executing the action

  3. can_move_actions(position): Get the directions in which the robot can currently move

position: coordinate point anywhere in the maze

return: the action that can be performed at this point, such as: ['u','r','d']

  4. is_hit_wall(self, location, direction): Determine whether moving in the given direction hits a wall

location, direction: the current location and the direction to move, such as (0,0), "u"

return: True (hit the wall) / False (do not hit the wall)

  5. draw_maze(): Draw the current maze
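
A minimal usage sketch of the methods above (the import path from Maze import Maze is an assumption; adjust it to the layout of the provided project):

from Maze import Maze                      # assumed import path for the provided Maze class

maze = Maze(maze_size=10)                  # randomly generate a 10 * 10 maze
print(maze)                                # output the maze size and draw the maze map

position = maze.sense_robot()              # current position of the robot, e.g. (0, 0)
actions = maze.can_move_actions(position)  # legal actions at this position, e.g. ['r', 'd']
print(maze.is_hit_wall(position, 'u'))     # True if moving up from here would hit a wall
reward = maze.move_robot(actions[0])       # move the robot and receive the reward value
maze.draw_maze()                           # draw the current maze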

Three Algorithms Introduction

3.1 Depth-first algorithm

Algorithm specific steps:
  • Select a vertex $V_i$ in the graph as the starting point, visit and mark it;

  • With $V_i$ as the current vertex, search its adjacent points $V_j$ in turn: if $V_j$ has not been visited, visit and mark it; if $V_j$ has already been visited, search the next adjacent point of $V_i$;

  • With $V_j$ as the current vertex, repeat the previous step until every vertex connected to $V_i$ by a path has been visited;

  • If there are still vertices in the graph that have not been visited (in the case of non-connectivity), you can take an unvisited vertex in the graph as the starting point, and repeat the above process until all vertices in the graph are visited.

Time complexity:

The time required to find the adjacent points of every vertex is $O(n^2)$, where $n$ is the number of vertices, so the time complexity of the algorithm is $O(n^2)$.
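
As an illustration of these steps, here is a minimal sketch of depth-first search over a plain adjacency-list graph (not the Maze class; the maze-specific implementation is given in Section 4.1):

def dfs(graph, start):
    """Depth-first traversal of a graph given as {vertex: [adjacent vertices]}."""
    visited = set()
    stack = [start]                   # take the start vertex as V_i
    order = []                        # record the visiting order
    while stack:
        v = stack.pop()               # current vertex
        if v in visited:              # already visited: continue with the next adjacent vertex
            continue
        visited.add(v)                # visit and mark the vertex
        order.append(v)
        for w in reversed(graph[v]):  # push the unvisited adjacent vertices V_j
            if w not in visited:
                stack.append(w)
    return order

# e.g. dfs({0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}, 0) returns [0, 1, 3, 2]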

3.2 Reinforcement learning QLearning algorithm

Q-Learning is a value iteration algorithm. Unlike policy iteration algorithms, a value iteration algorithm computes the value (or utility) of each "state" or "state-action" pair and then tries to maximize that value. Accurately estimating each state value is therefore the core of value iteration algorithms. Usually, the long-term reward of an action is maximized; that is, not only the reward brought by the current action but also the long-term rewards of the action are taken into account.

3.2.1 Q value calculation and iteration

The Q-Learning algorithm organizes the states and actions into a Q_table to store the Q values; the rows of the Q table represent the states and the columns represent the actions.

In the Q-Learning algorithm, this long-term reward is recorded as the Q value, and the Q value of each "state-action" pair is considered. Specifically, its calculation formula is:

$$Q(s_t, a) = R_{t+1} + \gamma \times \max_a Q(a, s_{t+1})$$

That is, for the current "state-action" pair $(s_t, a)$, consider the environmental reward $R_{t+1}$ obtained after performing action $a$, as well as the maximum Q value $\max_a Q(a, s_{t+1})$ that can be obtained by performing any action once action $a$ has led to state $s_{t+1}$; $\gamma$ is the discount factor.

After the new Q value is calculated, a more conservative way of updating the Q table is generally used: the slack variable (learning rate) $\alpha$ is introduced and the table is updated according to the following formula, which makes the iterative change of the Q table gentler.

$$Q(s_t, a) = (1 - \alpha) \times Q(s_t, a) + \alpha \times \left( R_{t+1} + \gamma \times \max_a Q(a, s_{t+1}) \right)$$
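
For concreteness, the two update formulas above can be written as a short sketch over a dictionary-based Q table (the names q_table, alpha and gamma are illustrative and not part of the provided framework):

def q_update(q_table, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One Q-Learning update: Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (R + gamma * max_a' Q(s',a'))."""
    target = reward + gamma * max(q_table[next_state].values())   # R_{t+1} + gamma * max_a Q(a, s_{t+1})
    q_table[state][action] = (1 - alpha) * q_table[state][action] + alpha * target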

3.2.2 Selection of robot actions

In reinforcement learning, the exploration-exploitation problem is a very important problem. Specifically, according to the above definition, the robot will try its best to choose the optimal decision every time to maximize the long-term reward. But doing so has the following disadvantages:

  1. In the early stage of learning, the Q values are not accurate; choosing actions according to them at this point will lead to mistakes.
  2. After learning for a while, the robot's route becomes relatively fixed, so the robot can no longer explore the environment effectively.

Therefore, a method is needed to solve the above problems and increase the robot's exploration. Usually the epsilon-greedy algorithm is used (a short sketch follows the list below):

  1. When the robot chooses an action, with a certain probability it selects a random action, and with the remaining probability it selects the action with the optimal Q value.
  2. At the same time, the probability of choosing a random action should gradually decrease as training proceeds.
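
A minimal sketch of this epsilon-greedy selection with a decaying epsilon (the variable names are illustrative):

import random

def choose_action(q_table, state, valid_actions, epsilon):
    """With probability epsilon pick a random action, otherwise the action with the highest Q value."""
    if random.random() < epsilon:
        return random.choice(valid_actions)
    return max(valid_actions, key=lambda a: q_table[state][a])

# after every training step the exploration probability is decayed, e.g.
# epsilon *= 0.5   (the decay rate used in Section 4.2)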

3.2.3 Learning process of Q-Learning algorithm

3.2.4 Robot class

This assignment provides the QRobot class, which implements the Q-table iteration and the robot's action selection strategy; it can be imported with from QRobot import QRobot.

Core member methods of the QRobot class (a usage sketch follows the list):

  1. sense_state(): Get the current location of the robot

return: the position coordinates of the robot, such as: (0, 0)

  2. current_state_valid_actions(): Get the actions the robot can currently legally perform

return: A list of currently legal actions, such as: ['u','r']

  3. train_update(): Execute an action according to the Q-Learning strategy in the training state

return: the currently selected action, and the reward for executing the current action, such as: 'u', -1

  4. test_update(): Execute an action according to the Q-Learning strategy in the test state

return: the currently selected action, and the reward obtained by executing the current action, such as: 'u', -1

  5. reset()

Reset the position of the robot in the maze
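
A minimal sketch of using these methods directly (the QRobot constructor and the Maze import path are assumptions; the Runner class described in the next section wraps this kind of loop):

from QRobot import QRobot
from Maze import Maze                       # assumed import path

robot = QRobot(Maze(maze_size=5))           # assumed constructor: the robot is bound to a maze
robot.reset()                               # put the robot back at the start of the maze
print(robot.sense_state())                  # current position, e.g. (0, 0)
print(robot.current_state_valid_actions())  # legal actions here, e.g. ['r', 'd']

action, reward = robot.train_update()       # training step: choose an action and update the Q table
action, reward = robot.test_update()        # test step: execute an action under the test-state strategy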

3.2.5 Runner class

The QRobot class implements the Q value iteration and action selection strategy of the Q-Learning algorithm. While the robot is trained to walk the maze automatically, the Q-Learning algorithm must be used to iteratively update the Q table until it reaches an "optimal" state, so a Runner class is packaged for robot training and visualization. It can be imported with from Runner import Runner.

Core member methods of the Runner class (a usage sketch follows the list):

  1. run_training(training_epoch, training_per_epoch=150): Train the robot, constantly update the Q table, and save the training result in the member variable train_robot_record

training_epoch, training_per_epoch: the total number of training times, the maximum number of steps the robot moves each time

  2. run_testing(): Test whether the robot can get out of the maze

  3. generate_gif(filename): Output the training result to the specified gif image

filename: a legal file path; the file name must end with the .gif suffix

  4. plot_results(): Display the metrics recorded during training in charts: Success Times, Accumulated Rewards, Running Times per Epoch
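
A minimal sketch of training and visualizing with the Runner class (the QRobot and Runner constructors are assumptions; the four methods are the ones listed above):

from QRobot import QRobot
from Runner import Runner
from Maze import Maze                  # assumed import path

maze = Maze(maze_size=5)
robot = QRobot(maze)                   # assumed constructor, as in the sketch in Section 3.2.4
runner = Runner(robot)                 # assumed constructor: the runner wraps the robot

runner.run_training(training_epoch=20, training_per_epoch=150)   # train and keep updating the Q table
runner.run_testing()                                             # test whether the robot gets out of the maze
runner.generate_gif("training_size5.gif")                        # the file name must end with .gif
runner.plot_results()                  # Success Times, Accumulated Rewards, Running Times per Epoch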

3.3 DQN

The DQN algorithm uses a neural network to approximate the value function, and the algorithm block diagram is as follows.

In this experiment, the provided neural network is used to predict the evaluation scores of the four actions, and it outputs all four scores at the same time.

Core member methods of the ReplayDataSet class (a usage sketch follows the list):

  • add(self, state, action_index, reward, next_state, is_terminal): add a piece of training data

state: current robot position

action_index: the index of the action that was executed

reward: the reward for performing the action

next_state: the position of the robot after performing the action

is_terminal: Whether the robot has reached the terminal node (reached the end or hit the wall)

  • random_sample(self, batch_size): Randomly sample a fixed batch_size of data from the dataset

batch_size: an integer; it must not exceed the number of entries in the dataset

  • build_full_view(self, maze): Open the cheat to get the full view

maze: an object instantiated with the Maze class
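
A minimal sketch of filling and sampling the replay memory during DQN training (the import paths and the ReplayDataSet constructor argument are assumptions; add, random_sample and build_full_view are the methods documented above):

from ReplayDataSet import ReplayDataSet    # assumed import path
from Maze import Maze                      # assumed import path

maze = Maze(maze_size=5)
memory = ReplayDataSet(max_size=2000)      # assumed constructor argument

# store one transition observed while interacting with the maze
state = maze.sense_robot()
action_index = 1                           # illustrative: index of 'r' in ['u', 'r', 'd', 'l']
reward = maze.move_robot('r')
next_state = maze.sense_robot()
is_terminal = (next_state == maze.destination)   # illustrative check; the description above also counts hitting a wall
memory.add(state, action_index, reward, next_state, is_terminal)

batch = memory.random_sample(batch_size=1)       # batch_size must not exceed the dataset size
memory.build_full_view(maze)                     # "cheat": fill the dataset with the full view of the maze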

4 Solution Results

4.1 Depth-first search

A depth-first search algorithm was written and tested; it iterates with a stack and finally searches out a path. The main process: the entrance node is taken as the root node, and the algorithm then checks whether the current node has been explored and whether it has child nodes. If these conditions are met, the node is expanded and its child nodes are pushed onto the stack in order. If a node has been explored but is not the end point and has no expandable child nodes, it is popped off the stack. This loops until the end point is found.

The test results are as follows:
  • If maze_size=5, running the basic search algorithm gives the following final result:
Path found: ['r', 'd', 'r', 'd', 'd', 'r', 'r', 'd']
Congratulations, you have reached the goal
Maze of size (5, 5)

Figure 3 Basic search map (size5)

  • If maze_size=10, running the basic search algorithm gives the following final result:
Path found: ['r', 'r', 'r', 'r', 'r', 'r', 'r', 'd', 'r', 'd', 'd', 'd', 'r', 'd', 'd', 'd', 'l', 'd', 'd', 'r']
Congratulations, you have reached the goal
Maze of size (10, 10)

Figure 4 Basic search map (size10)

  • If maze_size=20, running the basic search algorithm gives the following final result:
Path found: ['d', 'r', 'u', 'r', 'r', 'r', 'r', 'd', 'r', 'd', 'r', 'r', 'r', 'r', 'd', 'd', 'r', 'd', 'd', 'd', 'd', 'r', 'r', 'r', 'r', 'r', 'd', 'r', 'r', 'd', 'r', 'd', 'd', 'l', 'l', 'd', 'd', 'd', 'd', 'd', 'r', 'd', 'd', 'r']
Congratulations, you have reached the goal
Maze of size (20, 20)

Figure 5 Basic search map (size20)

Part of the code is as follows:
def myDFS(maze):
        """
        Perform a depth-first search on the maze.
        :param maze: the Maze object to be searched
        """
        start = maze.sense_robot()
        root = SearchTree(loc=start)
        queue = [root]  # node stack used for the traversal
        h, w, _ = maze.maze_data.shape
        is_visit_m = np.zeros((h, w), dtype=int)  # mark whether each position of the maze has been visited
        path = []  # record the path
        peek = 0
        while True:
            current_node = queue[peek]  # the element on top of the stack is the current node
            # is_visit_m[current_node.loc] = 1  # mark the current node position as visited
            if current_node.loc == maze.destination:  # the goal has been reached
                path = back_propagation(current_node)
                break
            if current_node.is_leaf() and is_visit_m[current_node.loc] == 0:  # the node is an unexpanded leaf and has not been visited
                is_visit_m[current_node.loc] = 1  # mark the node as expanded
                child_number = expand(maze, is_visit_m, current_node)
                peek += child_number  # carry out the corresponding push operations
                for child in current_node.children:
                    queue.append(child)  # push the child nodes onto the stack
            else:
                queue.pop(peek)  # pop the node if there is no way forward
                peek -= 1
        return path

4.2 QLearning

During algorithm training, the robot's current position is read first, and the current state is added to the Q table (if the state already exists in the table, it does not need to be added again). The robot then performs an action, the maze returns the reward value, and the robot's new position is read. The Q table is then checked and updated again, and the probability of randomly selecting an action is decayed.

In the implementation of the Q-Learning algorithm, the update of the Q table was the main part modified and adjusted. The adjusted Q table performs well at runtime, with fast computation, high accuracy, and high stability. The decay rate of the probability of randomly selecting an action was then tuned, because testing showed that if the decay is too slow, the randomness is too strong, which indirectly weakens the effect of the reward. After tuning, a decay rate of 0.5 was found to be a good and stable value.

Part of the code is as follows:
    def train_update(self):
        """
        Choose an action in the training state and update the related parameters.
        :return: action, reward, e.g. "u", -1
        """
        self.state = self.maze.sense_robot()  # get the robot's current position in the maze

        # look up the Q table and add the current state if it is not present yet
        if self.state not in self.q_table:
            self.q_table[self.state] = {a: 0.0 for a in self.valid_action}

        # epsilon-greedy: random action with probability epsilon, otherwise the action with the highest Q value
        action = random.choice(self.valid_action) if random.random() < self.epsilon else max(self.q_table[self.state], key=self.q_table[self.state].get)
        reward = self.maze.move_robot(action)  # move the robot in the given direction; reward is the value returned by the maze
        next_state = self.maze.sense_robot()  # get the robot's position after executing the action

        # look up the Q table and add next_state if it is not present yet
        if next_state not in self.q_table:
            self.q_table[next_state] = {a: 0.0 for a in self.valid_action}

        # update the Q table (the adjusted update rule used in this report; it differs from the standard form in Section 3.2.1)
        current_r = self.q_table[self.state][action]
        update_r = reward + self.gamma * float(max(self.q_table[next_state].values()))
        self.q_table[self.state][action] = self.alpha * self.q_table[self.state][action] + (1 - self.alpha) * (update_r - current_r)

        self.epsilon *= 0.5  # decay the probability of choosing a random action

        return action, reward
The test results are as follows:
  • If maze_size=3, running the reinforcement learning search algorithm gives the following final result:

Figure 6 Reinforcement learning search gif map (size3)

Figure 7 Training results
  • If maze_size=5, running the reinforcement learning search algorithm gives the following final result:

Figure 8 reinforcement learning search gif map (size5)

Figure 9 Training results

  • If maze_size=10, running the reinforcement learning search algorithm gives the following final result:

Figure 10 Reinforcement learning search gif map (size10)

Figure 11 Training results

  • If maze_size=11, running the reinforcement learning search algorithm gives the following final result:

Figure 12 Reinforcement learning search gif map (size11)


Figure 13 Training results

After testing, the reinforcement learning search algorithm can quickly give a path out of the maze, and the success rate gradually increases as the number of training rounds increases. With enough training rounds, the final accuracy reaches 100%.

4.3 DQN

Building on Q-Learning, a neural network is used to estimate the evaluation scores of the candidate actions; the table lookup in the corresponding part of Q-Learning is simply replaced by the output of the neural network, as sketched below.
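
A hedged sketch of that replacement: instead of reading Q(s, a) from a table, a small torch network outputs the four action scores, and the same target R + γ·max Q(s', a') from Section 3.2.1 is used as the regression target. The network architecture, optimizer handling, and batch format below are illustrative assumptions, not the provided network:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a 2-D maze position to evaluation scores for the four actions ['u', 'r', 'd', 'l']."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, state):
        return self.net(state)

def dqn_train_step(model, optimizer, batch, gamma=0.9):
    """One DQN update on a sampled batch.

    Assumed batch format: state (B, 2) float, action_index (B, 1) long,
    reward (B, 1) float, next_state (B, 2) float, is_terminal (B, 1) float.
    """
    state, action_index, reward, next_state, is_terminal = batch
    q_value = model(state).gather(1, action_index)              # Q(s, a) of the actions actually taken
    with torch.no_grad():
        q_next = model(next_state).max(dim=1, keepdim=True)[0]  # max_a' Q(s', a')
        target = reward + gamma * q_next * (1 - is_terminal)    # no bootstrapping at terminal states
    loss = nn.functional.mse_loss(q_value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()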

The test results are as follows:
  • If maze_size=3, running the DQN algorithm gives the following final result:

    Figure 14 Training results

  • If maze_size=5, running the DQN algorithm gives the following final result:

    Figure 15 Training results

  • If maze_size=10, running the DQN algorithm gives the following final result:

    Figure 16 Training results

4.4 Submit the result test

4.4.1 Basic search algorithm test

Figure 17 Basic search algorithm path

0 seconds

4.4.2 Reinforcement Learning Algorithms (Elementary)

Figure 18 Reinforcement learning algorithm (primary)

0 seconds

4.4.3 Reinforcement Learning Algorithms (Intermediate)

Figure 19 Reinforcement Learning Algorithm (Intermediate)

0 seconds

4.4.4 Reinforcement Learning Algorithms (Advanced)

Figure 20 Reinforcement learning algorithm (advanced)

0 seconds

4.4.5 DQN algorithm (primary)

Figure 21 DQN algorithm (primary)

2 seconds

4.4.6 DQN Algorithm (Intermediate)

Figure 22 DQN algorithm (intermediate)

3 seconds

4.4.7 DQN algorithm (advanced)

Figure 23 DQN algorithm (advanced)


Origin: blog.csdn.net/newlw/article/details/124897877