Python Neural Network Learning (7) -- Reinforcement Learning -- Using a Neural Network

foreword

I covered reinforcement learning earlier, but only with a table. That still counts as reinforcement learning; after all, reinforcement learning is, at its core, learning through trial and error.

But now a problem arises: what if the table is very large? Cliff walking is only a 12-by-4 grid with 4 actions per position. If the game were League of Legends, there would be so many positions, each with so many possible actions, that drawing a table would be simply unimaginable.

In fact, you can think of this table as a mathematical function: its input is a coordinate, and its output is an action (or the value of each action).

That is to say, as long as we feed in a coordinate and get an action back, we don't need to care about what happens in the middle. Recall what was said in an earlier article: neuron (function) + neuron (function) = neural network. The middle part can therefore be replaced by a neural network, and that is deep reinforcement learning.
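To make the idea concrete, here is a small sketch (my own illustration, not code from this series) contrasting the table lookup with the "function" view; q_table and q_fn are placeholder names:

import numpy as np

# tabular version: a 4 x 12 grid with 4 actions per cell
q_table = np.zeros((4, 12, 4))
value_of_up_at_start = q_table[3, 0, 0]   # look up the value of "up" at the start cell (3, 0)

# "function" version: anything with the interface state -> action values will do
def q_fn(state):
    row, col = state
    return q_table[row, col]  # still backed by the table here, but it could just as well be a neural network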

Paper (Playing Atari with Deep Reinforcement Learning) address: https://arxiv.org/abs/1312.5602

Setting up the environment

Note: I have modified today's environment code, so it differs from the earlier version; please read the environment code first.

This version of the environment adds a configurable board size and fixes a few bugs.

# -*- coding: utf-8 -*-
"""
作者:CSDN,chuckiezhu
作者地址:https://blog.csdn.net/qq_38431572
本文可用作学习使用,交流代码时需要附带本出处声明
"""

import random
import numpy as np

from gym import spaces

"""
nrows
     0  1  2  3  4  5  6  7  8  9  10  11  ncols
   ---------------------------------------
0  |  |  |  |  |  |  |  |  |  |  |   |   |
   ---------------------------------------
1  |  |  |  |  |  |  |  |  |  |  |   |   |
   ---------------------------------------
2  |  |  |  |  |  |  |  |  |  |  |   |   |
   ---------------------------------------
3   * |       cliff                  | ^ |

  *: start point
  cliff: cliff
  ^: goal
"""

class CustomCliffWalking(object):
    def __init__(self, stepReward: int=-1, cliffReward: int=-10, goalReward: int=10, col=12, row=4) -> None:
        self.sr = stepReward
        self.cr = cliffReward
        self.gr = goalReward
        self.col = col
        self.row = row

        self.action_space = spaces.Discrete(4)  # up, down, left, right
        self.reward_range = (cliffReward, goalReward)

        self.pos = np.array([row-1, 0], dtype=np.int8)  # the agent spawns at (row-1, 0); falling into the cliff kills it and triggers done with cliffReward

        self.die_pos = []
        for c in range(1, self.col-1):
            self.die_pos.append([self.row-1, c])
        print("die pos: ", self.die_pos)
        print("goal pos: ", [[self.row-1, self.col-1]])

        self.reset()
    
    def reset(self, random_reset=False):
        """
        Initialize the agent's position.
        random_reset: if True, the spawn point is chosen at random
        """
        x, y = self.row-1, 0
        if random_reset:
            y = random.randint(0, self.col-1)
            if y == 0:
                x = random.randint(0, self.row-1)  # column 0 also contains the start cell (row-1, 0)
            else:
                x = random.randint(0, self.row-2)  # other columns: stay out of the cliff/goal row
            # strictly speaking, the cliff and the goal are not part of the coordinate system
        # by default the agent spawns at (row-1, 0); falling into the cliff kills it and triggers done with cliffReward
        self.pos = np.array([x, y], dtype=np.int8)
        # print("reset at:", self.pos)
    
    def step(self, action: int) -> tuple[np.ndarray, int, bool, bool, dict]:
        """
        Take one action.
        action:
            0: up
            1: down
            2: left
            3: right
        """

        move = [
            np.array([-1, 0], dtype=np.int8),  # up: x-1, y unchanged
            np.array([ 1, 0], dtype=np.int8),  # down: x+1, y unchanged
            np.array([0, -1], dtype=np.int8),  # left: y-1, x unchanged
            np.array([0,  1], dtype=np.int8),  # right: y+1, x unchanged
        ]
        new_pos = self.pos + move[action]
        # moving up or left must not go below 0
        new_pos[new_pos < 0] = 0  # out-of-bounds handling, e.g. moving up or left at (0, 0) keeps you at (0, 0)
        # moving down or right must not leave the board
        if new_pos[0] > self.row-1:
            new_pos[0] = self.row-1  # clamp to the board
        if new_pos[1] > self.col-1:
            new_pos[1] = self.col-1

        reward = self.sr  # reward for an ordinary step
        die = False
        win = False
        info = {
            "reachGoal": False,
            "fallCliff": False,
        }
        
        if self.__is_pos_die(new_pos.tolist()):
            die = True
            info["fallCliff"] = True
            reward = self.cr
        elif self.__is_pos_win(new_pos.tolist()):
            win = True
            info["reachGoal"] = True
            reward = self.gr

        self.pos = new_pos  # update the position
        return new_pos, reward, die, win, info
    
    def __is_pos_die(self, pos: list) -> bool:
        """Check whether this position is a cliff cell (the episode ends with death)."""
        return pos in self.die_pos

    def __is_pos_win(self, pos: list) -> bool:
        """Check whether this position is the goal cell (the episode ends with a win)."""
        return pos in [
            [self.row-1, self.col-1],
        ]

As for explaining this environment, I think the comments make it reasonably clear. If anything is unclear, leave a comment and let me know.
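As a quick smoke test, here is a short usage sketch of the environment (random actions, purely for illustration; the file name cliff_walking_env.py matches the import used by the training script further below):

from cliff_walking_env import CustomCliffWalking

env = CustomCliffWalking(stepReward=-1, cliffReward=-10, goalReward=10)
env.reset()
for _ in range(20):
    action = env.action_space.sample()            # random action: 0 up, 1 down, 2 left, 3 right
    pos, reward, die, win, info = env.step(action)
    print(pos, reward, info)
    if die or win:                                # the episode ends on the cliff or at the goal
        break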

Building the network

First, put ourselves in the table's place: standing at a given coordinate, we should know the value of moving in each of the four directions. This gives two possible designs for the network:

Method one:

the input is a coordinate plus a direction, and the output is the corresponding value.

Method two:

the input is a coordinate, and the output is the value of each of the four directions.

A quick aside: method one is cumbersome, because choosing an action means running the network once for every candidate action. Method two is therefore the better choice; the sketch below shows the two interfaces side by side.
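Here is a minimal sketch (my own illustration, not part of the original code) of the two interfaces; the layer sizes assume the 48-dimensional one-hot state introduced later:

import torch
from torch import nn

# Method one: Q(state, action) -> one scalar; choosing an action needs one forward pass per candidate
method_one = nn.Linear(48 + 4, 1)   # input: one-hot state (48) concatenated with a one-hot action (4)

# Method two: Q(state) -> one value per action; a single forward pass scores all four actions
method_two = nn.Linear(48, 4)

state = torch.zeros(1, 48)
state[0, 36] = 1.0                  # one-hot for the start cell (3, 0): 3 * 12 + 0 = 36
print(method_two(state))            # four action values at once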


import random

import torch
from torch import nn


class Qac(nn.Module):
    def __init__(self, in_shape, out_shape) -> None:
        super(Qac, self).__init__()
        self.in_shape = in_shape  # size of the input, i.e. the agent's (one-hot encoded) position
        self.action_space = out_shape  # 0 up, 1 down, 2 left, 3 right
        self.dense1 = nn.Linear(self.in_shape, self.action_space)
        # the output is the value of each action

        self.lrelu = nn.LeakyReLU()  # defined but not used in forward
        self.softmax = nn.Softmax(-1)  # defined but not used in forward
    
    def forward(self, x) -> torch.Tensor:
        x = self.dense1(x)
        return x

    def sample_action(self, action_value: torch.Tensor, epsilon: float):
        """Sample an action from the predicted action values using epsilon-greedy exploration."""
        if random.random() < epsilon:
            # explore: pick a random action
            action = random.randint(0, self.action_space-1)
            action = torch.tensor(action)
        else:
            # exploit: pick the highest-valued action
            action = torch.argmax(action_value)
        
        return action
    
    def load_model(self, modelpath):
        """Load model weights from disk."""
        tmp = torch.load(modelpath)
        self.load_state_dict(tmp["model"])
    
    def save_model(self, modelpath):
        """Save model weights to disk."""
        tmp = {
            "model": self.state_dict(),
        }
        torch.save(tmp, modelpath)

Careful readers may have noticed that this network has only one layer, which is very simple: there is no so-called "feature extraction", it goes straight to the output layer. The little trick here is that I manually convert the coordinate into a one-hot vector, which you can think of as hand-crafted feature extraction.

def num_to_onehot(pos: torch.Tensor) -> torch.Tensor:
    """Convert a (row, col) coordinate into a one-hot vector of length 48."""
    n = int((pos[0] * 12 + pos[1]).item())  # flatten (row, col) to a cell index on the 4x12 board
    return nn.functional.one_hot(torch.tensor(n), num_classes=48)
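As a quick sanity check, the start cell (3, 0) flattens to index 3 * 12 + 0 = 36, so its one-hot vector has a single 1 at position 36:

pos = torch.tensor([3.0, 0.0])   # the start cell
onehot = num_to_onehot(pos)
print(onehot.shape)              # torch.Size([48])
print(onehot.nonzero())          # tensor([[36]])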

You could also use a two-layer network that takes the raw coordinates as input, with a 48-unit hidden layer followed by an output layer. I tried it: training is much slower and the results are worse, so hand-coding the features like this works better. A sketch of that variant follows.
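For reference, here is a rough sketch of that two-layer variant (the activation choice here is mine; the point is just raw coordinates in, 48 hidden units, four action values out):

class QacTwoLayer(nn.Module):
    """Two-layer variant: raw (row, col) in, four action values out."""
    def __init__(self) -> None:
        super().__init__()
        self.dense1 = nn.Linear(2, 48)   # raw coordinate -> 48 hidden units
        self.dense2 = nn.Linear(48, 4)   # hidden units -> one value per action
        self.lrelu = nn.LeakyReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dense2(self.lrelu(self.dense1(x)))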

Training

Here is the entire training code:

# -*- coding: utf-8 -*-
"""
利用DQN实现
"""
"""
作者:CSDN,chuckiezhu
作者地址:https://blog.csdn.net/qq_38431572
本文可用作学习使用,交流代码时需要附带本出处声明
"""
import os
import random
import torch
import numpy as np
from torch import nn

from matplotlib import pyplot as plt

from cliff_walking_env import CustomCliffWalking


nepisodes = 10000  # 10,000 episodes in total
epsilon = 1.0  # epsilon greedy policy
epsilon_min = 0.05
epsilon_decay = 0.9975

gamma = 0.9  # discount factor
lr = 0.001
random_reset = False

seed = 42

normalization = torch.tensor([3, 11], dtype=torch.float)

sr = -1
cr = -10
gr = 10

class Qac(nn.Module):
    def __init__(self, in_shape, out_shape) -> None:
        super(Qac, self).__init__()
        self.in_shape = in_shape  # size of the input, i.e. the agent's (one-hot encoded) position
        self.action_space = out_shape  # 0 up, 1 down, 2 left, 3 right
        self.dense1 = nn.Linear(self.in_shape, self.action_space)

        # the output is the value of each action

        self.lrelu = nn.LeakyReLU()  # defined but not used in forward
        self.softmax = nn.Softmax(-1)  # defined but not used in forward
    
    def forward(self, x) -> torch.Tensor:
        x = self.dense1(x)
        return x

    def sample_action(self, action_value: torch.Tensor, epsilon: float):
        """Sample an action from the predicted action values using epsilon-greedy exploration."""
        if random.random() < epsilon:
            # explore: pick a random action
            action = random.randint(0, self.action_space-1)
            action = torch.tensor(action)
        else:
            # exploit: pick the highest-valued action
            action = torch.argmax(action_value)
        
        return action
    
    def load_model(self, modelpath):
        """Load model weights from disk."""
        tmp = torch.load(modelpath)
        self.load_state_dict(tmp["model"])
    
    def save_model(self, modelpath):
        """Save model weights to disk."""
        tmp = {
            "model": self.state_dict(),
        }
        torch.save(tmp, modelpath)


def num_to_onehot(pos: torch.Tensor) -> torch.Tensor:
    """Convert a (row, col) coordinate into a one-hot vector of length 48."""
    n = int((pos[0] * 12 + pos[1]).item())  # flatten (row, col) to a cell index on the 4x12 board
    return nn.functional.one_hot(torch.tensor(n), num_classes=48)

    
def main():
    global epsilon
    random.seed(seed)
    torch.manual_seed(seed=seed)
    plt.ion()

    os.makedirs("./out/ff_DQN/", exist_ok=True)
    # cw = gym.make("CliffWalking-v0", render_mode="human")
    cw = CustomCliffWalking(stepReward=sr, goalReward=gr, cliffReward=cr)

    # the state is one-hot encoded, hence an input size of 48
    Q = Qac(in_shape=48, out_shape=cw.action_space.n)

    optimizer = torch.optim.Adam(Q.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()

    win_1000 = []  # wins over the most recent 1000 episodes
    total_win = 0
    for i in range(1, nepisodes+1):
        cw.reset(random_reset=random_reset)  # reset the environment
        steps = 0
        while True:
            steps += 1
            state_now = torch.tensor(cw.pos, dtype=torch.float)
            state_now = num_to_onehot(state_now).unsqueeze_(0).to(torch.float)
            action_values = Q(state_now)
            action_values = action_values.squeeze()
            action_now = Q.sample_action(action_value=action_values, epsilon=epsilon)

            action_now_value = action_values[action_now]  # predicted value of the chosen action, Q(s, a)

            state_next, reward_now, terminated, truncated, info = cw.step(action=action_now.item())   # take the action in the environment
            state_next = num_to_onehot(state_next).unsqueeze_(0).to(torch.float)
            with torch.no_grad():
                next_values = Q(state_next)
                next_values = next_values.squeeze()
                # sample the next action under the same policy (on-policy SARSA)
                action_next = Q.sample_action(action_value=next_values, epsilon=epsilon)
                action_next_value = next_values[action_next]  # value of the next state-action pair, Q(s', a')

            
            # target: immediate reward + gamma * Q(s', a'), i.e. the bootstrapped return of this action
            discounted_reward = reward_now + gamma * action_next_value * (1 - terminated) * (1 - truncated)

            # compute the loss between prediction and target
            loss = loss_fn(action_now_value, discounted_reward)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if terminated or truncated:
                if terminated:
                    win_1000.append(0)
                if truncated:
                    win_1000.append(1)
                    total_win += 1
                break

            epsilon = epsilon * epsilon_decay
            epsilon = max(epsilon, epsilon_min)  # decay the exploration rate, with a lower bound
        win_1000 = win_1000[-1000:]
        win_rate = sum(win_1000)/1000.0
        print("{}/{}, 当前探索率: {}, 是否成功: {}, 千场胜率:{}.".format(i, nepisodes, epsilon, truncated, win_rate), flush=True)
        if i % 10000 == 0:
            Q.save_model("./out/ff_DQN/Qac_{}_{}_{}_{}.pth".format(i, gr, cr, win_rate))
    print("total win: ", total_win)

    # final test: check whether the trained agent can reach the goal
    path = np.zeros((4, 12), dtype=np.float64)
    cw.reset(random_reset=False)

    steps = 0
    while steps <= 48:  # if the agent cannot reach the goal within 48 steps, it never will
        steps += 1
        state_now = torch.tensor(cw.pos, dtype=torch.float)
        state_now = num_to_onehot(state_now).unsqueeze_(0).to(torch.float)
        action_values = Q(state_now).squeeze()
        # greedy action selection (epsilon = 0)
        action_now = Q.sample_action(action_values, 0)
        print(cw.pos[0], cw.pos[1], action_now)
        new_pos, _, die, win, _ = cw.step(action=action_now)
        if win:
            print("[+] you win!")
            break
        if die:
            print("[+] you lose!")
            break
        x = new_pos[0]
        y = new_pos[1]
        if x >= 0 and x <= 3 and y >= 0 and y <= 11:
            path[x, y] = 1.0
    plt.imshow(path)
    plt.colorbar()
    plt.savefig("./out/ff_DQN/path_sarsa_"+str(sr)+"_"+str(gr)+"_"+str(cr)+".png")

if __name__ == "__main__":
    main()

I have tested the code above and it runs without problems; you can use it directly without modification. The directory structure is roughly as follows.
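A sketch of the layout, assuming the environment is saved as cliff_walking_env.py (to match the import) and using ff_DQN.py as a placeholder name for the training script:

.
├── cliff_walking_env.py    # the environment code above
├── ff_DQN.py               # the training script above (any name works)
└── out/
    └── ff_DQN/             # created automatically; saved models and the path plot go here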

The two out/ folders are generated automatically by the script and do not need to be created manually.

Network structure analysis

Here is the network structure and update process of the code above. Note: solid lines represent paths that carry gradients, dashed lines represent paths that do not.

Each time the environment produces a state, it is first converted into a one-hot vector and used as the network's input to obtain the value of the four actions. The sampled action then picks out the current Q(s, a) value, which is action_now_value in the code.

Meanwhile, the sampled action is fed to the environment, which returns the next state and the immediate reward. The next state is passed through the network again (with gradient tracking disabled) to obtain the value of the four actions in that state. Since the code uses the SARSA algorithm, an action is sampled from these values under the same policy, and its value is taken; that is action_next_value.

From the environment's immediate reward reward_now and the value action_next_value of the action in the next state we build a target (the "ground truth"); action_now_value serves as the network's prediction, and the loss is computed between the two.
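In symbols, this is the update the loop performs (my own summary of the code above):

target = reward_now + gamma * Q(s_next, a_next)    # the bootstrap term is dropped when the episode ends
loss   = (Q(s_now, a_now) - target) ** 2           # MSELoss between the prediction and the target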

The loss is then backpropagated along the solid-line path back through the network, which is how the network's parameters get updated.

 


Origin: https://blog.csdn.net/qq_38431572/article/details/131488148