Project Share | How to Play Games with Reinforcement Learning Using MindSpore

Author: Hamwon 

Summary

"Playing Atari with Deep Reinforcement Learning" is the first classic deep reinforcement learning paper that combines reinforcement learning with deep learning. It was designed and developed by the DeepMind team. The algorithm was tested in the Atari 2600 game environment, and its test performance in some games was better than Human players.

Paper URL: https://paperswithcode.com/paper/playing-atari-with-deep-reinforcement

01

Create a virtual environment project with Pycharm

The project code and training results have been uploaded to Baidu Netdisk and can be downloaded first. The virtual environment is not included because it is too large; you need to set it up yourself, as described below.
Link: https://pan.baidu.com/s/1zoh0glqH4xcNSbOUuR2r7g?pwd=00wd
Extraction code: 00wd

First create a new project using Pycharm and then add the virtual environment in the settings as shown below:

The purpose of creating a virtual environment is to separate the current project's runtime environment from your own Python environment: the required packages are installed into the virtual environment so they do not affect your existing Python setup. The Pycharm version I use is the 2019 release; the settings in newer versions of Pycharm should be similar, and you can search for the equivalent steps for your version. Everyone's Anaconda path is different, so choose the base interpreter according to your own installation location.

For the configuration of the virtual environment, refer to the CSDN article: Pycharm creates and manages a virtual environment.

[Screenshot: adding a virtual environment interpreter in the Pycharm settings]

After the virtual environment is created, you still need to set up the terminal program in the settings:

[Screenshot: configuring the terminal shell in the Pycharm settings]

At this time, open the terminal tab under Pycharm, and you can see the prompt (venv) in front of the terminal, indicating that the current terminal is in a virtual environment:

[Screenshot: the (venv) prefix shown in the Pycharm terminal]

All the packages we need can now be installed with pip in this terminal.

Remember to copy the three folders code, Imgs, and model downloaded from Baidu Netdisk into the current project folder. The Python packages required by the project are listed in the requirements.txt file in the code folder. Open the terminal tab of Pycharm and enter the code folder with the cd command:

cd code

Then install the required packages with pip:

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Normally, once the environment above is configured, the code in the code folder should run. If it does not, there may be a problem with the Atari game environment; for details, refer to this CSDN article: [gym] New version installation (0.21 or above) and configuration of the Atari environment, super simple (Windows).

02

Paper model explanation

To put it simply, the paper designs a DQN network that stacks 4 consecutive game frames, each cropped and resized to 84x84, into a 4x84x84 input, and then applies convolution + ReLU, convolution + ReLU, Flatten, fully connected + ReLU, and a final fully connected layer to obtain an output whose dimension equals the number of actions. Here we mainly train and test on the game Breakout (paddle and bricks). This game has 4 actions, so the output dimension here is 4.

The 4-dimensional output array represents the Q(s, a) values of the four actions. The index of the largest Q value is selected as the action code output by the network:

0: do nothing

1: start the game (if the game has already started, 1 also does nothing)

2: move right

3: move left

Convolution size calculation:

Output size = (input size - convolution kernel size + 2 x padding) / stride + 1

[Figure: DQN network architecture with layer output sizes]
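
As a quick check of the formula above applied to the DQN code in Section 3.2 (a small helper sketch, not part of the original code; padding is 0 because pad_mode='valid'):

def conv_out(size, kernel, stride, padding=0):
    return (size - kernel + 2 * padding) // stride + 1

s1 = conv_out(84, 8, 4)      # (84 - 8) / 4 + 1 = 20, so conv1 outputs 16 x 20 x 20
s2 = conv_out(s1, 4, 2)      # (20 - 4) / 2 + 1 = 9, so conv2 outputs 32 x 9 x 9
print(32 * s2 * s2)          # 2592, which matches in_channels=2592 of the first Dense layer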

Using the above DQN network as the agent, the agent can interact with the game environment: the network generates an action a_t from the current observation φ_t, which controls the paddle at the bottom; the environment then changes and produces a new observation φ_{t+1}. When the paddle bounces the ball and it hits the bricks above, the agent receives reward = 1 for every brick hit; otherwise the reward is 0.

The next thing that needs to be solved is how the reinforcement learning algorithm continuously updates the parameters of the agent's DQN network through the interaction between the agent and the environment, so that the agent learns to play the game.

The reinforcement learning algorithm saves the interaction experience between the agent and the environment, obtaining a series of experience tuples, namely (current observation, action, reward, next observation, end flag). In the notation of the paper, this can be written as:

e_t = (φ_t, a_t, r_t, φ_{t+1}), stored together with an end flag done_t

We feed the current observation φ_t from the experience tuple into the network; the output is a 4-dimensional array containing the values of the 4 possible actions under this observation. Using the action a_t that was actually taken in the experience tuple, we can pick the corresponding value Q(φ_t, a_t; θ) out of this array. This value obviously depends on the current network parameters θ.

In fact, according to the Bellman equation on which reinforcement learning is based, we can also estimate the current value from the reward r_t obtained after taking the action a_t and the estimated value of the corresponding next observation φ_{t+1}. This estimated value is:

y_t = r_t, if the game ends after this action
y_t = r_t + γ · max_a′ Q(φ_{t+1}, a′; θ), otherwise

In the end-of-game case there is only the reward obtained after taking the action and no next observation, so the current value estimate is just the reward.

When the game is not over, the current value estimate is the reward obtained plus the largest estimated value over the 4 actions at the next observation, multiplied by the discount factor γ. The discount factor expresses how strongly the current value is tied to future value: if γ = 0, the current value depends only on the current reward; the larger γ is, the stronger the connection. For example, with γ = 0.99 a reward received 100 steps later is weighted by 0.99^100 ≈ 0.37.

Now we have:

The value Q(φ_t, a_t; θ) of the action taken under the current observation, obtained from the DQN network.

The target value y_t, estimated from the reward plus the discounted maximum value of the next observation. According to the Bellman equation, the two estimates should be equal; however, because the network's value estimates are inaccurate, there is a difference between them:

Loss(θ) = ( y_t − Q(φ_t, a_t; θ) )²

The goal of training in the deep reinforcement learning algorithm is to reduce the difference between the two estimates, so that the DQN network's policy satisfies the Bellman equation and the network can learn the optimal strategy for this game. While the agent interacts with the environment, this is achieved by using the saved experience tuples to compute the Loss above and then updating the DQN parameters by gradient descent.

Below, the concrete implementation in Python with the MindSpore framework is given, together with the corresponding explanations.

03

MindSpore code implementation

Open the playing_atari.py file in the code folder. The code is explained below.

3.1 Game environment creation

After importing the corresponding libraries, first create the game environment env:

env = gym.make("BreakoutNoFrameskip-v4")  # game environment
env = gym.wrappers.RecordEpisodeStatistics(env)
env = gym.wrappers.ResizeObservation(env, (84, 84))  # resize frames to 84x84
env = gym.wrappers.GrayScaleObservation(env)  # convert frames to grayscale
env = gym.wrappers.FrameStack(env, 4)  # stack 4 frames into one observation
env = MaxAndSkipEnv(env, skip=4)  # frame skipping: repeat each action for 4 frames

The environment env has now been wrapped so that its output images are preprocessed: each observation it produces is a stack of four 84x84 grayscale frames, i.e. a 4x84x84 array.
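
As a quick sanity check of the observation shape (a small sketch, not part of the original code; note that newer gym versions return an (obs, info) tuple from reset()):

obs = env.reset()
print(np.array(obs).shape)  # expected: (4, 84, 84)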

3.2 DQN network definition

Define the DQN network with MindSpore. nn.SequentialCell() can be used directly, following the designed architecture:

class DQN(nn.Cell):
    def __init__(self, nb_actions):
        super().__init__()
        self.network = nn.SequentialCell(
            nn.Conv2d(in_channels=4, out_channels=16, kernel_size=8, stride=4, pad_mode='valid'),   # 4x84x84 -> 16x20x20
            nn.ReLU(),
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=4, stride=2, pad_mode='valid'),  # 16x20x20 -> 32x9x9
            nn.ReLU(),
            nn.Flatten(),                                                                           # 32x9x9 -> 2592
            nn.Dense(in_channels=2592, out_channels=256),
            nn.ReLU(),
            nn.Dense(in_channels=256, out_channels=nb_actions),                                     # one Q value per action
        )

    def construct(self, x):
        return self.network(x / 255.)  # scale pixel values from [0, 255] to [0, 1]

construct() defines the forward computation of the network, similar to forward() in the PyTorch framework; the input image is also scaled here from [0, 255] to [0, 1].
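
For instance, calling an instance of the Cell directly runs construct() (a small shape-check sketch, assuming the numpy and MindSpore imports (np, ms, Tensor) used elsewhere in the script):

test_net = DQN(nb_actions=4)
dummy_obs = Tensor(np.zeros((1, 4, 84, 84)), ms.float32)
print(test_net(dummy_obs).shape)  # expected: (1, 4), one Q value per action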

3.3 Design experience storage pool

class ReplayBuffer():
    def __init__(self, replay_memory_size):
        ...

    def add(self, obs, next_obs, action, reward, done):
        ...

    def sample(self, sample_num):
        ...
        return Tensor(temp_obs, ms.float32), Tensor(temp_next_obs, ms.float32), Tensor(temp_action, ms.int32), Tensor(temp_reward, ms.float32), Tensor(temp_done, ms.float32)

The full code is not reproduced here. Simply put, it stores experience tuples and samples them in batches for the subsequent network training; a minimal sketch of what such a buffer might look like is given below.
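
A minimal sketch of such a buffer (my own illustration, not the author's code, assuming the numpy and MindSpore imports used above): a fixed-size circular buffer backed by numpy arrays, sampled uniformly at random. The action batch is reshaped to (batch, 1) so it can be used directly with gather_elements in the loss function later.

class SimpleReplayBuffer:
    def __init__(self, replay_memory_size, obs_shape=(4, 84, 84)):
        self.size = replay_memory_size
        self.obs = np.zeros((replay_memory_size, *obs_shape), dtype=np.uint8)
        self.next_obs = np.zeros((replay_memory_size, *obs_shape), dtype=np.uint8)
        self.action = np.zeros(replay_memory_size, dtype=np.int32)
        self.reward = np.zeros(replay_memory_size, dtype=np.float32)
        self.done = np.zeros(replay_memory_size, dtype=np.float32)
        self.pos, self.full = 0, False

    def add(self, obs, next_obs, action, reward, done):
        self.obs[self.pos] = np.array(obs)
        self.next_obs[self.pos] = np.array(next_obs)
        self.action[self.pos] = action
        self.reward[self.pos] = reward
        self.done[self.pos] = done
        self.pos = (self.pos + 1) % self.size  # overwrite the oldest entries when full
        self.full = self.full or self.pos == 0

    def sample(self, sample_num):
        upper = self.size if self.full else self.pos
        idx = np.random.randint(0, upper, size=sample_num)  # uniform random sampling
        return (Tensor(self.obs[idx], ms.float32),
                Tensor(self.next_obs[idx], ms.float32),
                Tensor(self.action[idx].reshape(-1, 1), ms.int32),
                Tensor(self.reward[idx], ms.float32),
                Tensor(self.done[idx], ms.float32))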

3.4 Definition of loss function, optimizer, and training function

First instantiate the defined DQN class as q_network, then define the optimizer as nn.Adam and the loss function as nn.HuberLoss():

q_network = DQN(nb_actions=env.action_space.n)  # instantiate the network
optimizer = nn.Adam(params=q_network.trainable_params(), learning_rate=1.25e-4)  # optimizer
loss_fn = nn.HuberLoss()  # loss function

What follows is a step specific to defining network training in MindSpore, called functional automatic differentiation; you can refer to the tutorial on functional automatic differentiation on the official website. Concretely, we first define a loss computation function forward_fn, then generate a gradient computation function grad_fn from it, and finally use the gradient function to define a one-step training function train_step. With train_step, we only need to feed in the required data to update the network parameters once, i.e. to complete one training step.

# loss computation function
def forward_fn(observations, actions, y):
    current_q_value = q_network(observations).gather_elements(dim=1, index=actions).squeeze()  # extract the Q value of the action stored in each experience tuple
    loss = loss_fn(current_q_value, y)
    return loss

The forward_fn function computes the value estimate Q(φ_t, a_t; θ), which is current_q_value in the code. The DQN network is called inside this function, so the gradient of the Loss will later be back-propagated through this computation to the DQN parameters and used to update the network.

Note that the input y, i.e. the target value y_t, is computed outside the function and then passed in. Computing y_t also requires the DQN network, but the gradient of the Loss should not be back-propagated into that computation, otherwise the parameter updates become unstable. Therefore y_t is computed outside the function and then fed to forward_fn.

# loss-and-gradient function
grad_fn = ms.ops.value_and_grad(forward_fn, None, optimizer.parameters)
# Reference: https://www.mindspore.cn/tutorials/zh-CN/r2.1/beginner/autograd.html

# function for one training step
def train_step(observations, actions, y):
    loss, grads = grad_fn(observations, actions, y)
    optimizer(grads)
    return loss

ms.ops.value_and_grad takes the defined loss computation function forward_fn and returns a gradient computation function grad_fn.

Then, in the training function train_step, grad_fn computes the loss and the gradients, and the optimizer applies the gradients to update the network parameters, completing one training step.
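
For example, one parameter update could be triggered like this (a hypothetical call with dummy data, just to show the interface; real training uses batches sampled from the replay buffer, as shown in Section 3.5):

dummy_obs = Tensor(np.zeros((32, 4, 84, 84)), ms.float32)
dummy_actions = Tensor(np.zeros((32, 1)), ms.int32)
dummy_y = Tensor(np.zeros(32), ms.float32)
print(train_step(dummy_obs, dummy_actions, dummy_y))  # prints the loss for this dummy batch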

3.5 Network training

Next, we can train the network. Here are explanations of the key parts of the code:

def Deep_Q_Learning(env, replay_memory_size=100_000, nb_epochs=40000_000, update_frequency=4, batch_size=32,
                    discount_factor=0.99, replay_start_size=5000, initial_exploration=1, final_exploration=0.01,
                    exploration_steps=100_000):

First, define the parameters required for training: the replay buffer capacity is 100_000, the total number of training steps (nb_epochs) is 40000_000, the network parameters are updated every 4 steps, the batch size is 32, the discount factor is 0.99, training starts once the replay buffer contains 5000 transitions, the exploration probability decays from an initial value of 1 to a final value of 0.01, and the decay lasts 100_000 steps.

Exploration means that, so the DQN can learn a better policy, actions are generated randomly at first. The exploration probability ε then decreases gradually until action selection relies almost entirely on the DQN. This is the ε-greedy strategy; a linear decay schedule of this kind is sketched below.
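
A minimal sketch of a linear ε schedule consistent with the parameters above (initial_exploration=1, final_exploration=0.01, exploration_steps=100_000); the author's actual schedule function may differ:

def linear_epsilon(step, initial=1.0, final=0.01, decay_steps=100_000):
    slope = (final - initial) / decay_steps
    return max(final, initial + slope * step)  # decreases linearly, then stays at the final value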

Before training, set the network to training mode:

q_network.set_train()  # set the network to training mode

Then let the DQN interact with the game; the code that generates an action (either random exploration or an action from the DQN) is:

if random.random() < epsilon:  # With probability ε select a random action a
    action = np.array(env.action_space.sample())
else:  # Otherwise select a = max_a Q∗(φ(st), a; θ)
    temp_input = Tensor(obs, ms.float32).unsqueeze(0)
    q_values = q_network(temp_input)
    action = q_values.argmax(axis=1).item().asnumpy()

Save each experience tuple to the experience pool:

rb.add(obs, real_next_obs, action, reward, done)

Sample a batch of experience tuples from the replay buffer, compute the target values y_t, and use the train_step function to update the network parameters:

data_obs, data_next_obs, data_action, data_reward, data_done = rb.sample(batch_size)
# this part does not need gradients, so it is written outside forward_fn and train_step
max_q_value = q_network(data_next_obs).max(1)
y = data_reward.flatten() + discount_factor * max_q_value * (1 - data_done.flatten())
loss = train_step(data_obs, data_action, y)

Note that y_t is computed here because no gradients should be back-propagated into its computation; it is computed first and then passed to the previously defined train_step function to complete one training step.

What follows is the long training process: training on my own laptop took roughly 10 days. I tested the training speed on my laptop and it is about the same as PyTorch. MindSpore can also use the Huawei Ascend 910 (Huawei's AI chip) to accelerate training, but as a poor student I decided against it, so if any generous reader wants to sponsor me to try out that speed, feel free (`・ω・´).

The training curve is shown below. The result reported by the author of a PyTorch implementation on GitHub is about 200 points. Limited by my laptop's 16 GB of memory, the best result achievable with the current replay buffer capacity is about 150 points.

[Figure: training curve]

04

Experimental results

It can be seen that after training, the DQN has learned to play the game. It usually scores about 150 points, and with some luck it can reach 300 points, as in this run:

[Screenshot: a test game scoring about 300 points]

 
