Distributed multi-process accelerated DQN algorithm

Distributed multi-process CPU acceleration for Deep Q-Networks (DQN)

Significance: Python has long been criticized for being slow. Because of the GIL (Global Interpreter Lock), an ordinary Python program can only be executed by one CPU core at a time. It is now the end of 2022, and which of our computers does not have an 8-core CPU or better? If we never turn on multi-processing, we are simply wasting our lives; how many "3 seconds" do we have to spare? Besides, if we train the agent with Python's default single process and the environment gets complicated, 1000 training episodes can easily take 12 hours, which is enough to make us die of impatience. With the multiprocessing library, however, our 8-core machine can run, say, 6 cores at the same time, so the original 12 hours of training now takes only x hours (0 < x < 12).

How to combine multi-processing with DQN?

At present, there are roughly two common ideas for combining multi-processing with deep reinforcement learning algorithms:

The first, and the easiest to think of:

1. Every sub-process trains the network: each one interacts with the environment independently, collects data into its own memory bank, and computes its own network weight parameters.
2. The main process averages the weights of the sub-process networks and loads the result into the shared net (a minimal parameter-averaging sketch follows this list).
3. The main process then passes net back to the sub-processes, and we return to step 1.
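To make step 2 concrete, here is a minimal sketch of the parameter-averaging step, assuming every worker returns a state_dict of the same network architecture (average_state_dicts and worker_state_dicts are illustrative names, not part of the code later in this post):

import torch
from copy import deepcopy

def average_state_dicts(state_dicts):
    # average each parameter tensor across the workers' state_dicts
    avg = deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts], dim=0).mean(dim=0)
    return avg

# net.load_state_dict(average_state_dicts(worker_state_dicts))  # update the shared net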

The second, and the more mainstream, solution:

1. The sub-processes do not train the network at all; they only take the main process's network, use it to explore the environment, and send the collected data back to the main process through a pipe (an inter-process communication mechanism; a minimal Pipe example follows this list).
2. The main process throws all the data collected by the sub-processes into the memory bank and trains the network on it.
3. The main process passes the updated net to the sub-processes, and we return to step 1.
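And here is a minimal, self-contained illustration of the Pipe mechanism from step 1; the worker function and the data it sends are placeholders, not the actual DQN code:

import multiprocessing as mp

def worker(pipe):
    net = pipe.recv()                        # blocks until the main process sends the network
    pipe.send({'s': 0, 'a': 1, 'r': 0.5})    # send one (placeholder) transition back

if __name__ == '__main__':
    child_pipe, main_pipe = mp.Pipe()        # a duplex pipe: two connected endpoints
    p = mp.Process(target=worker, args=(child_pipe,))
    p.start()
    main_pipe.send('dummy net')              # in the real code this is a torch module
    print(main_pipe.recv())                  # the transition sent back by the worker
    p.join()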

Now let's start experimenting with the second option. Why not start with the first one? Because I am not very interested in it; I will leave the first option for later, when I have some free time.

OK, first of all, the idea of option 2 is that multiple sub-processes interact with the environment independently. There is no doubt that we first need to initialize N environments and N sub-processes so that these environments can run at the same time.

for i in range(PROCESS_NUM):
    p = mp.Process(target=process_env, args=("MountainCar-v0", 'process {}'.format(i),))
    p.start()

where process_env is:

def process_env(env_name, name):
    print(f'Child process: {name} ({os.getpid()}) started...')
    env = gym.make(env_name).unwrapped
    s = env.reset()
    a = Agent()
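Put together, a runnable minimal version of the two snippets above might look like this (the if __name__ == '__main__': guard matters because Windows starts processes with spawn; the Agent line is left out here since the class is only defined further below):

import os
import multiprocessing as mp
import gym

PROCESS_NUM = 4

def process_env(env_name, name):
    print(f'Child process: {name} ({os.getpid()}) started...')
    env = gym.make(env_name).unwrapped
    s = env.reset()
    # ... the interaction loop goes here ...

if __name__ == '__main__':
    processes = []
    for i in range(PROCESS_NUM):
        p = mp.Process(target=process_env, args=("MountainCar-v0", 'process {}'.format(i)))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()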

Have we successfully started multiple processes? The test results are as follows:

[Screenshot: each child process prints its name and PID, confirming that the processes started.]
Secondly, we also need to define an Agent class. N different environments should have N different agents interacting with them; that is, each agent has its own choose_action function, while the memory bank and the learn function are shared by all of them.

At this point, the full code is as follows. Take it away, no thanks needed; just copy it and use it, and go easy on me!

# -*- coding: utf-8 -*-
#Author: Bright Fang
#Created: 2022/10/29 15:24
import torch
import torch.nn as nn
import torch.nn.functional as F
import multiprocessing as mp
from multiprocessing import Pipe
from copy import deepcopy
import numpy as np
import gym
from matplotlib import pyplot as plt
import os
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'
Greedy=0.9
MemoryCapacity=2000
LearnSwitch=200
Batch=64
Gamma=0.9
LearningRate=0.01
RENDER=False
Switch=0
PROCESS_NUM=4
env = gym.make("CartPole-v1").unwrapped
'''The CartPole state consists of four features: cart position x, cart velocity x_dot, pole angle theta, and pole angular velocity theta_dot; the state space is therefore continuous (infinitely many states). The actions are push left (0) and push right (1), i.e. discrete, with 2 actions.'''
state_number=env.observation_space.shape[0]
action_number=env.action_space.n

def process_env(env_name,pipe):
    env=gym.make(env_name).unwrapped
    s=env.reset()
    reward=0
    while True:
        net=pipe.recv()
        a=Agent(net.cpu())
        action=a.choose_action(s,Greedy)
        s_, r, done, info = env.step(action)
        # env.render()
        x, x_dot, theta, theta_dot = s_
        r1 = (env.x_threshold - abs(x)) / env.x_threshold - 0.8
        r2 = (env.theta_threshold_radians - abs(theta)) / env.theta_threshold_radians - 0.5
        r = 3 * r1 + r2
        # pos,vel=s_
        # if pos>=0.5:
        #     r=100
        reward=reward+r
        data=np.hstack((s,action,r,s_))
        pipe.send(data)
        s=s_
        if done:
            s = env.reset()
            print('r',reward)
            if reward>-150:
                save_data={'net': a.real_net.state_dict()}
                torch.save(save_data, r"E:\process_model_mountaincar.pth")
            reward=0
        # the child process now runs forever
        # if done:
        #     break

'''Build the neural network'''
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.in_to_y1=nn.Linear(state_number,20)
        self.in_to_y1.weight.data.normal_(0,0.1)
        self.in_to_y2=nn.Linear(20,10)
        self.in_to_y2.weight.data.normal_(0,0.1)
        self.out=nn.Linear(10,action_number)
        self.out.weight.data.normal_(0,0.1)
    def forward(self,inputstate):
        inputstate=self.in_to_y1(inputstate)
        inputstate=F.relu(inputstate)
        inputstate=self.in_to_y2(inputstate)
        inputstate=torch.sigmoid(inputstate)
        action_Q=self.out(inputstate)
        return action_Q
'''Step 2: define the action-selection function; it takes a state and outputs an action'''
class DQN():
    def __init__(self):
        self.real_net,self.target_net=Net().cuda(),Net().cuda()
        self.memory_counter=0
        self.mem=np.zeros((MemoryCapacity,state_number*2+2))
        self.learn_step=0
        self.random_step=0
        self.act_his=0
        self.lossfunc=nn.MSELoss()
        self.optimizer=torch.optim.Adam(self.real_net.parameters(),lr=LearningRate)

    '''Step 3: define the memory bank (replay buffer) and sample transitions from it'''
    def store_transition(self,tran):
        # tran=np.hstack((s,a,r,s_))
        index=self.memory_counter%MemoryCapacity
        self.mem[index,:]=tran
        self.memory_counter+=1
    '''Step 4: the Q-learning algorithm'''
    def learn(self):
        if self.learn_step%LearnSwitch==0:
            self.target_net.load_state_dict(self.real_net.state_dict())
        self.learn_step+=1
        sample_index=np.random.choice(MemoryCapacity,Batch)
        new_mem=self.mem[sample_index,:]
        b_s=torch.FloatTensor(new_mem[:,0:state_number]).cuda()
        b_a=torch.LongTensor(new_mem[:,state_number:state_number+1]).cuda()
        b_r=torch.FloatTensor(new_mem[:,state_number+1:state_number+2]).cuda()
        b_s_=torch.FloatTensor(new_mem[:,-state_number:]).cuda()
        real_Q=self.real_net(b_s).gather(1,b_a)
        next_Q=self.target_net(b_s_).detach()
        target_Q=b_r+Gamma*next_Q.max(1)[0].view(Batch,1)
        loss=self.lossfunc(real_Q,target_Q)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

class Agent():
    def __init__(self,net):
        self.real_net=net
        self.optimizer = torch.optim.Adam(self.real_net.parameters(), lr=LearningRate)
    def choose_action(self,inputstate,G=Greedy):
        inputstate=torch.unsqueeze(torch.FloatTensor(inputstate), 0)
        if np.random.uniform()<G:
            action_Q=self.real_net.forward(inputstate)
            action=torch.max(action_Q,1)[1].item()
        else:
            action = np.random.randint(0, action_number)
        return action
'''Training'''
if __name__ == '__main__':
    if Switch==0:
        print("训练中...")
        net=Net()#在主进程里定义一个net,让所有的子进程的神经网络的权重初始值相同
        f = DQN()
        #让主进程里的real_net网络和子进程的real_net网络参数在初始时 相同
        f.real_net.load_state_dict(net.state_dict())
        pipe_dict = {i: Pipe() for i in range(PROCESS_NUM)}  # pipe_dict[i] = (child_pipe, main_pipe) for child process i
        [pipe_dict[j][1].send(net) for j in range(PROCESS_NUM)]
        for i in range(PROCESS_NUM):
            p=mp.Process(target=process_env,args=("CartPole-v1",pipe_dict[i][0],))
            p.start()
        while True:
            # if data[3]>50:
            #     print('**************************',data[3])
            for j in range(PROCESS_NUM):
                data=pipe_dict[j][1].recv()
                f.store_transition(data)
            if f.memory_counter>MemoryCapacity and f.memory_counter%5==0:
                f.learn()
                net.load_state_dict(f.real_net.state_dict())
            [pipe_dict[j][1].send(net) for j in range(PROCESS_NUM)]# send the main process's network to the child processes
    else:
        '''Offline test with the trained network parameters'''
        print("Testing DQN...")
        c=DQN()
        checkpoint = torch.load(r"E:\process_model_mountaincar.pth")
        c.real_net.load_state_dict(checkpoint['net'])
        for j in range(10):
            state = env.reset()
            total_rewards = 0
            while True:
                env.render()
                state = torch.unsqueeze(torch.FloatTensor(state), 0).cuda()
                action_Q = c.real_net.forward(state)
                action = torch.max(action_Q, 1)[1].item()
                new_state, reward, done, info = env.step(action)  # execute the action
                total_rewards += reward
                if done:
                    print("Score", total_rewards)
                    break
                state = new_state
        env.close()

Code usage:
First set the Switch flag to 0 and train; you can stop the training after about 29 seconds (no need to wait longer), because the network parameters have already been saved to the E drive. Then set the Switch flag to 1 and you can see the effect of the training.
Remarks:
1. The network parameters are saved to the computer's E drive. Don't tell me your computer has no E drive; I haven't changed the path in the code myself, so modify it if you need to.
2. The version information probably doesn't matter much, but for reference: gym 0.20.0 and PyTorch 1.10.0+cu113.

Multi-process CPU acceleration effect test:

1. First, let's test whether the code converges. Everyone knows how important convergence is in reinforcement learning; if it doesn't converge, feel free to tell this blogger to get lost and stop wasting everyone's time!!

Rather than trying to describe the convergence in words, which would be unconvincing, it is better to let you see it for yourself.

[Video: multi-process CartPole environment convergence test]

2. The video above does show one thing: there is nothing wrong with the multi-process code itself, and the inter-process data communication and the network-parameter passing all work normally. But what about the acceleration? Does it actually speed things up or slow them down, and if it speeds things up, by how much? After all, if the speed-up is not impressive, who would risk going bald over multi-process code?

Now let us start testing the superiority of the multi-process algorithm:
Updated on November 2, 2022: the speed tests are still in progress and will keep going for a while...

Some notes on what the author was thinking while writing the code:

1. The sub-processes run on the CPU and only interact with the environment; they do not train and never call the learn function. They just collect the data (s, a, r, s_). Storing data into the memory bank is not done in the sub-processes. When writing the code there were indeed two choices. The first was to push data into the memory bank inside the sub-processes, but that means sharing the memory bank between sub-processes: while one sub-process is writing to it, another may also be updating it, and it is easy to run into data-safety problems if you are not careful. Considering that my IQ is 9, I decisively gave up on that plan. The second was for the sub-processes to only pass the data to the main process, with the memory-bank storing done in the main process. The code uses the second approach.

2. Only the sub-processes collect data (i.e. call choose_action), and collecting data is all they do. Storing transitions and learn exist only in the main process, and the main process does not collect data itself.

3. Before the multi-process rewrite, the code executed like this: the main process collects data, then it stops collecting and learns, then it stops learning and collects data again, and so on, forever.

After the multi-process rewrite, the acceleration principle is this: the main process is always learning (training the network weights), and while it is learning, the 4 sub-processes are collecting data for it. learn and choose_action run at the same time; in other words, we can now collect data while learning and do two things at once.

Let me flag a few pitfalls for you:

The number you get from this kind of command is not the number of physical CPU cores:

[Screenshot: the command printing 12 on the author's machine]
My computer is a Dell G15 gaming laptop (i5-11260H + RTX 3050). Does the number 12 above mean it is a 12-core machine? No!!! In practice, far from running 12 cores at once, the machine falls over even when I start 6, as shown below:
With 6 processes started, an error message appears: the page file is too small and the operation cannot be completed.

After some searching on Baidu, you can supposedly fix this error by adjusting the Windows page-file settings. But that is not the real cause of the error; the real reason is that my computer only has a 6-core CPU. How do I know it has 6 cores? Open the Task Manager:
[Screenshot: Task Manager showing 6 cores]
No wonder I can open at most 5 child processes o( ̄▽ ̄)o. For the computer's safety, I only open 4 processes.
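If you would rather check this from Python than from the Task Manager: multiprocessing.cpu_count() and os.cpu_count() report logical processors (hardware threads), not physical cores; the third-party psutil package (an extra dependency, not used in the code above) can report both. A quick check might look like this:

import os
import multiprocessing as mp

print(mp.cpu_count())    # logical processors, e.g. 12 on a 6-core CPU with hyper-threading
print(os.cpu_count())    # same value

# with psutil installed (pip install psutil):
# import psutil
# print(psutil.cpu_count(logical=False))   # physical cores, e.g. 6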

A small detail borrowed from someone else:

When the sub-processes explore the environment, their models can all run on the CPU, which prevents the GPU memory from overflowing. The main process's model is the one we actually update, so it runs on the GPU.
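As a rough, self-contained sketch of that device split (the two-layer-free stand-in net below is just for illustration, not the actual Net class from the listing):

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

train_net = nn.Linear(4, 2).to(device)   # the copy that gets gradient updates (GPU if available)
worker_net = nn.Linear(4, 2)             # the copy that is sent through the pipes (stays on CPU)

# after each learning step, sync the latest weights back to the CPU copy
worker_net.load_state_dict(train_net.state_dict())

# in a child process, anything received is forced onto the CPU before acting:
# agent = Agent(pipe.recv().cpu())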

This way, the utilization of both the CPU and the GPU can stay above 90%. If we only used the CPU, its utilization would often hit 99% while the GPU sat at 1%; if we only used the GPU to train the model, the GPU would run like crazy while the CPU loafed around. To be even-handed, I don't favor either the CPU or the GPU; they all run hard for me!!! Squeezing out all of the computer's computing resources at once, I am simply a beast o( )o. I'm sure my computer wants to thank me.

The computer said: I would really like to thank you!

Me: You’re welcome!

References:

Bilibili video: Python concurrent programming in practice, using multi-threading, multi-processing, and coroutines to speed up program execution

CSDN: DPPO deep reinforcement learning algorithm implementation ideas (distributed multi-process acceleration)

Original article: blog.csdn.net/fangchenglia/article/details/127672391