Deep Q-Network (DQN) Project in Practice: CartPole-v0

Abstract: Compared with Q-learning, DQN was designed, in essence, to cope with more complex environments, and after continuous improvement and iteration it became essentially mature with Nature DQN (that is, the version Volodymyr Mnih et al. published in Nature).

This article is shared from the Huawei Cloud Community post "Reinforcement Learning from Basic to Advanced - Case and Practice [4.1]: Deep Q-Network (DQN) Project: CartPole-v0", by Ting.

1. Define the algorithm

Compared with Q-learning, DQN was designed, in essence, to cope with more complex environments, and after continuous improvement and iteration it became essentially mature with Nature DQN (that is, the version Volodymyr Mnih et al. published in Nature). DQN makes three main changes:

  • Replace the Q-table with a deep neural network: with a large or continuous state space, a lookup table can no longer store every state-action value, so a network is used to approximate the Q function instead.
  • Use experience replay (a replay buffer): this has several benefits. One is that training reuses a pool of historical transitions rather than throwing each sample away after a single use, which greatly improves sample efficiency. The other, often mentioned in interviews, is that it reduces the correlation between samples: collecting experience is decoupled, in principle, from the learning phase. The raw time-ordered training data can make learning unstable, and learning from shuffled samples improves training stability, much as shuffling the training set does in supervised deep learning.
  • Use two networks, a policy network and a target network: the policy network's parameters, updated at every step, are copied to the target network only every few steps. This also stabilizes training and keeps the estimated Q-values from diverging. Imagine a transition (introduced in the Q-learning article, so you should remember it!) whose sample badly overestimates the Q-value: if the batches drawn from the replay buffer happen to contain such samples several times in a row, the Q-values can easily diverge and never come back. As an analogy, consider playing an RPG or Breakout: to break a record, some players save and load constantly, reloading an earlier save whenever they make a mistake. Now suppose loading were forbidden, as it effectively is during DQN training, so there is no way to backtrack. You could keep two save files: one saved every frame, and one saved only every so often, right after a good result. The file saved every few steps will generally end up better than the one saved every frame. You could of course keep even more files, i.e. add multiple target networks to DQN, but that is unnecessary, and more networks would not necessarily help much. A minimal sketch of the resulting update step is shown after this list.
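
To make these three changes concrete, here is a minimal sketch of the update step they combine into: sample a batch from the replay buffer, compute the TD target with the frozen target network, and regress the policy network onto it. This is illustrative only, not PARL's internal DQN implementation; the names (dqn_update_sketch, policy_net, target_net) are made up for the example.

import paddle
import paddle.nn.functional as F

def dqn_update_sketch(policy_net, target_net, optimizer, batch, gamma=0.99):
    state, action, reward, next_state, done = batch  # paddle tensors; action has shape (batch, 1), dtype int32
    # Q(s, a) from the policy network, gathered at the actions actually taken
    q_values = paddle.take_along_axis(policy_net(state), action, axis=1)
    # TD target r + gamma * max_a' Q_target(s', a'), with no gradient through the target net
    with paddle.no_grad():
        next_q = target_net(next_state).max(axis=1, keepdim=True)
        target = reward + gamma * (1.0 - done) * next_q
    loss = F.mse_loss(q_values, target)
    optimizer.clear_grad()
    loss.backward()
    optimizer.step()
    return loss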

1.1 Define the model

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

!pip uninstall -y parl
!pip install parl

import parl
from parl.algorithms import DQN

class MLP(parl.Model):
    """Multi-layer perceptron to solve the CartPole problem.
    Args:
        input_dim (int): dimension of the observation space.
        output_dim (int): dimension of the action space.
    """
    def __init__(self, input_dim, output_dim):
        super(MLP, self).__init__()
        hidden_dim1 = 256
        hidden_dim2 = 256
        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.fc3 = nn.Linear(hidden_dim2, output_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
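
As a quick sanity check (illustrative, not part of the original pipeline): CartPole-v0 has a 4-dimensional observation and 2 discrete actions, so the network should map a (batch, 4) tensor to (batch, 2) Q-values.

model = MLP(input_dim=4, output_dim=2)
dummy_state = paddle.randn([1, 4], dtype='float32')
print(model(dummy_state).shape)  # expected: [1, 2], one Q-value per action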

1.2 Define the experience replay buffer

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.buffer = deque(maxlen=self.capacity)

    def push(self, transitions):
        '''Save a transition.
        Args:
            transitions (tuple): one (state, action, reward, next_state, done) tuple.
        '''
        self.buffer.append(transitions)

    def sample(self, batch_size: int, sequential: bool = False):
        if batch_size > len(self.buffer):  # cannot sample more than is stored
            batch_size = len(self.buffer)
        if sequential:  # sequential sampling
            rand = random.randint(0, len(self.buffer) - batch_size)
            batch = [self.buffer[i] for i in range(rand, rand + batch_size)]
            return zip(*batch)
        else:  # random sampling
            batch = random.sample(self.buffer, batch_size)
            return zip(*batch)

    def clear(self):
        self.buffer.clear()

    def __len__(self):
        return len(self.buffer)
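
A brief illustrative usage (not in the original article): push a few transitions, then sample a batch. Because sample returns zip(*batch), unpacking it regroups the tuples into per-field sequences, which is exactly what the agent's update method relies on.

buffer = ReplayBuffer(capacity=100)
for i in range(5):
    # each transition is a (state, action, reward, next_state, done) tuple
    buffer.push((f"s{i}", i % 2, 1.0, f"s{i+1}", False))
states, actions, rewards, next_states, dones = buffer.sample(batch_size=3)
print(actions, rewards)  # e.g. (0, 1, 0) (1.0, 1.0, 1.0)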

1.3 Define the agent

import random
import math
import numpy as np
import paddle
import parl

class DQNAgent(parl.Agent):
    """Agent of DQN."""
    def __init__(self, algorithm, memory, cfg):
        super(DQNAgent, self).__init__(algorithm)
        self.n_actions = cfg['n_actions']
        self.epsilon = cfg['epsilon_start']
        self.sample_count = 0
        self.epsilon_start = cfg['epsilon_start']
        self.epsilon_end = cfg['epsilon_end']
        self.epsilon_decay = cfg['epsilon_decay']
        self.batch_size = cfg['batch_size']
        self.global_step = 0
        self.update_target_steps = 600  # sync the target network every 600 updates
        self.memory = memory  # replay buffer

    def sample_action(self, state):
        self.sample_count += 1
        # epsilon must decay (linearly, exponentially, etc.) to balance exploration and exploitation
        self.epsilon = self.epsilon_end + (self.epsilon_start - self.epsilon_end) * \
            math.exp(-1. * self.sample_count / self.epsilon_decay)
        if random.random() < self.epsilon:
            action = np.random.randint(self.n_actions)
        else:
            action = self.predict_action(state)
        return action

    def predict_action(self, state):
        state = paddle.to_tensor(state, dtype='float32')
        q_values = self.alg.predict(state)  # self.alg is the algorithm passed to parl.Agent
        action = q_values.argmax().numpy()[0]
        return action

    def update(self):
        """Update the model with a batch of transitions.
        Args:
            obs (np.float32): shape (batch_size, obs_dim)
            act (np.int32): shape (batch_size,)
            reward (np.float32): shape (batch_size,)
            next_obs (np.float32): shape (batch_size, obs_dim)
            terminal (np.float32): shape (batch_size,)
        Returns:
            loss (float)
        """
        if len(self.memory) < self.batch_size:  # do not update until memory holds at least one batch
            return
        if self.global_step % self.update_target_steps == 0:
            self.alg.sync_target()
        self.global_step += 1
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = self.memory.sample(
            self.batch_size)
        action_batch = np.expand_dims(action_batch, axis=-1)
        reward_batch = np.expand_dims(reward_batch, axis=-1)
        done_batch = np.expand_dims(done_batch, axis=-1)
        state_batch = paddle.to_tensor(state_batch, dtype='float32')
        action_batch = paddle.to_tensor(action_batch, dtype='int32')
        reward_batch = paddle.to_tensor(reward_batch, dtype='float32')
        next_state_batch = paddle.to_tensor(next_state_batch, dtype='float32')
        done_batch = paddle.to_tensor(done_batch, dtype='float32')
        loss = self.alg.learn(state_batch, action_batch, reward_batch, next_state_batch, done_batch)
        return loss
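
The exploration schedule in sample_action is an exponential decay, epsilon = epsilon_end + (epsilon_start - epsilon_end) * exp(-t / epsilon_decay), where t is the number of actions sampled so far. A standalone illustration with this project's default values (0.95, 0.01, 200):

import math

epsilon_start, epsilon_end, epsilon_decay = 0.95, 0.01, 200
for t in [0, 100, 200, 500, 1000]:
    epsilon = epsilon_end + (epsilon_start - epsilon_end) * math.exp(-t / epsilon_decay)
    print(f"step {t}: epsilon = {epsilon:.3f}")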

2. Define training

def train(cfg, env, agent):
    '''Training loop.
    '''
    print("Start training!")
    print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
    rewards = []  # record rewards for all episodes
    steps = []
    for i_ep in range(cfg["train_eps"]):
        ep_reward = 0  # reward per episode
        ep_step = 0
        state = env.reset()  # reset and obtain the initial state
        for _ in range(cfg['ep_max_steps']):
            ep_step += 1
            action = agent.sample_action(state)  # sample an action
            next_state, reward, done, _ = env.step(action)  # step the env and get the transition
            agent.memory.push((state, action, reward, next_state, done))  # save the transition
            state = next_state  # move on to the next state
            agent.update()  # update the agent
            ep_reward += reward
            if done:
                break
        steps.append(ep_step)
        rewards.append(ep_reward)
        if (i_ep + 1) % 10 == 0:
            print(f"Episode: {i_ep+1}/{cfg['train_eps']}, Reward: {ep_reward:.2f}, Epsilon: {agent.epsilon:.3f}")
    print("Finished training!")
    env.close()
    res_dic = {'episodes': range(len(rewards)), 'rewards': rewards, 'steps': steps}
    return res_dic

def test(cfg, env, agent):
    print("Start testing!")
    print(f"Env: {cfg['env_name']}, Algorithm: {cfg['algo_name']}, Device: {cfg['device']}")
    rewards = []  # record rewards for all episodes
    steps = []
    for i_ep in range(cfg['test_eps']):
        ep_reward = 0  # reward per episode
        ep_step = 0
        state = env.reset()  # reset and obtain the initial state
        for _ in range(cfg['ep_max_steps']):
            ep_step += 1
            action = agent.predict_action(state)  # act greedily at test time
            next_state, reward, done, _ = env.step(action)
            state = next_state
            ep_reward += reward
            if done:
                break
        steps.append(ep_step)
        rewards.append(ep_reward)
        print(f"Episode: {i_ep+1}/{cfg['test_eps']}, Reward: {ep_reward:.2f}")
    print("Finished testing!")
    env.close()
    return {'episodes': range(len(rewards)), 'rewards': rewards, 'steps': steps}

3. Define the environment

In fact, OpenAI Gym integrates a great many reinforcement learning environments, more than enough for learning, but in real applications of reinforcement learning you will inevitably have to build an environment yourself. In the earlier Q-learning project of this series, for example, a custom environment had to be written, because tabular Q-learning is genuinely weak and needs a sufficiently simple environment; take a look if you are interested. The most critical parts of the general environment interface are reset and step.
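
As a reference for building such environments, here is a minimal custom-environment skeleton (illustrative only; this project uses Gym's built-in CartPole-v0). It follows the old pre-0.26 gym API used throughout this article, where reset returns only the observation and step returns a 4-tuple.

import gym
import numpy as np
from gym import spaces

class MyEnv(gym.Env):
    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self._t = 0

    def reset(self):
        self._t = 0
        return np.zeros(4, dtype=np.float32)  # initial observation

    def step(self, action):
        self._t += 1
        obs = np.random.uniform(-1.0, 1.0, size=4).astype(np.float32)  # placeholder dynamics
        reward = 1.0
        done = self._t >= 200  # episode ends after 200 steps
        return obs, reward, done, {}

With the interface clear, we now create the actual CartPole-v0 environment and the agent: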

import random
import gym
import numpy as np
import paddle
from parl.algorithms import DQN

def all_seed(env, seed=1):
    '''Set all random seeds for reproducible RL. Note the position of this call:
    it is best placed right after the env creation.
    Args:
        env: the gym environment to seed.
        seed (int, optional): random seed. Defaults to 1.
    '''
    print(f"seed = {seed}")
    env.seed(seed)  # env config
    np.random.seed(seed)
    random.seed(seed)
    paddle.seed(seed)

def env_agent_config(cfg):
    '''Create the env and the agent.
    '''
    env = gym.make(cfg['env_name'])
    if cfg['seed'] != 0:  # set the random seed
        all_seed(env, seed=cfg["seed"])
    n_states = env.observation_space.shape[0]  # state dimension
    n_actions = env.action_space.n  # action dimension
    print(f"n_states: {n_states}, n_actions: {n_actions}")
    cfg.update({"n_states": n_states, "n_actions": n_actions})  # write back into cfg
    model = MLP(n_states, n_actions)
    algo = DQN(model, gamma=cfg['gamma'], lr=cfg['lr'])
    memory = ReplayBuffer(cfg["memory_capacity"])  # replay buffer
    agent = DQNAgent(algo, memory, cfg)  # create the agent
    return env, agent

4. Setting parameters

With that, all of the DQN modules are complete. Below we set some parameters to make everyone's "alchemy" (hyperparameter tuning) easier; the defaults are values the author has already tuned. In addition, a plotting function is defined to show how the reward changes.

import argparse
import seaborn as sns
import matplotlib.pyplot as plt

def get_args():
    """Hyperparameters.
    """
    parser = argparse.ArgumentParser(description="hyperparameters")
    parser.add_argument('--algo_name', default='DQN', type=str, help="name of algorithm")
    parser.add_argument('--env_name', default='CartPole-v0', type=str, help="name of environment")
    parser.add_argument('--train_eps', default=200, type=int, help="episodes of training")
    parser.add_argument('--test_eps', default=20, type=int, help="episodes of testing")
    parser.add_argument('--ep_max_steps', default=100000, type=int,
                        help="steps per episode; a much larger value can simulate infinite steps")
    parser.add_argument('--gamma', default=0.99, type=float, help="discount factor")
    parser.add_argument('--epsilon_start', default=0.95, type=float, help="initial epsilon in the e-greedy policy")
    parser.add_argument('--epsilon_end', default=0.01, type=float, help="final epsilon in the e-greedy policy")
    parser.add_argument('--epsilon_decay', default=200, type=int, help="decay rate of epsilon in the e-greedy policy")
    parser.add_argument('--memory_capacity', default=200000, type=int, help="capacity of the replay memory")
    parser.add_argument('--memory_warmup_size', default=200, type=int, help="warm-up size of the replay memory")
    parser.add_argument('--batch_size', default=64, type=int, help="batch size of training")
    parser.add_argument('--targe_update_fre', default=200, type=int, help="frequency of target network update")
    parser.add_argument('--seed', default=10, type=int, help="random seed")
    parser.add_argument('--lr', default=0.0001, type=float, help="learning rate")
    parser.add_argument('--device', default='cpu', type=str, help="cpu or gpu")
    args = parser.parse_args([])
    args = {**vars(args)}  # convert to a dict
    return args
def smooth(data, weight=0.9):
    '''Smooth a curve, similar to the smoothing in TensorBoard.
    Args:
        data (List): input data
        weight (float): smoothing weight in [0, 1]; the higher, the smoother, usually 0.9
    Returns:
        smoothed (List): smoothed data
    '''
    last = data[0]  # first value in the plot (first timestep)
    smoothed = list()
    for point in data:
        smoothed_val = last * weight + (1 - weight) * point  # compute the smoothed value
        smoothed.append(smoothed_val)
        last = smoothed_val
    return smoothed
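
smooth is simply an exponential moving average, smoothed[t] = weight * smoothed[t-1] + (1 - weight) * data[t], seeded with the first data point. A quick illustrative check (not in the original article):

print(smooth([0, 10, 10, 10], weight=0.9))
# [0.0, 1.0, 1.9, 2.71]: the smoothed curve lags behind the raw data, which is the point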
def plot_rewards(rewards, cfg, path=None, tag='train'):
    sns.set()
    plt.figure()  # create a new figure so several plots can coexist
    plt.title(f"{tag}ing curve on {cfg['device']} of {cfg['algo_name']} for {cfg['env_name']}")
    plt.xlabel('episodes')
    plt.plot(rewards, label='rewards')
    plt.plot(smooth(rewards), label='smoothed')
    plt.legend()
    plt.show()

5. Training

# Get the parameters
cfg = get_args()
# Train
env, agent = env_agent_config(cfg)
res_dic = train(cfg, env, agent)
plot_rewards(res_dic['rewards'], cfg, tag="train")
# Test
res_dic = test(cfg, env, agent)
plot_rewards(res_dic['rewards'], cfg, tag="test")  # plot the results

seed = 10
n_states: 4, n_actions: 2
Start training!
Env: CartPole-v0, Algorithm: DQN, Device: cpu
Episode: 10/200, Reward: 10.00, Epsilon: 0.062
Episode: 20/200, Reward: 85.00, Epsilon: 0.014
Episode: 30/200, Reward: 41.00, Epsilon: 0.011
Episode: 40/200, Reward: 31.00, Epsilon: 0.010
Episode: 50/200, Reward: 22.00, Epsilon: 0.010
Episode: 60/200, Reward: 10.00, Epsilon: 0.010
Episode: 70/200, Reward: 10.00, Epsilon: 0.010
Episode: 80/200, Reward: 22.00, Epsilon: 0.010
Episode: 90/200, Reward: 30.00, Epsilon: 0.010
Episode: 100/200, Reward: 20.00, Epsilon: 0.010
Episode: 110/200, Reward: 15.00, Epsilon: 0.010
Episode: 120/200, Reward: 45.00, Epsilon: 0.010
Episode: 130/200, Reward: 73.00, Epsilon: 0.010
Episode: 140/200, Reward: 180.00, Epsilon: 0.010
Episode: 150/200, Reward: 167.00, Epsilon: 0.010
Episode: 160/200, Reward: 200.00, Epsilon: 0.010
Episode: 170/200, Reward: 165.00, Epsilon: 0.010
Episode: 180/200, Reward: 200.00, Epsilon: 0.010
Episode: 190/200, Reward: 200.00, Epsilon: 0.010

