DQN with Target code implementation

  • Use TensorFlow (with TensorLayer) to implement DQN with an experience replay buffer and a target network.
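For reference, the quantity the replay step below computes is the standard TD target with the target network $Q_{\theta^-}$, together with a mean-squared-error loss over each sampled batch:

$$y_i = r_i + (1 - d_i)\,\gamma \max_{a'} Q_{\theta^-}(s'_i, a'), \qquad L(\theta) = \frac{1}{B}\sum_{i=1}^{B}\bigl(y_i - Q_\theta(s_i, a_i)\bigr)^2$$

where $d_i$ is the done flag, $\gamma$ is args.gamma and $B$ is args.batch_size.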

Code and explanation

1. Hyperparameter setting

import argparse
parser = argparse.ArgumentParser()  
parser.add_argument('--train', dest='train', default=True)  
parser.add_argument('--test', dest='test', default=False)  
# parser.add_argument('--train', dest='train', default=False)  
# parser.add_argument('--test', dest='test', default=True)  

parser.add_argument('--gamma', type=float, default=0.95)  
parser.add_argument('--lr', type=float, default=0.005)  
parser.add_argument('--batch_size', type=int, default=128)  
parser.add_argument('--eps', type=float, default=0.2)  
parser.add_argument('--train_episodes', type=int, default=10000)  
parser.add_argument('--test_episodes', type=int, default=10)  
args = parser.parse_args()  
argparse module
  • argparse is a built-in Python module for parsing command-line options and arguments. By defining the parameters the program needs, argparse parses them from sys.argv and automatically generates help and usage messages.
  • Using the argparse module usually involves the following four steps
    • import argparse imports the module
    • parser = argparse.ArgumentParser() creates a parser object
    • parser.add_argument() adds the command-line arguments and options you care about to this object
    • args = parser.parse_args() performs the parsing
  • Commonly used parameters of add_argument
    • The first parameter is the name
    • type: the parameter's type
    • default: the parameter's default value
    • choices: the allowed set of values
    • dest: the attribute name under which the parsed value is stored
parser.add_argument('--abs',type=int,default=10,dest = 'world')  
args = parser.parse_args()  
print('read in %s'%(args.world))  #read in 10

2. ReplayBuffer implementation

import random  
import numpy as np

class ReplayBuffer:  
	def __init__(self, capacity=10000):  
		self.capacity = capacity  
		self.buffer = []
		# when the buffer is full, wrap around and reuse it from the beginning
		self.position = 0  
	
	def push(self, state, action, reward, next_state, done):  
		if len(self.buffer) < self.capacity:  
			self.buffer.append(None)  
		self.buffer[self.position] = (state, action, reward, next_state, done)  
		self.position = int((self.position + 1) % self.capacity)  

	def sample(self, batch_size = args.batch_size): 
		# randomly sample batch_size transitions from the buffer
		batch = random.sample(self.buffer, batch_size)
		# regroup the sampled transitions into separate arrays
		state, action, reward, next_state, done = map(np.stack, zip(*batch))  
		return state, action, reward, next_state, done
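A minimal usage sketch of this ReplayBuffer (assuming the hyperparameter block from section 1 and the class above have been run; the 2-dimensional toy states and the batch size of 4 are made up for illustration):

import numpy as np

buffer = ReplayBuffer(capacity=100)
# push a few fake transitions
for i in range(5):
	s = np.array([i, i + 1], dtype=np.float32)
	s_next = np.array([i + 1, i + 2], dtype=np.float32)
	buffer.push(s, 0, 1.0, s_next, False)

state, action, reward, next_state, done = buffer.sample(batch_size=4)
print(state.shape)   # (4, 2) -- batch_size x state_dim
print(done)          # e.g. [False False False False]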
zip function
  • The zip() function takes iterable objects as parameters, aggregates the corresponding elements into tuples, and (in Python 3) returns an iterator over these tuples.
a = [1,2,3]  
b = [4,5,6,7]   
for z in zip(a, b):  
	print(z)
Output:
(1, 4)
(2, 5)
(3, 6)
The function of *
  • The * sign is mainly used to pass function parameters
  • In the function definition, the * sign is used to pack parameters
def f(*args):  
	print(args)  
f(1,2,3,4)
# Output: (1, 2, 3, 4)
  • In function calls, the * sign is used to unpack parameters
def f(a,b,c,d):  
	print(a,b,c,d)  
datas = [1,3,4,5]  
f(*datas)
# Output: 1 3 4 5
The role of np.stack
  • np.stack() stacks a sequence of arrays along a new dimension.
  • np.stack() has two main parameters: arrays and axis, with axis defaulting to 0.
    • axis=0 stacks the arrays along a new first dimension (each array becomes a row); axis=1 stacks them along a new second dimension (each array becomes a column).
    • All arrays must have the same shape.
  • When axis=0
a = [1,2,3]  
b = [4,5,6]  
c = [7,9,9]  
d = np.stack((a,b,c), axis=0)  
print(d)

Output:
[[1 2 3]
 [4 5 6]
 [7 9 9]]
  • When axis=1
a = [1,2,3]  
b = [4,5,6]  
c = [7,9,9]  
d = np.stack((a,b,c), axis=1)  
print(d)

Output:
[[1 4 7]
 [2 5 9]
 [3 6 9]]
map function
  • The map(func, iterable) function applies func to each element of iterable (which must be an iterable object) and returns an iterator over the results.
def power(x):  
	return x*x  
List = [1,2,3]  
print(list(map(power, List)))
# Output: [1, 4, 9]
Summary of map(np.stack, zip(*batch))
  • batch is a list of transitions (done omitted for brevity): $[(s_1,a_1,r_1,s_2),(s_2,a_2,r_2,s_3),\dots,(s_t,a_t,r_t,s_{t+1})]$
  • The * sign unpacks batch, stripping the list brackets (i.e. []) and passing the tuples to zip. zip then returns $[(s_1,s_2,\dots,s_t),(a_1,a_2,\dots,a_t),(r_1,r_2,\dots,r_t),(s_2,s_3,\dots,s_{t+1})]$
    • Strictly speaking, zip returns an iterator rather than a list; it is written as a list above only for readability.
  • map applies np.stack to each element of the iterable returned by zip, stacking the arrays along dimension 0; the resulting arrays are then unpacked into state, action, reward, next_state, done (a small sketch follows below).
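A small self-contained sketch (with made-up 2-dimensional states) of how zip(*batch) followed by map(np.stack, ...) regroups a batch of transitions:

import numpy as np

batch = [
	(np.array([0.0, 1.0]), 0, 1.0, np.array([0.1, 1.1]), False),
	(np.array([0.1, 1.1]), 1, 0.5, np.array([0.2, 1.2]), True),
]
state, action, reward, next_state, done = map(np.stack, zip(*batch))
print(state.shape)   # (2, 2) -- batch_size x state_dim
print(action)        # [0 1]
print(reward)        # [1.  0.5]
print(done)          # [False  True]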

3. Implementation of Agent class

  • The Agent class mainly implements 8 methods.
    • __init__: Initialize the agent.
    • target_update: used to update the target network.
    • choose_action: Choose action.
    • replay: Update the value function using gradient descent.
    • test_episode: used to test the model.
    • train: run the training loop (collect experience and trigger network updates).
    • saveModel: Save the model.
    • loadModel: Load model.
3.1. __init__
  • Initialize the agent.
import tensorflow as tf  
import tensorlayer as tl

def __init__(self, env):
	# cnt is used so that the target network is updated only once every several replay calls
	self.cnt = 0

	self.env = env  
	self.state_dim = self.env.observation_space.shape[0]
	# only discrete action spaces have the self.env.action_space.n attribute
	self.action_dim = self.env.action_space.n  

	def create_model(input_state_shape):  
		input_layer = tl.layers.Input(input_state_shape)
		# the first hidden layer has n_units neurons with ReLU activation
		layer_1 = tl.layers.Dense(n_units=64, act=tf.nn.relu)(input_layer)  
		layer_2 = tl.layers.Dense(n_units=32, act=tf.nn.relu)(layer_1)  
		output_layer = tl.layers.Dense(n_units=self.action_dim)(layer_2)
		return tl.models.Model(inputs=input_layer, outputs=output_layer)  

	self.model = create_model([None, self.state_dim])  
	self.target_model = create_model([None, self.state_dim])
	# set the model to training mode
	self.model.train()  
	# set the target model to evaluation mode
	self.target_model.eval()  
	self.model_optim = self.target_model_optim = tf.optimizers.Adam(lr=args.lr)  

	self.epsilon = args.eps  

	self.buffer = ReplayBuffer()
  • train mode: enables BatchNormalization and Dropout; use it during training.
  • eval mode: disables BatchNormalization and Dropout; use it during evaluation (or testing).
  • input_state_shape: here it is set to [None, self.state_dim]; None is the batch dimension, so the model accepts any number of states at once, each of dimension state_dim.
import numpy as np  
a = np.array([1,2,3,4])  
b = a.reshape([1,4])  
print(b)
# Output: [[1 2 3 4]]

The state returned by env is a one-dimensional array. The model, however, usually reads a whole batch of states at a time, so the shape of the states fed to the model should be (batch_size, self.state_dim).

import gym  
env = gym.make('LunarLander-v2')  
state = env.reset()  
print(state)
# Output: [ 0.00677748  1.4212346   0.6864661   0.45840305 -0.00784658 -0.15549478
#   0.          0.        ]
3.2. target_update
def target_update(self):  
	"""Copy q network to target q network"""  
	for weights, target_weights in zip(  
			self.model.trainable_weights, self.target_model.trainable_weights):  
		target_weights.assign(weights)
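A quick sanity-check sketch (assuming an Agent instance named agent, a hypothetical name, has already been created from the class in this section): after calling target_update, every weight tensor of the two networks should match.

import numpy as np

agent.target_update()
for w, tw in zip(agent.model.trainable_weights, agent.target_model.trainable_weights):
	assert np.allclose(w.numpy(), tw.numpy())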
3.3. choose_action
def choose_action(self, state):  
	if np.random.uniform() < self.epsilon:  
		return np.random.choice(self.action_dim)  
	else:  
		q_value = self.model(state[np.newaxis, :])[0]  
		return np.argmax(q_value)
  • np.random.uniform(low=0.0, high=1.0) generates a random float; the default range is [0, 1).
  • choose_action first draws a random number in [0, 1). If it is less than ε, a random action is chosen (exploration); otherwise the value network evaluates the current state and the action with the largest Q-value is selected.
  • [np.newaxis, :] inserts a new dimension at the position of np.newaxis. Here state is a vector of shape (state_dim,); after adding dimension 0 it becomes an array of shape (1, state_dim).
  • The [0] after the model call extracts the first (and only) row, since only one state was fed in, so the result is the Q-values of all actions for that state.
  • np.argmax returns the index of the largest element in the array. A small shape sketch follows this list.
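A minimal shape sketch (with a made-up 4-dimensional state and toy Q-values) of the operations described above:

import numpy as np

state = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)   # shape (4,)
batched = state[np.newaxis, :]                             # shape (1, 4)
print(batched.shape)           # (1, 4)

q_values = np.array([[0.2, 1.5, -0.3, 0.7]])   # pretend model output, shape (1, action_dim)
print(np.argmax(q_values[0]))                  # 1 -- index of the largest Q-value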
3.4. replay
  • The replay function performs the value-network parameter update; it is also where most of the CUDA computation in this code happens.
def replay(self):  
	for _ in range(10):  
		# sample an experience tuple from the dataset(buffer)  
		states, actions, rewards, next_states, done = self.buffer.sample()  
		# compute the target value for the sample tuple  
		# targets [batch_size, action_dim]  
		target = self.model(states).numpy()
		# next_q_values [batch_size, action_dim]  
		next_target = self.target_model(next_states)  
		next_q_value = tf.reduce_max(next_target, axis=1)  
		target[range(args.batch_size), actions] = rewards + (1 - done) * args.gamma * next_q_value  

		# use sgd to update the network weight  
		with tf.GradientTape() as tape:  
			q_pred = self.model(states)  
			loss = tf.losses.mean_squared_error(target, q_pred)  
		grads = tape.gradient(loss, self.model.trainable_weights)  
		self.model_optim.apply_gradients(zip(grads, self.model.trainable_weights))
  • tf.reduce_max computes the maximum along the given axis: axis=0 takes the maximum of each column of the matrix, axis=1 the maximum of each row. next_q_value is a one-dimensional array whose entries are the largest Q-value attainable from each next_state.
  • The target array is initialized from the current model rather than from the target network (the reference code uses the target network here), because the loss later subtracts target from q_pred element-wise, and the entries that do not correspond to the actions actually taken (actions is a one-dimensional array storing the action of each transition) should contribute zero. The TD targets are then written into target at the positions given by actions so that target can be compared directly with q_pred; a small indexing sketch follows this list.
  • self.model(states) inside the GradientTape must not be converted with .numpy(), because a NumPy array is a constant and gradients can no longer flow through it.
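A small NumPy/TensorFlow sketch (all numbers made up, batch_size = 3, two actions) of the target construction above: tf.reduce_max picks the largest next-state Q-value per row, and the advanced indexing target[range(batch_size), actions] overwrites only the entries of the actions that were actually taken.

import numpy as np
import tensorflow as tf

batch_size = 3
gamma = 0.95
# toy Q-values of the target network for the next states
next_target = tf.constant([[0.3, 0.9],
                           [0.6, 0.1],
                           [0.2, 0.7]])
next_q_value = tf.reduce_max(next_target, axis=1)   # [0.9 0.6 0.7], row-wise maximum

rewards = np.array([1.0, 0.0, -1.0], dtype=np.float32)
done = np.array([0.0, 0.0, 1.0], dtype=np.float32)
td_target = rewards + (1 - done) * gamma * next_q_value.numpy()

# toy Q-values of the current network; only the taken-action entries are replaced
target = np.array([[0.5, 1.0],
                   [0.2, 0.8],
                   [1.5, 0.3]], dtype=np.float32)
actions = np.array([1, 0, 1])
target[range(batch_size), actions] = td_target
print(target)
# approximately [[0.5 1.855] [0.57 0.8] [1.5 -1.0]]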
3.5. test_episode
  • In the test_episode function, the model is tested several times and the results of each run are saved as a gif file.
def test_episode(self, test_episodes):  
	for episode in range(test_episodes):  
		state = self.env.reset().astype(np.float32)  
		total_reward, done = 0, False  
		frames = []  
		while not done:  
			# action = self.model(np.array([state], dtype=np.float32))[0]  
			# action = np.argmax(action) 
			action = self.choose_action(state)  
			next_state, reward, done, _ = self.env.step(action)  
			next_state = next_state.astype(np.float32)  

			total_reward += reward  
			state = next_state  
			# self.env.render()  
			frames.append(self.env.render(mode = 'rgb_array'))  
		print("Test {} | episode rewards is {}".format(episode, total_reward))  
		# save this episode as a gif
		dir_path = os.path.join('testVideo', '_'.join([ALG_NAME, ENV_ID]))  
		if not os.path.exists(dir_path):  
			os.makedirs(dir_path)  
		display_frames_as_gif(frames, dir_path + '\\' + str(episode) + ".gif")
  • How can the gym rendering of an episode be saved as a gif file?
from matplotlib import animation  
import matplotlib.pyplot as plt

# Step 1: define a function that converts the collected frames into a gif
def display_frames_as_gif(frames, path):  
	patch = plt.imshow(frames[0])  
	plt.axis('off')  

	def animate(i):  
		patch.set_data(frames[i])  

	anim = animation.FuncAnimation(plt.gcf(), animate, frames=len(frames), interval=5)  
	anim.save(path, writer='pillow', fps=30)
	
# Step 2: create a frames list to collect the rendered frames during the episode
frames = []  

# Step 3: collect frames while the episode is running
frames.append(self.env.render(mode = 'rgb_array'))  

# Step 4: after the episode ends, save the contents of frames as a gif
dir_path = os.path.join('testVideo', '_'.join([ALG_NAME, ENV_ID]))  
if not os.path.exists(dir_path):  
	os.makedirs(dir_path)  
display_frames_as_gif(frames, dir_path + '\\' + str(episode) + ".gif")
3.6. train
def train(self, train_episodes=200):  
	if args.train:  
		self.loadModel()  
		for episode in range(train_episodes):  
			total_reward, done = 0, False  
			state = self.env.reset().astype(np.float32)  
			print("开始玩游戏")  
			while not done:  
				action = self.choose_action(state)  
				next_state, reward, done, _ = self.env.step(action)  
				next_state = next_state.astype(np.float32)  
				# after a few thousand episodes the agent becomes overly cautious and refuses to land, so increase the cost of hovering in the air
				reward -= 0.1
				self.buffer.push(state, action, reward, next_state, done)  
				total_reward += reward  
				state = next_state  
				# self.env.render()  
			print("游戏结束")
			# once the replay buffer holds enough transitions
			if len(self.buffer.buffer) > args.batch_size:  
				# update the value function with gradient descent
				self.replay()  
				# update the target network once every 10 value-function updates
				if self.cnt%10 == 0:  
					self.target_update()  
				self.cnt = (self.cnt + 1) % 10  

			print('EP{} EpisodeReward={}'.format(episode, total_reward))  
			# if episode % 10 == 0:  
			#     self.test_episode() 
			if episode%100==0:  
				self.saveModel()  
		# self.saveModel()  
	if args.test:  
		self.loadModel()  
		self.test_episode(test_episodes=args.test_episodes)
  • In the original reference code the target network is updated after every value-function update, which makes it track the value network too closely; here the target network is updated only once every 10 value-function updates.
3.7. saveModel
import os
def saveModel(self):  
	path = os.path.join('model', '_'.join([ALG_NAME, ENV_ID]))  
	if not os.path.exists(path):  
		os.makedirs(path)  
	tl.files.save_weights_to_hdf5(os.path.join(path, 'model.hdf5'), self.model)  
	tl.files.save_weights_to_hdf5(os.path.join(path, 'target_model.hdf5'), self.target_model)  
	print('Saved weights.')
3.8. loadModel
def loadModel(self):  
	path = os.path.join('model', '_'.join([ALG_NAME, ENV_ID]))  
	if os.path.exists(path):  
		print('Load DQN Network parameters ...')  
		tl.files.load_hdf5_to_weights_in_order(os.path.join(path, 'model.hdf5'), self.model)  
		tl.files.load_hdf5_to_weights_in_order(os.path.join(path, 'target_model.hdf5'), self.target_model)  
		print('Load weights!')  
	else: print("No model file find, please train model first...")

4. Main program

import gym

# algorithm name
ALG_NAME = 'DQN'  
# environment name
# ENV_ID = 'CartPole-v1'  
ENV_ID = 'LunarLander-v2'

if __name__ == '__main__':  
	env = gym.make(ENV_ID)  
	agent = Agent(env)  
	agent.train(train_episodes=args.train_episodes)  
	env.close()

Training results

After 2,000 training episodes

  • After about 2,000 training episodes the agent can basically land safely, but it may not land at the target point.
    Figure 1

Ideas

  • In the train function, playing episodes and updating the value function run serially, so most of the time is spent playing the game and CUDA utilization during training is only about 4%. Could multi-threading or separate processes be used to decouple playing the game from updating the network?
  • The initial ε was 0.1. After roughly 5,000 games the spacecraft could hover stably in the air, but it was too cautious and descended very slowly. Raising ε to 0.2 increased the agent's exploration, and together with the extra cost of hovering added in the train function, the agent's overly cautious behaviour improved.

Source: blog.csdn.net/weixin_40735291/article/details/120482471