Table of contents
1. The main components of reinforcement learning
2. Python-based reinforcement learning frameworks
3. gym
4. DQN Algorithm
5. Use PyTorch to implement the DQN algorithm
1. The main components of reinforcement learning
Reinforcement learning consists of two main components: the agent and the environment (env). During training, the agent continuously interacts with the environment: after observing a state from the environment, the agent uses that state to choose an action. The action is then executed in the environment, which returns the next state and the reward for the current action. The agent's goal is to collect as much cumulative reward from the environment as possible.
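A schematic sketch of this loop in Python (the `agent.learn` hook here is hypothetical shorthand for whatever update rule the agent uses; the concrete gym/PyTorch version appears later in this post):

    state = env.reset()                     # environment produces the initial state
    while True:
        action = agent.take_action(state)   # agent chooses an action from the state
        next_state, reward, done, info = env.step(action)   # environment reacts
        agent.learn(state, action, reward, next_state, done)  # hypothetical update hook
        state = next_state                  # move on to the next state
        if done:                            # the episode has ended
            break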
2. Python-based reinforcement learning frameworks
There are many Python-based reinforcement learning frameworks; for details, see this blogger's post: [Reinforcement Learning/gym] (2) Some frameworks or codes for reinforcement learning (o0o_-_的博客-CSDN博客). The framework I used this time is PyTorch: the DQN algorithm includes a neural network component, which PyTorch makes convenient to implement, so that is what I chose.
3. gym
gym defines a standard interface for describing environments in reinforcement learning, and its official library ships with a number of ready-made environments.
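A minimal example of that interface with the classic CartPole environment, assuming the pre-0.26 gym API used throughout this post (newer gym/gymnasium versions return `(obs, info)` from `reset()` and five values from `step()`):

    import gym

    env = gym.make("CartPole-v0")
    state = env.reset()                     # initial observation
    for _ in range(10):
        action = env.action_space.sample()  # a random action, just to exercise the API
        next_state, reward, done, info = env.step(action)
        state = next_state
        if done:
            state = env.reset()             # start a new episode when this one ends
    env.close()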
4. DQN Algorithm
Traditional reinforcement learning algorithms use a Q-table to store the state-value or action-value function. In practical applications, however, the environment may have very many states, or even infinitely many, and in that case storing the value function in a discrete Q-table becomes unreasonable. The DQN (Deep Q-Network) algorithm therefore uses a neural network to fit the action-value function.
Usually, DQN can only handle discrete actions (with continuous states). A neural network is used to fit the action-value function, and then, with the state fixed, the action a with the largest Q value is selected.
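In symbols, if $Q(s, a; \theta)$ denotes the network with parameters $\theta$, the action chosen in state $s$ is

$$a^* = \arg\max_{a} Q(s, a; \theta)$$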
The DQN algorithm has two characteristics:
1. Experience replay
Each transition is put into a replay buffer, so one sample can be reused many times; during training, a batch of samples is drawn at random from the buffer for each update.
2. Target network
The update target of the DQN algorithm is itself an approximation built from Q values. If both the predicted Q value and the target Q value are computed with the same network, the target keeps shifting as that network is updated, which easily destabilizes training. DQN therefore uses a target network: during training, the target Q value is computed by the target network, whose parameters are periodically synchronized with those of the training network.
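Concretely, writing $\theta^-$ for the periodically synchronized target-network parameters, the target value for a transition $(s, a, r, s', d)$ is

$$y = r + \gamma \, (1 - d) \, \max_{a'} Q(s', a'; \theta^-)$$

where $d$ is 1 for a terminal transition and 0 otherwise; the loss is the mean squared error between $Q(s, a; \theta)$ and $y$. This is exactly what the Agent's update method below computes.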
5. Use PyTorch to implement the DQN algorithm
import time
import random
import torch
from torch import nn
from torch import optim
import gym
import numpy as np
import matplotlib.pyplot as plt
from collections import deque, namedtuple  # queue / named-tuple container types
from tqdm import tqdm  # for drawing progress bars

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Transition = namedtuple('Transition', ('state', 'action', 'reward', 'next_state', 'done'))
1. Replay memory
class ReplayMemory(object):
    def __init__(self, memory_size):
        self.memory = deque([], maxlen=memory_size)  # old transitions are dropped once full

    def sample(self, batch_size):
        # draw a random batch and regroup it field by field
        batch_data = random.sample(self.memory, batch_size)
        state, action, reward, next_state, done = zip(*batch_data)
        return state, action, reward, next_state, done

    def push(self, *args):
        # *args packs all positional arguments into a tuple, e.g.
        # self.push(1, 2, 3, 4, 5)  ->  args = (1, 2, 3, 4, 5)
        self.memory.append(Transition(*args))

    def __len__(self):
        return len(self.memory)
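A quick usage sketch for this buffer (the transition values are made up for illustration):

    # hypothetical usage: push a few fake CartPole-style transitions, then sample
    memory = ReplayMemory(memory_size=100)
    for i in range(10):
        memory.push([0.0, 0.1, 0.2, 0.3], 1, 1.0, [0.1, 0.2, 0.3, 0.4], False)
    states, actions, rewards, next_states, dones = memory.sample(batch_size=4)
    print(len(states))  # 4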
2. Neural network part
class Qnet(nn.Module):
    def __init__(self, n_observations, n_actions):
        super(Qnet, self).__init__()
        # a small MLP: state in, one Q value per action out
        self.model = nn.Sequential(
            nn.Linear(n_observations, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions)
        )

    def forward(self, state):
        return self.model(state)
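For CartPole, which has 4-dimensional observations and 2 actions, a batch of 64 states therefore maps to a (64, 2) tensor of Q values. A small shape check (the dimensions are the ones used later in this post):

    # sanity check: batch of states in, one Q value per action out
    net = Qnet(n_observations=4, n_actions=2)
    dummy_states = torch.zeros(64, 4)
    print(net(dummy_states).shape)  # torch.Size([64, 2])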
3. Agent
class Agent(object):
    def __init__(self, observation_dim, action_dim, gamma, lr, epsilon, target_update):
        self.action_dim = action_dim
        self.q_net = Qnet(observation_dim, action_dim).to(device)
        self.target_q_net = Qnet(observation_dim, action_dim).to(device)
        self.gamma = gamma
        self.lr = lr
        self.epsilon = epsilon
        self.target_update = target_update  # sync the target net every this many updates
        self.count = 0
        self.optimizer = optim.Adam(params=self.q_net.parameters(), lr=lr)
        self.loss = nn.MSELoss()

    def take_action(self, state):
        # epsilon-greedy: exploit with probability 1 - epsilon, otherwise explore
        if np.random.uniform(0, 1) < 1 - self.epsilon:
            state = torch.tensor(state, dtype=torch.float).to(device)
            action = torch.argmax(self.q_net(state)).item()
        else:
            action = np.random.choice(self.action_dim)
        return action

    def update(self, transition_dict):
        states = transition_dict.state
        actions = np.expand_dims(transition_dict.action, axis=-1)  # expand dims to (batch, 1)
        rewards = np.expand_dims(transition_dict.reward, axis=-1)  # expand dims to (batch, 1)
        next_states = transition_dict.next_state
        dones = np.expand_dims(transition_dict.done, axis=-1)      # expand dims to (batch, 1)
        states = torch.tensor(states, dtype=torch.float).to(device)
        actions = torch.tensor(actions, dtype=torch.int64).to(device)
        rewards = torch.tensor(rewards, dtype=torch.float).to(device)
        next_states = torch.tensor(next_states, dtype=torch.float).to(device)
        dones = torch.tensor(dones, dtype=torch.float).to(device)
        # update q_values
        # gather(1, actions) indexes along dim=1 (within each row), with index=actions:
        # actions=[[1, 2], [0, 1]] selects [[row 1 element 2, row 1 element 3],
        #                                   [row 2 element 1, row 2 element 2]]
        # Conversely, gather(0, actions) indexes along dim=0 (within each column):
        # actions=[[1, 2], [0, 1]] selects [[column 1 element 2, column 2 element 3],
        #                                   [column 1 element 1, column 2 element 2]]
        # states.shape is (64, 4) and actions.shape is (64, 1); each row is one sample,
        # so dim=1 is the right choice here
        predict_q_values = self.q_net(states).gather(1, actions)
        with torch.no_grad():
            # max(1), i.e. max(dim=1), takes the max along each row, giving shape (64,);
            # view(-1, 1) then expands it back to (64, 1)
            max_next_q_values = self.target_q_net(next_states).max(1)[0].view(-1, 1)
            q_targets = rewards + self.gamma * max_next_q_values * (1 - dones)
        l = self.loss(predict_q_values, q_targets)
        self.optimizer.zero_grad()
        l.backward()
        self.optimizer.step()
        if self.count % self.target_update == 0:
            # copy model parameters into the target network
            self.target_q_net.load_state_dict(self.q_net.state_dict())
        self.count += 1
4. Model training function
def run_episode(env, agent, replaymemory, batch_size):
    state = env.reset()
    reward_total = 0
    while True:
        action = agent.take_action(state)
        next_state, reward, done, _ = env.step(action)
        replaymemory.push(state, action, reward, next_state, done)
        reward_total += reward
        if len(replaymemory) > batch_size:
            state_batch, action_batch, reward_batch, next_state_batch, done_batch = replaymemory.sample(batch_size)
            T_data = Transition(state_batch, action_batch, reward_batch, next_state_batch, done_batch)
            agent.update(T_data)
        state = next_state
        if done:
            break
    return reward_total
def episode_evaluate(env, agent, render):
    # average the return over 5 evaluation episodes
    reward_list = []
    for i in range(5):
        state = env.reset()
        reward_episode = 0
        while True:
            action = agent.take_action(state)
            next_state, reward, done, _ = env.step(action)
            reward_episode += reward
            state = next_state
            if done:
                break
            if render:
                env.render()
        reward_list.append(reward_episode)
    return np.mean(reward_list).item()
def test(env, agent, delay_time):
    # run one episode with rendering so the result can be watched
    state = env.reset()
    reward_episode = 0
    while True:
        action = agent.take_action(state)
        next_state, reward, done, _ = env.step(action)
        reward_episode += reward
        state = next_state
        if done:
            break
        env.render()
        time.sleep(delay_time)
5. Training the model
The environment used for training is the classic CartPole game provided by gym (for details see Cart Pole - Gym Documentation (gymlibrary.dev)). An episode of the cart ends when any of three conditions is met:
(1) The pole's angle exceeds ±12 degrees
(2) The cart's position exceeds ±2.4 (the center of the cart reaches the edge of the display)
(3) The number of steps exceeds 200 (500 in v1)
Every step the cart survives earns a reward of +1, so in the v0 environment the maximum return for one episode is 200.
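As a quick check, gym exposes this step limit via the environment spec (assuming a gym version where `spec.max_episode_steps` is populated):

    import gym

    env = gym.make("CartPole-v0")
    print(env.spec.max_episode_steps)  # 200 for v0; CartPole-v1 reports 500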
if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    env_name = "CartPole-v0"
    observation_n, action_n = env.observation_space.shape[0], env.action_space.n
    agent = Agent(observation_n, action_n, gamma=0.98, lr=2e-3, epsilon=0.01, target_update=10)
    replaymemory = ReplayMemory(memory_size=10000)
    batch_size = 64
    num_episodes = 200
    reward_list = []
    # show 10 progress bars, one per training iteration
    for i in range(10):
        with tqdm(total=int(num_episodes / 10), desc="Iteration %d" % i) as pbar:
            for episode in range(int(num_episodes / 10)):
                reward_episode = run_episode(env, agent, replaymemory, batch_size)
                reward_list.append(reward_episode)
                if (episode + 1) % 10 == 0:
                    test_reward = episode_evaluate(env, agent, False)
                    pbar.set_postfix({
                        'episode': '%d' % (num_episodes / 10 * i + episode + 1),
                        'return': '%.3f' % test_reward
                    })
                pbar.update(1)  # advance the progress bar
    test(env, agent, 0.5)  # finally, watch the trained agent as an animation
    episodes_list = list(range(len(reward_list)))
    plt.plot(episodes_list, reward_list)
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('DQN on {}'.format(env_name))
    plt.show()
The training results are shown in the figure: