Deep Reinforcement Learning for Atari Games: Applications of DRL in Computer Vision, Machine Learning, and Other Fields

Author: Zen and the Art of Computer Programming

1. Introduction

Reinforcement Learning (RL) is one of the most active directions in machine learning. In recent years researchers have made major breakthroughs in the field, and the resulting achievements have inspired and motivated many learners and engineers. However, because reinforcement learning is complex and its algorithm space is huge, not everyone has a clear picture of how it works, what it can do, and where it can go wrong. How to better disseminate and apply knowledge about Deep Reinforcement Learning (DRL) is therefore a topic worth attention.
Taking DeepMind's 2013 Atari project as an example, this article explains how DRL is applied to Atari games, reviews the research progress of DRL at different stages and the scenarios each method suits, discusses the broad prospects of DRL in fields such as computer vision and machine learning, and offers some concrete plans and suggestions. I hope the exposition, analysis and practice here can provide readers with a useful reference.

2. Explanation of basic concepts and terms

2.1 Concept description

Reinforcement Learning (RL) is a family of machine learning methods for sequential decision-making problems. It relies on the interaction between an agent and an environment: the agent keeps exploring the environment under some policy, receives rewards as feedback, and updates the policy according to how well it performs, so that the policy improves over time. The basic idea is to build an agent that learns by itself so that a task can be completed within a limited time. A key feature of reinforcement learning is that the agent does not need complete information about the environment in advance; it only needs the observations it can perceive and the rewards it receives for its actions. Compared with other machine learning paradigms, reinforcement learning is distinctive in that it lets the agent adapt to the environment and choose the best action on its own, which is why it is often viewed as a distinct class of optimization method.
In the reinforcement learning loop, the relationship between the agent and the environment can be described by three main elements: state, action and reward. The state describes the current situation of the environment as seen by the agent; the action is the behavior the agent takes to influence the environment; the reward is the feedback the agent receives after performing an action. At each step, the agent takes an action, the environment returns a reward and the next state, and the agent then updates its action policy based on this feedback, gradually approaching the optimal solution.
The core problem of RL is learning how to maximize cumulative reward. At each step the agent tries different actions in search of higher returns. To be efficient, the agent relies on a partly random policy for trial and error early in training (exploration) and gradually shifts toward the best policy it has found so far (exploitation). During training the agent must keep exploring while also maintaining good performance under what it already knows; the two complement each other. A minimal interaction loop is sketched below.
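To make the state-action-reward loop concrete, here is a minimal interaction sketch using the classic gym API; CartPole-v0 and the random policy are only illustrative placeholders, and an Atari environment would plug into the same loop.

import gym

env = gym.make('CartPole-v0')   # illustrative environment; an Atari env follows the same loop
state = env.reset()             # initial state observed by the agent
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()                  # placeholder policy: a random action
    next_state, reward, done, info = env.step(action)   # environment returns reward and next state
    total_reward += reward                               # cumulative reward the agent tries to maximize
    state = next_state

print('Episode return:', total_reward)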

2.2 Explanation of terminology

2.2.1 DQN

DQN (Deep Q-Network) was proposed by the DeepMind team in 2013. Its core idea is to use a deep neural network as the Q-function, learning action values end-to-end directly from raw game screens rather than from hand-crafted features. A single network is shared across all actions, which keeps the number of parameters manageable, and training is stabilized with experience replay and a separate target network. DQN has become the baseline model for Atari games and achieved strong, in many cases human-level, results across a wide range of Atari 2600 titles.

2.2.2 DPG

DPG (Deterministic Policy Gradient) was proposed by the DeepMind team in 2014. Its core idea is to learn a deterministic policy that maps each state directly to an action and to update the policy parameters along the gradient of the action-value function. Because it avoids the maximization over actions required by DQN, it is particularly useful in continuous action spaces. DPG has been shown to be useful in some complex control problems, such as robot motion planning.

2.2.3 DDPG

DDPG (Deep Deterministic Policy Gradient) was proposed by the DeepMind team in 2016. It combines the deterministic policy gradient of DPG with the techniques that made DQN work at scale, using deep networks for both the actor and the critic together with experience replay and target networks. Its main purpose is to make deterministic policy learning stable in continuous control settings and to accelerate convergence. DDPG has been shown to be useful in some complex control problems, such as robot motion planning.

2.2.4 A2C

A2C (Advantage Actor-Critic) is the synchronous variant of A3C (Asynchronous Advantage Actor-Critic), proposed by the DeepMind team in 2016. The core idea is to run many copies of the environment in parallel, collect their experience, and update a shared actor-critic network with advantage-weighted policy gradients; A3C applies the workers' updates asynchronously, while A2C batches their experience into synchronous updates. The training loop consists of collecting data, updating the policy and value networks, evaluating them, and saving the model. Compared with DQN's single-step sample updates, this reduces the variance of the updates, improves performance and accelerates convergence. A2C has been shown to be useful in some complex control problems, such as robot motion planning.

2.2.5 PPO

PPO (Proximal Policy Optimization) was proposed by the OpenAI team in 2017. Its core idea is to limit how far each update can move the policy, either by clipping the probability ratio in the surrogate objective or by penalizing the KL divergence from the old policy, so as to balance exploration and exploitation. The basic training loop is close to that of the earlier actor-critic models, with an entropy bonus commonly added to keep the policy diverse and improve learning efficiency. PPO has been shown to be useful in complex control problems such as robot motion planning.

2.2.6 TRPO

TRPO (Trust Region Policy Optimization) was proposed by Schulman et al. at UC Berkeley in 2015. Its core idea is to avoid destructive policy updates by restricting each update to a trust region: the KL divergence between the new policy and the old policy is kept below a threshold, which bounds how much the policy can change at each step and prevents the search from collapsing into a poor local solution. The basic process is otherwise similar to that of other policy gradient methods. TRPO has been shown to be useful in complex control problems such as robot motion planning.

2.2.7 ACER

ACER (Actor-Critic with Experience Replay) was proposed by the DeepMind team in 2016. Its core idea is to make actor-critic learning off-policy: past experience is stored in a replay buffer and reused for training, with importance-sampling corrections keeping the off-policy updates stable. Reusing stored experience makes much better use of the data in memory and mitigates the drop in sample efficiency that purely on-policy methods suffer. ACER has been shown to be useful in some complex control problems, such as robot motion planning.

3. Explanation of core algorithm principles, specific operating steps and mathematical formulas

This section introduces the basic principles, characteristics and design ideas of DQN, DPG, DDPG, A2C, PPO, TRPO and ACER, together with their key operating steps and formulas, to help readers better understand how they work.

3.1 DQN

As introduced in Section 2.2.1, DQN approximates the Q-function with a deep neural network trained end-to-end from raw screens, stabilized by experience replay and a target network. This subsection looks at how that works for Atari games.
In Atari games, each game produces a stream of screens, and each screen is an array of RGB pixels; several consecutive screens are stacked so that the agent can perceive motion. For a given agent, the only information it receives is the image of the current screen stack and the list of actions it may perform. The goal of DQN is to learn a mapping from these pixels to actions so that the agent performs appropriate actions in the game; a minimal preprocessing sketch follows.
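As a rough sketch of how raw screens are usually turned into a network input: the original DQN work converts each frame to grayscale, downsamples it and stacks the last four frames. The helper below is only an illustrative, dependency-free approximation (the strided downsampling and the names preprocess and stack_frames are assumptions, not the paper's exact pipeline).

import numpy as np
from collections import deque

FRAME_STACK = 4  # number of consecutive frames fed to the network

def preprocess(frame):
    """Convert an RGB screen (H, W, 3) into a downsampled grayscale array in [0, 1]."""
    gray = frame.mean(axis=2)            # average the RGB channels
    small = gray[::2, ::2]               # crude 2x downsampling by striding
    return small.astype(np.float32) / 255.0

frames = deque(maxlen=FRAME_STACK)

def stack_frames(new_frame):
    """Append the newest frame and return a (FRAME_STACK, H, W) array."""
    frames.append(preprocess(new_frame))
    while len(frames) < FRAME_STACK:     # at episode start, pad by repeating the first frame
        frames.append(frames[0])
    return np.stack(frames, axis=0)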
First, DQN consists of two components: an experience pool (replay memory) and a neural network. The experience pool stores game data, including the game state (screen), the action taken, the reward, and whether the episode is over (done). The neural network is the Q-network, which can be thought of in two parts: a feature network and an action-value head. The feature network takes the current screen as input and extracts useful features; the head maps those features to a Q value for each action, and the agent then selects the action with the highest Q value.
Secondly, the algorithm framework adopted by DQN is based on Q-learning. Q-learning is a value-iteration algorithm that formulates the problem as a Markov Decision Process (MDP): the interaction between the agent and the environment is defined over the state-action space, and the goal is to find a state-action value function (Q-function) that lets the agent select the optimal action in each state. DQN follows the mathematics of Q-learning but represents the Q-function with a deep neural network and derives the policy greedily from it; the network is trained by minimizing the temporal-difference (Bellman) error on transitions sampled from the experience pool.
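Written out, a standard form of the DQN objective (with online parameters \theta, target-network parameters \theta^-, discount factor \gamma and replay memory D) is

L(\theta) = E_{(s,a,r,s') \sim D} \big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \big)^2 \big]

Minimizing this loss by stochastic gradient descent on mini-batches sampled from D is exactly the update performed in the code example of Section 4.1.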
Finally, DQN relies on several tricks to make this learning stable and efficient: the same network is shared across all actions to keep the parameter count low, a separate target network is held fixed and only synchronized periodically so that the bootstrapped targets do not chase a moving network, and mini-batches are sampled from the replay memory to break the correlation between consecutive transitions. With these techniques, DQN achieved strong results across a wide range of Atari 2600 games.

3.2 DPG

As described in Section 2.2.2, DPG learns a deterministic policy whose parameters are updated along the gradient of the action-value function, which makes it well suited to continuous action spaces.
First of all, the main difference between DPG and DQN is that the policy in DPG does not select actions by taking a maximum over Q values; instead, the policy network directly outputs a deterministic action for each state. The critic still learns an action-value function, and the actor is updated by following the critic's gradient with respect to the action, as shown below. A regularization term can additionally be used to limit how far the policy drifts between updates.
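This update rule is the deterministic policy gradient theorem: with a deterministic policy \mu_\theta(s), a critic Q_w(s, a) and state distribution \rho, the policy parameters are moved along

\nabla_\theta J(\theta) = E_{s \sim \rho} \big[ \nabla_\theta \mu_\theta(s) \, \nabla_a Q_w(s, a) \big|_{a = \mu_\theta(s)} \big]

that is, the critic's gradient with respect to the action is backpropagated through the actor.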

3.3 DDPG

As described in Section 2.2.3, DDPG combines the deterministic policy gradient of DPG with the stabilization techniques of DQN (experience replay and target networks) for continuous control.
Like DPG, the policy network in DDPG does not select actions through Q values but directly outputs a deterministic action. The main difference is that DDPG borrows from DQN: it trains the critic against targets computed with separate target networks and reuses past transitions through experience replay. Instead of copying the target networks periodically, DDPG updates them softly, moving them a small, fixed fraction toward the online networks after every gradient step, which keeps the targets slowly varying and helps convergence. A minimal sketch of this soft update is given below.
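A minimal sketch of the soft target update, assuming the online and target networks are ordinary nn.Module instances and using an illustrative value of tau:

import torch

TAU = 0.005  # illustrative soft-update coefficient

@torch.no_grad()
def soft_update(target_net, online_net, tau=TAU):
    """Move every target parameter a small step toward the corresponding online parameter."""
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.mul_(1.0 - tau)
        target_param.add_(tau * param)

# called after each gradient step, e.g. soft_update(target_actor, actor) and soft_update(target_critic, critic)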

3.4 A2C

As described in Section 2.2.4, A2C is the synchronous variant of A3C: several parallel environments collect experience, and a shared actor-critic network is updated with advantage-weighted policy gradients.
Like DQN, an A2C implementation has two main components: a rollout buffer and a neural network. The buffer stores the data collected from the environments, including the states (screens), the actions taken, the rewards, and whether each episode is over (done). The network consists of a shared feature extractor and two heads: the actor head turns the features of the current screen into a probability distribution over actions, and the critic head turns them into an estimate of the state value used as a baseline.
However, the key difference from DQN is that A2C runs multiple environment instances in parallel to improve efficiency. When collecting data, A2C gathers trajectories from several environments at once and stores them together in the rollout buffer; when updating the network, the workers' contributions are combined (asynchronously in A3C, synchronously in A2C), which improves throughput. The policy gradient is weighted by the advantage, the difference between the observed return and the critic's value estimate, which helps the model learn a better action-value signal and improves performance. Finally, actions are sampled from the policy distribution rather than chosen greedily, and an entropy bonus keeps the policy from collapsing too early, so exploration is preserved and the model stays robust.
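Written as a single objective, the A2C update combines a policy term weighted by the advantage estimate \hat{A}(s, a) = R - V_\theta(s), a value regression term and an entropy bonus (c and \beta are weighting coefficients):

L(\theta) = -\log \pi_\theta(a \mid s) \, \hat{A}(s, a) + c \, \big( R - V_\theta(s) \big)^2 - \beta \, H\big(\pi_\theta(\cdot \mid s)\big)

This is the loss that the learn() method in the code example of Section 4.2 implements.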

3.5 PPO

As described in Section 2.2.5, PPO constrains how far each update can move the policy, either with a clipped surrogate objective or with an adaptive KL penalty.
Unlike DQN, A2C, and DDPG, the KL-penalty form of PPO does not use a fixed objective: it adjusts the policy network through a dynamic KL constraint. Concretely, PPO measures the KL divergence between the new policy and the old one and weighs it in the loss with a coefficient λ. If λ is too small, the penalty is too weak and the policy can move so far in one update that training becomes unstable; if λ is too large, the penalty is too strong and the policy barely changes, so learning is slow. PPO therefore adapts λ during training to find a suitable balance.
In addition, PPO is a first-order method that reuses each batch of collected data for several epochs of mini-batch updates, which mitigates the drop in sample efficiency.
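The KL-penalty form described above maximizes the surrogate

L^{KL}(\theta) = E_t \big[ r_t(\theta) \, \hat{A}_t - \lambda \, KL\big( \pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \big) \big], \quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}

where \hat{A}_t is the advantage estimate. The more widely used clipped variant replaces the KL penalty by the term \min\big( r_t(\theta) \hat{A}_t, \; clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \big).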

3.6 TRPO

As described in Section 2.2.6, TRPO maximizes a surrogate objective while constraining the KL divergence between the new policy and the old one.
Unlike DQN, A2C, DDPG, and PPO, TRPO formulates each update as a constrained optimization problem rather than a single penalized loss. At every iteration TRPO measures the KL divergence between the current policy and the proposed new policy and requires it to stay below a hyperparameter δ, which bounds how much the policy parameters may change. If δ is too small, the constraint is too tight and the model becomes overly conservative; if δ is too large, the constraint is too loose and the update can overshoot, harming the search for a good solution. A suitable δ therefore has to balance these two effects.
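Formally, each TRPO update solves the constrained problem

\max_\theta \; E_t \Big[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} \, \hat{A}_t \Big] \quad \text{s.t.} \quad E_t \big[ KL\big( \pi_{\theta_{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \big) \big] \le \delta

where \hat{A}_t is the advantage estimate and \delta is the trust-region radius discussed above; in practice the problem is solved approximately with a linearized objective, a quadratic approximation of the KL term, and a line search.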

3.7 ACER

As described in Section 2.2.7, ACER is an off-policy actor-critic method that reuses past experience through a replay buffer with importance-sampling corrections.
Unlike DQN, A2C, DDPG, PPO, and TRPO, ACER combines an advantage actor-critic architecture with off-policy learning. The architecture consists of two parts, an actor network and a critic network: the actor receives the current screen as input and produces a probability distribution over actions, from which the action is sampled; the critic receives the same input and, together with the actor, estimates the value of each action, and the networks are optimized against these value estimates. On top of this, ACER replays stored experience with truncated importance-sampling corrections, which keeps the off-policy updates stable while making better use of the collected samples.

4. Specific code examples and explanations

The models introduced above are only outlines. A real implementation involves many details, such as the choice of hyperparameters, preparation of the data and environment, loading and saving of model parameters, logging of sample data, visualization of results, the sampling scheme used during training, and how inference is carried out. Below are a few concrete code examples for reference:

4.1 DQN

The following is a simplified DQN example written with PyTorch and the classic gym API. It uses CartPole-v0 instead of an Atari environment to keep it small, and the hyperparameter values are illustrative.

import math
import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt


# Hyperparameters (illustrative values)
GAMMA = 0.99            # discount factor
EPS_START = 1.0         # initial exploration rate
EPS_END = 0.02          # final exploration rate
EPS_DECAY = 1000        # decay speed of the exploration rate
BATCH_SIZE = 32         # number of transitions per training step
C = 4                   # train every C environment steps
TRAIN_FREQ = 1          # gradient updates per training step
TARGET_UPDATE = 10      # sync the target network every N episodes
MAX_STEPS = 100000      # safety cap on the total number of environment steps
REPLAY_CAPACITY = 10000 # maximum size of the replay memory


class DQN(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(DQN, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_inputs, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_outputs)
        )

    def forward(self, x):
        return self.fc(x)


env = gym.make('CartPole-v0')

# set up the online model, optimizer and loss
model = DQN(env.observation_space.shape[0], env.action_space.n)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)
model = model.to(device)

replay_buffer = deque(maxlen=REPLAY_CAPACITY)
steps_done = 0


def train(model, optimizer, loss_fn, batch):
    """Perform one gradient step on a mini-batch of transitions."""
    states, actions, next_states, rewards, dones = zip(*batch)
    states = torch.FloatTensor(np.float32(states)).to(device)
    next_states = torch.FloatTensor(np.float32(next_states)).to(device)
    actions = torch.LongTensor(actions).to(device)
    rewards = torch.FloatTensor(rewards).to(device)
    dones = torch.FloatTensor(dones).to(device)

    # Q(s', a') from the target network, with a' chosen by the online network
    q_values_next = model(next_states)
    _, actions_next = q_values_next.max(dim=1)
    q_values_next_target = target_net(next_states)
    q_value_next_target = q_values_next_target.gather(
        1, actions_next.unsqueeze(1)).squeeze(-1)

    # TD target: y = r + gamma * Q_target(s', a') * (1 - done)
    target = rewards + GAMMA * q_value_next_target * (1 - dones)

    # predicted Q values for the actions actually taken
    q_values = model(states)
    q_value = q_values.gather(1, actions.unsqueeze(1)).squeeze(-1)

    # loss between the predicted Q value and the TD target (target is not backpropagated)
    loss = loss_fn(q_value, target.detach())

    # optimize the model
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def run_episode():
    global steps_done
    episode_rewards = []
    state = env.reset()
    while True:
        # select an action with an epsilon-greedy strategy
        eps_threshold = EPS_END + (EPS_START - EPS_END) * \
            math.exp(-1. * steps_done / EPS_DECAY)
        if random.random() < eps_threshold:
            action = env.action_space.sample()
        else:
            state_t = torch.FloatTensor(np.float32(state)).unsqueeze(0).to(device)
            q_values = model(state_t)
            _, action = q_values.max(1)
            action = int(action.item())

        # perform the selected action in the environment
        next_state, reward, done, _ = env.step(action)

        # store the transition in the replay buffer
        replay_buffer.append((state, action, next_state, reward, float(done)))

        # update the state and step count
        state = next_state
        steps_done += 1

        # train the model every C steps once the buffer is large enough
        if len(replay_buffer) > BATCH_SIZE and steps_done % C == 0:
            for _ in range(TRAIN_FREQ):
                batch = random.sample(replay_buffer, BATCH_SIZE)
                train(model, optimizer, loss_fn, batch)

        episode_rewards.append(reward)
        if done or steps_done >= MAX_STEPS:
            break

    return sum(episode_rewards)


if __name__ == '__main__':
    NUM_EPISODES = 200
    REWARDS = []

    # initialize the target network with the same parameters as the online network
    target_net = DQN(env.observation_space.shape[0], env.action_space.n).to(device)
    target_net.load_state_dict(model.state_dict())

    for ep in range(NUM_EPISODES):
        rewards = run_episode()
        REWARDS.append(rewards)

        print('[Episode {}/{}] Reward {}'.format(ep + 1, NUM_EPISODES, rewards))

        # periodically sync the target network with the online network
        if ep % TARGET_UPDATE == 0:
            target_net.load_state_dict(model.state_dict())

    # plot the total reward per episode
    plt.plot(REWARDS)
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.show()

4.2 A2C

The following is a simplified A2C skeleton written with PyTorch. It shows the actor-critic network, a single agent with a sample-and-learn interface, and a rollout storage class; the surrounding environment loop is omitted, and HIDDEN_SIZE is an illustrative value.

import os
import time
import math

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
from tensorboardX import SummaryWriter  # kept from the original skeleton; used by the omitted training loop

HIDDEN_SIZE = 128  # illustrative hidden-layer width


class ActorCritic(nn.Module):
    """
    Actor-critic network used by the A2C algorithm. It takes the size of the
    observation space as input and outputs a probability distribution over the
    n_actions possible actions (actor) and a scalar state-value estimate (critic).
    """

    def __init__(self, observation_size, hidden_size, n_actions):
        super().__init__()
        self.hidden_size = hidden_size
        self.actor = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, n_actions)
        )
        self.critic = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1)
        )

    def forward(self, x):
        """
        Forward pass through both the actor and the critic. Returns a tuple of
        the action probabilities and the state-value estimate.
        """
        probs = F.softmax(self.actor(x), dim=-1)
        value = self.critic(x)
        return probs, value


class A2CAgent:
    """A single agent that interacts with the environment."""

    def __init__(self, name, obs_size, act_size, gamma, lr, entropy_coef,
                 max_steps, log_interval, seed):
        self.name = name
        self.obs_size = obs_size
        self.act_size = act_size
        self.gamma = gamma
        self.lr = lr
        self.entropy_coef = entropy_coef
        self.max_steps = max_steps
        self.log_interval = log_interval
        self.seed = seed
        self.training_mode = False

    def init_model(self, device='cpu'):
        """Initialize the actor-critic network and its optimizer (call this before use)."""
        self.device = device
        self.model = ActorCritic(self.obs_size, HIDDEN_SIZE, self.act_size).to(self.device)
        self.optimizer = optim.Adam(self.model.parameters(), lr=self.lr)

    def choose_action(self, obs, training=True):
        """Sample an action from the current policy given an observation."""
        self.training_mode = training
        self.model.train(training)
        with torch.no_grad():
            obs = torch.tensor(obs, dtype=torch.float32).unsqueeze(0).to(self.device)
            probs, _ = self.model(obs)
            dist = Categorical(probs)
            action = dist.sample().item()
        return action

    def learn(self, batch):
        """Update the actor-critic network from one mini-batch.

        `batch` is a tuple (observations, actions, returns, advantages), such as
        the first four elements yielded by RolloutStorage.feed_forward_generator.
        """
        obs_batch, act_batch, ret_batch, adv_batch = [x.to(self.device) for x in batch]

        probs, vals = self.model(obs_batch)
        vals = vals.view(-1)
        dist = Categorical(probs)
        log_probs = dist.log_prob(act_batch.view(-1))

        advantages = adv_batch.view(-1)
        returns = ret_batch.view(-1)

        # policy gradient weighted by the advantage, value regression, and entropy bonus
        pol_loss = -(advantages.detach() * log_probs).mean()
        val_loss = F.mse_loss(vals, returns)
        entropy_loss = -dist.entropy().mean()

        loss = pol_loss + val_loss + self.entropy_coef * entropy_loss

        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(self.model.parameters(), max_norm=0.5)
        self.optimizer.step()

        return {'pol_loss': pol_loss.item(),
                'val_loss': val_loss.item(),
                'entropy_loss': entropy_loss.item()}


class RolloutStorage:
    """Stores rollout data until it can be used to update a model."""

    def __init__(self, num_steps, num_processes, obs_size, act_size, gamma=0.99):
        self.observations = torch.zeros(num_steps + 1, num_processes, obs_size)
        self.actions = torch.zeros(num_steps, num_processes, 1).long()
        self.rewards = torch.zeros(num_steps, num_processes, 1)
        self.values = torch.zeros(num_steps, num_processes, 1)
        self.old_log_probs = torch.zeros(num_steps, num_processes, 1)
        self.returns = torch.zeros(num_steps + 1, num_processes, 1)
        self.masks = torch.ones(num_steps + 1, num_processes, 1)
        self.index = 0
        self.num_steps = num_steps
        self.num_processes = num_processes
        self.obs_size = obs_size
        self.act_size = act_size
        self.gamma = gamma

    def insert(self, current_obs, action, reward, mask, value, log_prob):
        """Insert one step of experience into the storage buffer."""
        self.observations[self.index + 1].copy_(current_obs)
        self.actions[self.index].copy_(action)
        self.rewards[self.index].copy_(reward)
        self.masks[self.index + 1].copy_(mask)
        self.values[self.index].copy_(value)
        self.old_log_probs[self.index].copy_(log_prob)
        self.index = (self.index + 1) % self.num_steps

    def compute_returns(self, last_value=0.0):
        """Compute discounted returns backwards from the rewards, bootstrapped with last_value."""
        self.returns[-1] = last_value
        for t in reversed(range(self.rewards.size(0))):
            self.returns[t] = self.rewards[t] + self.gamma * self.returns[t + 1] * self.masks[t + 1]

    def compute_advantages(self):
        """Compute normalized advantage estimates from the stored returns and values."""
        advs = self.returns[:-1] - self.values
        advs = (advs - advs.mean()) / (advs.std() + 1e-8)
        return advs

    def feed_forward_generator(self, advantages, mini_batch_size):
        """Generate mini-batches of data from the stored rollout."""
        batch_size = self.num_processes * self.num_steps
        assert batch_size >= mini_batch_size, \
            "Batch size should be greater than or equal to sample size"
        indices = torch.randperm(batch_size).tolist()
        for start_idx in range(0, batch_size, mini_batch_size):
            end_idx = start_idx + mini_batch_size
            sampled_indices = indices[start_idx:end_idx]
            yield self._get_samples(sampled_indices, advantages)

    def _get_samples(self, indices, advantages):
        """Retrieve samples at the specified flat indices."""
        obs_batch = self.observations[:-1].view(-1, self.obs_size)[indices]
        act_batch = self.actions.view(-1, 1)[indices]
        ret_batch = self.returns[:-1].view(-1, 1)[indices]
        adv_batch = advantages.view(-1, 1)[indices]
        old_v_batch = self.values.view(-1, 1)[indices]
        old_p_batch = self.old_log_probs.view(-1, 1)[indices]
        return obs_batch, act_batch, ret_batch, adv_batch, old_v_batch, old_p_batch

    def after_update(self):
        """Clear the buffers once the model has been updated with this rollout."""
        self.observations.zero_()
        self.actions.zero_()
        self.rewards.zero_()
        self.masks.fill_(1)
        self.index = 0


# Typical flow after one rollout has been collected (environment loop omitted):
#   storage.compute_returns(last_value)
#   advantages = storage.compute_advantages()
#   for obs, act, ret, adv, old_v, old_p in storage.feed_forward_generator(advantages, 64):
#       agent.learn((obs, act, ret, adv))
#   storage.after_update()
