Hands on RL: Deep Deterministic Policy Gradient (DDPG)

1. Theoretical part

1.1 Review of Deterministic Policy Gradient (DPG)

Before introducing DDPG, let us first review the most important conclusion of DPG.

The Deterministic Policy Gradient Theorem:

$$
\begin{aligned}
\nabla_\theta J(\mu_\theta) &= \int_{\mathcal{S}} \rho^\mu(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)}\, \mathrm{d}s \\
&= \mathbb{E}_{s\sim\rho^\mu}\Big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)} \Big]
\end{aligned}
$$

Here $a=\mu_\theta(s)$ denotes a deterministic policy, i.e. a mapping from the state space to the action space, $\mu_\theta: \mathcal{S}\to\mathcal{A}$, whose network parameters are $\theta$; $s\sim\rho^\mu$ means the state $s$ follows the state visitation distribution induced by the policy $\mu$. The derivation is not elaborated here (see Deterministic Policy Gradient Algorithms).
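In practice this gradient is rarely computed by hand: with an automatic-differentiation framework, the chain rule $\nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)$ falls out of backpropagating the critic's output through the actor. A minimal PyTorch sketch of the idea (the names `actor` and `critic` are placeholders, not from the original text):

```python
import torch

def dpg_actor_loss(actor, critic, states):
    # a = mu_theta(s); keeping the computation graph lets gradients flow from Q back into theta
    actions = actor(states)
    # Maximizing E[Q(s, mu_theta(s))] is implemented by minimizing its negative;
    # autograd applies grad_theta mu_theta(s) * grad_a Q(s, a) automatically.
    return -critic(states, actions).mean()
```

This is exactly the actor loss used in the code of Section 2.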

Next, we will introduce the improvements of DDPG compared to DPG point by point.

1.2 Neural Network Difference

DDPG is also very different in network structure compared to the traditional AC algorithm. First, let’s look at the network structure of the traditional algorithm.

[Figure: network structure of the traditional Actor-Critic algorithm]

Then look at the network structure of DDPG

[Figure: network structure of DDPG]

Why does DDPG use this network structure? Because the actor in DDPG outputs a deterministic action rather than a probability distribution over actions, and the action space is continuous, so it can be regarded as containing infinitely many actions. With the critic structure of a standard AC algorithm, we cannot obtain the Q-value of a specific action by enumerating all actions. Therefore, in DDPG the actor's output is fed into the critic together with the state, and the critic directly outputs the Q-value of the action actually taken, $a=\mu(s_t)$.
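A stripped-down sketch of such a critic (the same idea as the `QValueNet` in the code of Section 2; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): the action enters as an input instead of indexing an output head."""
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action):
        # concatenate state and action, then output a single scalar Q-value
        return self.net(torch.cat([state, action], dim=-1))
```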

1.3 Why is DDPG off-policy?

First of all, why is DDPG (or DPG) off-policy? Recall the Q-value under a stochastic policy $\pi_\theta(a|s)$:
$$
Q^\pi(s_t,a_t) = \mathbb{E}_{r_t, s_{t+1}\sim E}\Big[ r(s_t,a_t) + \gamma\, \mathbb{E}_{a_{t+1}\sim\pi}\big[ Q^\pi(s_{t+1}, a_{t+1}) \big] \Big]
$$
Here $E$ denotes the environment, i.e. $s\sim E$ means the state follows the environment's own dynamics. When we use a deterministic policy $a=\mu_\theta(s)$, the inner expectation disappears:
$$
Q^\mu(s_t,a_t) = \mathbb{E}_{r_t, s_{t+1}\sim E}\Big[ r(s_t,a_t) + \gamma\, Q^\mu\big(s_{t+1}, \mu(s_{t+1})\big) \Big]
$$
This means the Q-value no longer depends on the action distribution of the policy, i.e. there is no $a_{t+1}\sim\pi$ anymore. We can therefore evaluate it using transitions produced by a behavior policy $\beta$, which makes off-policy learning possible.

Since the Q-value no longer depends on the policy's action distribution, the deterministic policy gradient can be written under the state visitation distribution of a behavior policy $\beta$:
$$
\nabla_\theta J(\mu_\theta) \approx \mathbb{E}_{s\sim\rho^\beta}\Big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a=\mu_\theta(s)} \Big]
$$
Taking the expectation with respect to the states generated by the behavior policy $\beta$ is precisely the off-policy form.
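Correspondingly, the critic is also trained off-policy: it regresses onto a bootstrapped target built from transitions $(s_t, a_t, r_t, s_{t+1})$ generated by $\beta$ (this is the loss minimized in the code of Section 2, with the target networks of Section 1.4 substituted into $y_t$):
$$
L(w) = \mathbb{E}_{s_t\sim\rho^\beta,\, a_t\sim\beta}\Big[ \big( Q^w(s_t,a_t) - y_t \big)^2 \Big], \qquad y_t = r(s_t,a_t) + \gamma\, Q^w\big(s_{t+1}, \mu_\theta(s_{t+1})\big)
$$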

1.4 Soft target update

DDPG maintains four neural networks: the policy (actor) network, the target policy network, the action-value (critic) network, and the target action-value network. It borrows from DQN the idea of separating the target network from the training network, and uses soft updates to keep training stable. The soft update rule is
$$
\theta^- \leftarrow \tau\theta + (1-\tau)\theta^-
$$
where $\theta^-$ denotes the target network parameters, $\theta$ the training network parameters, and $\tau \ll 1$ is the soft update coefficient.
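In PyTorch this is a short loop over parameter pairs; a sketch equivalent to the `soft_update` method in the code of Section 2:

```python
def soft_update(net, target_net, tau=0.005):
    # target <- tau * online + (1 - tau) * target, applied parameter by parameter
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.copy_(tau * p.data + (1.0 - tau) * p_target.data)
```

With $\tau = 0.005$ (the value used in the script below), the target networks track the training networks with an effective time constant of roughly $1/\tau = 200$ updates.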

1.5 Maintain Exploration

Deterministic policies have no exploration of their own. To keep the policy exploratory, we can add Gaussian noise to the output of the policy network so that the output action deviates slightly, which increases exploration. Written mathematically,
$$
\mu'(s_t) = \mu_\theta(s_t) + \mathcal{N}
$$
where $\mu'$ denotes the exploration policy and $\mathcal{N}$ denotes Gaussian noise.
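A small sketch of this exploration policy (the `sigma` value and the clipping to the action bound are illustrative; the training code below adds the noise but does not clip):

```python
import numpy as np

def noisy_action(action, sigma=0.01, low=-2.0, high=2.0):
    # mu'(s) = mu(s) + N, with N ~ Normal(0, sigma^2); clip to stay inside the action bound
    noisy = action + sigma * np.random.randn(*np.shape(action))
    return np.clip(noisy, low, high)
```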

1.6 Other Techniques

DDPG also integrates common techniques from other algorithms, such as a replay buffer to produce approximately independent and identically distributed samples, and Batch Normalization to preprocess data.

1.7 Pseudocode

The pseudo code is as follows

[Figure: DDPG pseudocode]

2. Code practice

The environment used in this experiment is Pendulum-v1 from gym, a typical continuous-action-space environment with deterministic dynamics. Its observation and action spaces can be checked with the short snippet below, after which the overall code is given.
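A quick sanity check of the environment (assuming the gym >= 0.26 reset/step API, which the training loop below also uses):

```python
import gym

env = gym.make('Pendulum-v1')
print(env.observation_space.shape)  # (3,)  -> [cos(theta), sin(theta), theta_dot]
print(env.action_space.shape)       # (1,)  -> torque
print(env.action_space.high)        # [2.]  -> used as action_bound below
env.close()
```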

import torch
import torch.nn as nn
import torch.nn.functional as F
import gym
import random
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
import collections

# Policy Network
class PolicyNet(nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound):
        super(PolicyNet, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)
        self.action_bound = action_bound
    
    def forward(self, observation):
        x = F.relu(self.fc1(observation))
        x = F.tanh(self.fc2(x))
        return x * self.action_bound

# Q Value Network
class QValueNet(nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QValueNet, self).__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc_out = nn.Linear(hidden_dim, 1)
    
    def forward(self, x, a):
        cat = torch.cat([x, a], dim=1)    # concatenate state and action
        x = F.relu(self.fc1(cat))
        x = F.relu(self.fc2(x))
        return self.fc_out(x)

# Deep Deterministic Policy Gradient
class DDPG():
    def __init__(self, state_dim, hidden_dim, action_dim, 
                action_bound, actor_lr, critic_lr, 
                sigma, tau, gamma, device):
        self.actor = PolicyNet(state_dim, hidden_dim, action_dim, action_bound).to(device)
        self.critic = QValueNet(state_dim, hidden_dim, action_dim).to(device)
        self.target_actor = PolicyNet(state_dim, hidden_dim, action_dim, action_bound).to(device)
        self.target_critic = QValueNet(state_dim, hidden_dim, action_dim).to(device)

        # initialize target actor network with same parameters
        self.target_actor.load_state_dict(self.actor.state_dict())
        # initialize target critic network with same parameters
        self.target_critic.load_state_dict(self.critic.state_dict())

        self.actor_optimizer = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.critic_optimizer = torch.optim.Adam(self.critic.parameters(), lr=critic_lr)
        self.gamma = gamma
        self.sigma = sigma  # standard deviation of the Gaussian noise; the mean is set to 0
        self.action_dim = action_dim
        self.device = device
        self.tau = tau
    
    def take_action(self, state):
        state = torch.tensor(np.array([state]), dtype=torch.float).to(self.device)
        action = self.actor(state).item()
        # add noise to increase exploratory
        action = action + self.sigma * np.random.randn(self.action_dim)
        return action
    
    def soft_update(self, net, target_net):
        # implement soft update rule
        for param_target, param in zip(target_net.parameters(), net.parameters()):
            param_target.data.copy_(param_target.data * (1.0-self.tau) + param.data * self.tau)
    
    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1,1).to(self.device)
        actions = torch.tensor(transition_dict['actions'], dtype=torch.float).view(-1,1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1,1).to(self.device)

        next_q_values = self.target_critic(next_states, self.target_actor(next_states))
        td_targets = rewards + self.gamma * next_q_values * (1-dones)
        critic_loss = torch.mean(F.mse_loss(self.critic(states, actions), td_targets))

        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        actor_loss = torch.mean( - self.critic(states, self.actor(states)))
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # soft update actor net and critic net
        self.soft_update(self.actor, self.target_actor)
        self.soft_update(self.critic, self.target_critic)
    

class ReplayBuffer():
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)
    
    def add(self, s, a, r, s_, d):
        self.buffer.append((s,a,r,s_,d))
    
    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)
        return np.array(states), actions, np.array(rewards), np.array(next_states), dones

    def size(self):
        return len(self.buffer)


def train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size, render, seed_number):
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes/10), desc='Iteration %d'%(i+1)) as pbar:
            for i_episode in range(int(num_episodes/10)):
                observation, _ = env.reset(seed=seed_number)
                done = False
                episode_return = 0

                while not done:
                    if render:
                        env.render()
                    action = agent.take_action(observation)
                    observation_, reward, terminated, truncated, _ = env.step(action)
                    done = terminated or truncated
                    replay_buffer.add(observation, action, reward, observation_, done)
                    # swap states
                    observation = observation_
                    episode_return += reward
                    if replay_buffer.size() > minimal_size:
                        b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                        transition_dict = {
                            'states': b_s,
                            'actions': b_a,
                            'rewards': b_r,
                            'next_states': b_ns,
                            'dones': b_d
                        }
                        agent.update(transition_dict)
                return_list.append(episode_return)
                if (i_episode+1) % 10 == 0:
                    pbar.set_postfix({
                        'episode': '%d'%(num_episodes/10 * i + i_episode + 1),
                        'return': "%.3f"%(np.mean(return_list[-10:]))
                    })
                pbar.update(1)
    env.close()
    return return_list

def moving_average(a, window_size):
    cumulative_sum = np.cumsum(np.insert(a, 0, 0)) 
    middle = (cumulative_sum[window_size:] - cumulative_sum[:-window_size]) / window_size
    r = np.arange(1, window_size-1, 2)
    begin = np.cumsum(a[:window_size-1])[::2] / r
    end = (np.cumsum(a[:-window_size:-1])[::2] / r)[::-1]
    return np.concatenate((begin, middle, end))

def plot_curve(return_list, mv_return, algorithm_name, env_name):
    episodes_list = list(range(len(return_list)))
    plt.plot(episodes_list, return_list, c='gray', alpha=0.6)
    plt.plot(episodes_list, mv_return)
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('{} on {}'.format(algorithm_name, env_name))
    plt.show()



if __name__ == "__main__":
    # reproducible
    seed_number = 0
    random.seed(seed_number)
    np.random.seed(seed_number)
    torch.manual_seed(seed_number)

    num_episodes = 250     # episodes length
    hidden_dim = 128        # hidden layers dimension
    gamma = 0.98            # discounted rate
    actor_lr = 1e-3         # lr of actor
    critic_lr = 1e-3        # lr of critic
    tau = 0.005             # soft update parameter
    sigma = 0.01            # std variance of guassian noise
    buffer_size = 10000
    minimal_size = 1000
    batch_size = 64

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    env_name = 'Pendulum-v1'

    render = False
    if render:
        env = gym.make(id=env_name, render_mode='human')
    else:
        env = gym.make(id=env_name)
                    
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]  
    action_bound = env.action_space.high[0]


    replay_buffer = ReplayBuffer(buffer_size)        
    agent = DDPG(state_dim, hidden_dim, action_dim, action_bound, actor_lr, critic_lr, sigma, tau, gamma, device)
    return_list = train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size, render, seed_number)

    mv_return = moving_average(return_list, 9)
    plot_curve(return_list, mv_return, 'DDPG', env_name)

The return curve of DDPG training is shown in the figure

[Figure: DDPG training return curve on Pendulum-v1]

Reference

Tutorial: Hands on RL

Paper: Continuous control with deep reinforcement learning
