[Deep Reinforcement Learning] 8. DDPG algorithm and some code analysis

[DataWhale check-in] DDPG algorithm (Deep Deterministic Policy Gradient)

Video reference from: https://www.bilibili.com/video/BV1yv411i7xd?p=19

1. Mind map

2. Detailed explanation

DDPG is an algorithm for continuous control problems. Unlike PPO, whose output is a policy, i.e. a probability distribution over actions, the output of DDPG is a deterministic action.

DDPG adopts the Actor-Critic architecture and is an improvement on DQN. The action space in DQN must be discrete, so DQN cannot handle continuous action spaces. DDPG modifies it by introducing an Actor network whose output is directly a continuous action.
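A minimal conceptual sketch (not from the original post) of the difference in how an action is produced:

import torch

# DQN: the network outputs one Q value per discrete action, and the agent picks the argmax
q_values = torch.tensor([0.1, 0.7, 0.2])
discrete_action = torch.argmax(q_values).item()   # -> 1, an index into a finite action set

# DDPG: the actor network directly outputs a continuous action (here bounded by tanh)
continuous_action = torch.tanh(torch.tensor([0.35]))   # a real-valued action in [-1, 1]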

|        | AC                                 | DDPG                   |
| ------ | ---------------------------------- | ---------------------- |
| Actor  | outputs a probability distribution | outputs an action      |
| Critic | estimates the V value              | estimates the Q value  |
| Update | gradient update with weights       | gradient ascent        |
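For reference, these differences boil down to two standard DDPG objectives (not spelled out in the original post): the actor is trained by the deterministic policy gradient to maximize the critic's Q value, and the critic is trained by regression onto a TD target computed with the target networks:

$\nabla_{\theta} J(\theta) \approx \mathbb{E}_{s}\!\left[\nabla_{a} Q_{\phi}(s, a)\big|_{a=\mu_{\theta}(s)}\,\nabla_{\theta}\mu_{\theta}(s)\right]$

$L(\phi) = \mathbb{E}\!\left[\left(Q_{\phi}(s, a) - \left(r + \gamma\, Q_{\phi'}(s', \mu_{\theta'}(s'))\right)\right)^{2}\right]$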

When optimizing the Q network, a constantly changing Q-target makes the update unstable. Similar to DQN, DDPG therefore fixes the target networks: the target networks are frozen, the online networks are updated, and then the parameters are copied back to the target networks. So four networks are needed:

  • actor
  • critic
  • target actor
  • target critic

As can be seen from the figure above, DDPG (also an Actor-Critic method) is in fact a temporal-difference method that combines value-based and policy-based approaches. The policy part is the Actor, which outputs an action; the value function is the Critic, which evaluates the quality of the action given by the Actor and produces a temporal-difference signal to guide the updates of both the value function and the policy.

3. Code

The code walkthrough covers the main modules of the DDPG algorithm:

3.1 Background

The problem to be solved by DDPG here is a pendulum problem, Pendulum-v0. In this version of the problem, the pendulum starts at a random position, and the goal is to swing it up to keep it upright. This is a problem of continuous control.

State representation: a 3-dimensional vector $[\cos\theta, \sin\theta, \dot{\theta}]$, where $\theta$ is the pendulum angle and $\dot{\theta}$ its angular velocity.

Action space: a single continuous torque applied to the joint, limited to $[-2.0, 2.0]$.

Reward evaluation:
$-(\theta^2 + 0.1\,\dot{\theta}^2 + 0.001\cdot\text{action}^2)$

It can be seen that the goal is to keep the angle at zero, i.e. upright, while using the smallest possible angular velocity and the smallest possible torque.
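A small sketch for inspecting the environment; it assumes an older gym release in which Pendulum-v0 is still registered and step() returns a 4-tuple:

import gym

env = gym.make("Pendulum-v0")
print(env.observation_space)   # Box(3,): [cos(theta), sin(theta), theta_dot]
print(env.action_space)        # Box(1,): torque in [-2.0, 2.0]

state = env.reset()
action = env.action_space.sample()            # random torque, just to probe the interface
next_state, reward, done, info = env.step(action)
print(reward)                                 # -(theta^2 + 0.1*theta_dot^2 + 0.001*action^2)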

3.2 Actor

The role of the Actor is to take the state as input and output an action. Because the action space in DDPG is continuous, a tanh activation is used at the output layer to bound the action.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    def __init__(self, n_obs, n_actions, hidden_size, init_w=3e-3):
        super(Actor, self).__init__()
        self.linear1 = nn.Linear(n_obs, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.linear3 = nn.Linear(hidden_size, n_actions)
        # initialize the output layer with small random values
        self.linear3.weight.data.uniform_(-init_w, init_w)
        self.linear3.bias.data.uniform_(-init_w, init_w)

    def forward(self, x):
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = torch.tanh(self.linear3(x))  # tanh bounds the action to [-1, 1]
        return x

In terms of implementation, it is a network of several fully connected layers whose output is a continuous value in [-1, 1].
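Note that tanh bounds the output to [-1, 1], while Pendulum-v0 expects a torque in [-2, 2]. A common approach (not shown in the snippet above) is to rescale the actor output by the action bound, roughly like this:

actor = Actor(n_obs=3, n_actions=1, hidden_size=256)   # hidden_size here is arbitrary
state = torch.FloatTensor([[1.0, 0.0, 0.0]])           # one observation: [cos(theta), sin(theta), theta_dot]
max_action = 2.0                                       # Pendulum-v0 torque limit
action = actor(state) * max_action                     # rescale the tanh output from [-1, 1] to [-2, 2]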

3.3 Critic

The Critic in DDPG takes the current state and the action produced by the Actor as input, and outputs the expected Q value of taking that action in that state.

class Critic(nn.Module):
    def __init__(self, n_obs, n_actions, hidden_size, init_w=3e-3):
        super(Critic, self).__init__()
        
        self.linear1 = nn.Linear(n_obs + n_actions, hidden_size)
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.linear3 = nn.Linear(hidden_size, 1)
        # randomly initialize the output layer with small values
        self.linear3.weight.data.uniform_(-init_w, init_w)
        self.linear3.bias.data.uniform_(-init_w, init_w)
        
    def forward(self, state, action):
        # concatenate state and action along dimension 1
        x = torch.cat([state, action], 1)
        x = F.relu(self.linear1(x))
        x = F.relu(self.linear2(x))
        x = self.linear3(x)
        return x
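A quick shape sanity check for the Critic (a sketch; torch is assumed to be imported as in the Actor snippet, and the sizes are only illustrative):

critic = Critic(n_obs=3, n_actions=1, hidden_size=256)
state = torch.randn(128, 3)      # a batch of 128 Pendulum-v0 observations
action = torch.randn(128, 1)     # a batch of 128 actions
q_value = critic(state, action)  # shape (128, 1): one Q value per (state, action) pair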

3.4 Replay Buffer

The Replay Buffer stores (state, action, reward, next_state, done) transitions waiting to be learned from.

import random
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0
    
    def push(self, state, action, reward, next_state, done):
        # grow the buffer until it reaches capacity, then overwrite the oldest transition
        if len(self.buffer) < self.capacity:
            self.buffer.append(None)
        self.buffer[self.position] = (state, action, reward, next_state, done)
        self.position = (self.position + 1) % self.capacity
    
    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        state_batch, action_batch, reward_batch, next_state_batch, done_batch = map(np.stack, zip(*batch))
        return state_batch, action_batch, reward_batch, next_state_batch, done_batch
    
    def __len__(self):
        return len(self.buffer)

The capacity of the Replay Buffer is configurable. push adds one transition to the buffer (overwriting the oldest one once the buffer is full); sample draws batch_size transitions from the buffer.
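A minimal usage sketch with made-up numbers:

buffer = ReplayBuffer(capacity=10000)
buffer.push(state=[1.0, 0.0, 0.0], action=[0.5], reward=-1.2,
            next_state=[0.99, 0.05, 0.3], done=False)
# once enough transitions have been collected:
if len(buffer) >= 128:
    states, actions, rewards, next_states, dones = buffer.sample(128)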

3.5 DDPG

The DDPG agent ties together all of the objects above: Critic, Target Critic, Actor, Target Actor, and the replay memory.

The init function is as follows:

def __init__(self, n_states, n_actions, hidden_dim=30, device="cpu", critic_lr=1e-3,
                actor_lr=1e-4, gamma=0.99, soft_tau=1e-2, memory_capacity=100000, batch_size=128):
    self.device = device
    
    self.critic = Critic(n_states, n_actions, hidden_dim).to(device)
    self.actor = Actor(n_states, n_actions, hidden_dim).to(device)

    self.target_critic = Critic(n_states, n_actions, hidden_dim).to(device)
    self.target_actor = Actor(n_states, n_actions, hidden_dim).to(device)

    # hard-copy the online network weights into the target networks at initialization
    for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
        target_param.data.copy_(param.data)
    for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
        target_param.data.copy_(param.data)

    self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=critic_lr)
    self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=actor_lr)
    
    self.memory = ReplayBuffer(memory_capacity)

    self.batch_size = batch_size
    self.soft_tau = soft_tau
    self.gamma = gamma
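For Pendulum-v0 the constructor would be called roughly as follows (a sketch; the class holding this __init__ is assumed to be named DDPG, as in the referenced implementation):

agent = DDPG(n_states=3, n_actions=1, device="cpu")   # Pendulum-v0: 3-dim state, 1-dim action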

The core function is the update function:

def update(self):
    if len(self.memory) < self.batch_size:
        return
    state, action, reward, next_state, done = self.memory.sample(
        self.batch_size)
    # convert all variables to tensors
    state = torch.FloatTensor(state).to(self.device)
    next_state = torch.FloatTensor(next_state).to(self.device)
    action = torch.FloatTensor(action).to(self.device)
    reward = torch.FloatTensor(reward).unsqueeze(1).to(self.device)
    done = torch.FloatTensor(np.float32(done)).unsqueeze(1).to(self.device)
    # note that the critic takes (s_t, a) as input
    policy_loss = self.critic(state, self.actor(state))
    
    policy_loss = -policy_loss.mean()

    next_action = self.target_actor(next_state)
    target_value = self.target_critic(next_state, next_action.detach())
    expected_value = reward + (1.0 - done) * self.gamma * target_value
    expected_value = torch.clamp(expected_value, -np.inf, np.inf)

    value = self.critic(state, action)
    value_loss = nn.MSELoss()(value, expected_value.detach())
    
    self.actor_optimizer.zero_grad()
    policy_loss.backward()
    self.actor_optimizer.step()

    self.critic_optimizer.zero_grad()
    value_loss.backward()
    self.critic_optimizer.step()
    # soft-update the target networks: target = tau * online + (1 - tau) * target
    for target_param, param in zip(self.target_critic.parameters(), self.critic.parameters()):
        target_param.data.copy_(
            target_param.data * (1.0 - self.soft_tau) +
            param.data * self.soft_tau
        )
    for target_param, param in zip(self.target_actor.parameters(), self.actor.parameters()):
        target_param.data.copy_(
            target_param.data * (1.0 - self.soft_tau) +
            param.data * self.soft_tau
        )
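The last two loops implement the soft (Polyak) update of the target networks, with $\tau$ equal to soft_tau (0.01 by default):

$\theta_{\text{target}} \leftarrow \tau\,\theta + (1-\tau)\,\theta_{\text{target}}$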

The overall process is as follows:

  • Sample a batch of data from memory.
  • Compute the policy loss: policy_loss = self.critic(state, self.actor(state))
    • Feed the state into the actor to get an action.
    • Feed the state and that action into the critic; the policy loss is the negated mean of the critic's output, so that the actor performs gradient ascent on Q.
  • The target actor and target critic produce the target value in the same way:
    next_action = self.target_actor(next_state)
    target_value = self.target_critic(next_state, next_action.detach())
  • Calculate the expected value from the target value:

$r + \gamma\, Q_{\text{target}}$

The implementation is as follows:

expected_value = reward + (1.0 - done) * self.gamma * target_value
expected_value = torch.clamp(expected_value, -np.inf, np.inf)

If done is 1, the episode has terminated and the discounted term is dropped. The second line clamps the expected value; with bounds of negative and positive infinity it has no effect here, but it is where a finite range could be imposed on the target if needed.

  • Next, compute the Q value of the actions actually stored in the batch:
    value = self.critic(state, action)
  • Compute the loss of the Q network using MSE loss:
    value_loss = nn.MSELoss()(value, expected_value.detach())


  • Backpropagate the policy loss and the value loss, update the network parameters, and then soft-update the target networks.
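For completeness, a minimal training-loop sketch (not part of the original post). It assumes the agent exposes a choose_action(state) method, a hypothetical helper that runs the actor and scales the output to Pendulum-v0's [-2, 2] torque range, and it uses simple Gaussian exploration noise, whereas the original DDPG paper uses Ornstein-Uhlenbeck noise.

import gym
import numpy as np

env = gym.make("Pendulum-v0")
agent = DDPG(n_states=3, n_actions=1)

for episode in range(200):
    state = env.reset()
    episode_reward = 0.0
    for step in range(200):
        action = agent.choose_action(state)                                      # hypothetical helper
        action = np.clip(action + np.random.normal(0, 0.1, size=1), -2.0, 2.0)   # exploration noise
        next_state, reward, done, _ = env.step(action)
        agent.memory.push(state, action, reward, next_state, done)
        agent.update()                                                           # one gradient step per env step
        state = next_state
        episode_reward += reward
        if done:
            break
    print(f"episode {episode}: reward {episode_reward:.1f}")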

The training results are as follows:

4. References

The code is based on johnjim's implementation; thanks.

https://www.jianshu.com/p/af3a7853268f

https://datawhalechina.github.io/leedeeprl-notes/#/chapter12/project3

https://www.bilibili.com/video/BV1yv411i7xd?p=19
