[Reinforcement Learning] One of the commonly used algorithms: "SAC"

 

Author's homepage: https://blog.csdn.net/Code_and516?type=blog — a blogger who focuses on algorithms, Python, computer vision, image processing, deep learning, PyTorch, neural networks, and OpenCV. Personal profile: an ordinary worker.

I regularly share machine learning, deep learning, and Python-related content, solutions to everyday bugs, and practical Windows & Linux tips.

If you find an error in the article, please point it out and I will correct it promptly. If you have other requests, you can send me a private message or an email: [email protected] 

        Reinforcement Learning is a machine learning approach in which an agent learns a policy by interacting with the environment in order to maximize the expected cumulative reward. The SAC (Soft Actor-Critic) algorithm is a reinforcement learning algorithm that combines policy optimization with value function learning to achieve stable, sample-efficient optimization in continuous action spaces.

This article explains in detail one of the commonly used reinforcement learning algorithms, "SAC".


Table of contents

1. Introduction

2. History

3. Algorithm formula

Policy update formula:

Q value function update formula:

Value function update formula:

4. Algorithm principle

        1. Policy optimization

        2. Value function learning

        3. Entropy optimization

        4. Adaptive temperature parameters

5. Algorithm function

6. Example code

7. Summary


1. Introduction

        Reinforcement Learning (RL) is a branch of machine learning whose goal is to let an agent learn an optimal behavior policy through interaction with its environment. The SAC (Soft Actor-Critic) algorithm is one of the algorithms that has made important breakthroughs in reinforcement learning in recent years. It is based on policy optimization and value function learning. Compared with traditional reinforcement learning algorithms, SAC introduces entropy regularization and soft policy updates into the optimization process, so that the agent can better explore unknown states and learn more efficiently.

2. History

        The development of the SAC algorithm builds on earlier work. Before introducing the SAC algorithm, let us first look at some related algorithms.

        1. DQN (Deep Q-Networks): DQN is a reinforcement learning algorithm proposed by DeepMind that was the first to combine deep neural networks with Q-Learning.

        By using experience replay and a target network to improve the stability of learning, the DQN algorithm achieved excellent results on many benchmarks.

        2. DDPG (Deep Deterministic Policy Gradient): DDPG is a deep reinforcement learning algorithm for continuous action spaces that combines deep neural networks with deterministic policy gradients.

        The DDPG algorithm performs well on continuous control problems and is widely used in practical applications.

        3. Predecessors of the SAC algorithm: the predecessors of SAC include the TD3 (Twin Delayed DDPG) and DDPG algorithms.

        The TD3 algorithm introduces twin Q-networks and delayed policy updates on top of DDPG, which further improves performance. The SAC algorithm expands further on this line of work and introduces techniques such as entropy optimization and adaptive temperature parameters to handle more complex tasks.

        The SAC algorithm was first proposed by Haarnoja et al. in 2018 and presented at ICML 2018. The algorithm combines the Actor-Critic method with the concept of entropy in reinforcement learning, and provides a more efficient and stable solution for continuous control tasks.

3. Algorithm formula

        The SAC algorithm is built around the following core formulas (a PyTorch sketch of how they translate into loss terms follows the list):

  • Policy update formula:

        ∇θpolicy J(θpolicy) = E a∼π [ ∇θpolicy log π(a∣s) · ( α log π(a∣s) − Qπ(s,a) + Ṽ(s) ) ]

        Among them, ∇θpolicy J(θpolicy) represents the gradient of the policy objective, π(a∣s) represents the probability that the policy takes action a in state s, Qπ(s,a) represents the value function of the state-action pair (s,a), α represents the entropy adjustment coefficient (temperature), and Ṽ(s) represents the soft (softened) value function used as a baseline. 

  • Q value function update formula:

        Q(s,a) = r(s,a) + γ E s′∼p(s′∣s,a) [ V(s′) ]

        Among them, Q(s,a) represents the value function of the state-action pair (s,a), r(s,a) represents the immediate reward obtained when action a is taken in state s, γ represents the discount factor, V(s′) represents the value function of the next state s′, and p(s′∣s,a) represents the probability of transitioning to state s′ after taking action a in state s. 

  • Value function update formula:

        V(s) = E a∼π [ Q(s,a) − α log π(a∣s) ]

        Among them, V(s) represents the value function of state s, a ∼ π means that action a is sampled from the policy π, and the α log π(a∣s) term is the entropy bonus that makes this a "soft" value function.
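
        To make these updates concrete, the following minimal PyTorch sketch shows how the three formulas typically translate into loss terms. It assumes hypothetical networks policy_net (with a sample(states) method returning actions and their log-probabilities), q_net(states, actions), and value_net(states), a fixed temperature alpha, and a sampled mini-batch of transitions; it illustrates the formulas above rather than reproducing a full training loop, and the policy term uses the reparameterized form that most implementations rely on.

import torch
import torch.nn.functional as F

def sac_losses(batch, policy_net, q_net, value_net, alpha=0.2, gamma=0.99):
    """Illustrative SAC loss terms matching the three update formulas above."""
    states, actions, rewards, next_states, dones = batch  # tensors of shape (B, ...)

    # Value function update: V(s) should match E_{a~pi}[ Q(s,a) - alpha * log pi(a|s) ]
    new_actions, log_probs = policy_net.sample(states)        # a ~ pi(.|s), log pi(a|s)
    v_target = q_net(states, new_actions) - alpha * log_probs
    value_loss = F.mse_loss(value_net(states), v_target.detach())

    # Q value function update: Q(s,a) should match r(s,a) + gamma * (1 - done) * V(s')
    q_target = rewards + gamma * (1.0 - dones) * value_net(next_states)
    q_loss = F.mse_loss(q_net(states, actions), q_target.detach())

    # Policy update: minimize E[ alpha * log pi(a|s) - Q(s,a) ] with reparameterized actions
    policy_loss = (alpha * log_probs - q_net(states, new_actions)).mean()

    return policy_loss, q_loss, value_loss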

4. Algorithm principle

        The SAC algorithm employs a series of techniques to achieve stable, sample-efficient optimization in continuous action spaces. Its main principles are introduced below:

        1. Policy optimization

        The SAC algorithm uses a policy gradient method for optimization. By maximizing the soft Q-value objective, SAC can sample effectively in continuous action spaces, which improves sample efficiency and final performance; a sketch of the reparameterized sampling that makes this possible is shown below.
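
        As a concrete illustration of how SAC samples actions in a continuous action space, here is a minimal sketch of the reparameterized, tanh-squashed Gaussian sampling commonly used in SAC implementations (the example code later in this article uses a simpler clamp instead). The inputs mean, log_std, and max_action are assumed to come from a policy network; the names are illustrative.

import torch
from torch.distributions import Normal

def squashed_gaussian_sample(mean, log_std, max_action):
    """Reparameterized action sample with tanh squashing and log-probability correction."""
    std = log_std.clamp(-20, 2).exp()
    normal = Normal(mean, std)
    u = normal.rsample()                      # differentiable sample (reparameterization trick)
    action = max_action * torch.tanh(u)       # squash into the valid action range
    # Change-of-variables correction for the squashing: log p(a) = log p(u) - log |da/du|
    log_prob = normal.log_prob(u) - torch.log(max_action * (1 - torch.tanh(u).pow(2)) + 1e-6)
    log_prob = log_prob.sum(dim=-1, keepdim=True)
    return action, log_prob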

        2. Value function learning

        The SAC algorithm also learns a value function, which allows the value of state-action pairs to be estimated more accurately. The value function is learned by minimizing the Bellman error, which further improves the performance of the algorithm.

        3. Entropy optimization

        The SAC algorithm optimizes the policy while maximizing its entropy. Entropy measures the uncertainty (randomness) of a policy: by encouraging high entropy, the policy stays more balanced and diversified, which improves exploration and helps the algorithm adapt to different environments and tasks.

        4. Adaptive temperature parameters

        The SAC algorithm introduces an adaptive temperature parameter α. By optimizing the temperature parameter, a balance is struck between maximizing the expected cumulative reward and keeping the policy entropy high enough; this lets the algorithm adapt to different tasks and environments and improves its performance. A minimal sketch of this adjustment is shown below.
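
        For illustration, here is a minimal sketch of the automatic temperature adjustment used in later SAC variants, assuming the common heuristic of a target entropy equal to the negative action dimension; names such as log_alpha and target_entropy are illustrative. α is parameterized through its logarithm so that it stays positive, and it is trained so that the policy entropy stays close to the target.

import torch
import torch.optim as optim

action_dim = 1                                    # e.g. Pendulum has a 1-dimensional action
target_entropy = -float(action_dim)               # common heuristic: target entropy = -|A|
log_alpha = torch.zeros(1, requires_grad=True)    # optimize log(alpha) so that alpha > 0
alpha_optimizer = optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_probs):
    """One adaptive-temperature step; log_probs holds log pi(a|s) for a sampled batch."""
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()                 # current value of alpha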

5. Algorithm function

        The SAC algorithm has the following main functions in reinforcement learning tasks:

  1. Support for continuous action spaces: the SAC algorithm is well suited to tasks with continuous action spaces, such as robot control and autonomous driving.
  2. Efficient and stable policy updates: by introducing an entropy adjustment term and a soft value function, the SAC algorithm improves the exploration ability and stability of the policy without sacrificing efficiency.
  3. Better learning performance: compared with traditional reinforcement learning algorithms, SAC usually achieves better learning performance on continuous control tasks.
  4. Flexible parameter settings: parameters such as the entropy adjustment coefficient (temperature) can be adjusted flexibly according to the needs of the task to obtain the best performance.

6. Example code

        Below is sample code that trains and tests an agent on the inverted pendulum environment with the SAC algorithm. The code can be executed after installing OpenAI Gym with the inverted pendulum environment (Pendulum-v0) or another environment suitable for the SAC algorithm.

import gym
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Normal


class ReplayBuffer:
    """Fixed-capacity FIFO buffer that stores transitions and samples random mini-batches."""

    def __init__(self, capacity):
        self.buffer = []
        self.capacity = capacity

    def __len__(self):
        return len(self.buffer)

    def push(self, state, action, reward, next_state, done):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = zip(*random.sample(self.buffer, batch_size))
        # Stack each field with NumPy first for efficient tensor creation
        return [torch.as_tensor(np.array(field), dtype=torch.float32) for field in batch]


class SoftQNetwork(nn.Module):
    """Soft Q-network Q(s, a): maps a state-action pair to a scalar soft Q-value."""

    def __init__(self, state_dim, action_dim):
        super(SoftQNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 1)

    def forward(self, state, action):
        x = torch.cat([state, action], dim=1)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)


class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.mean = nn.Linear(256, action_dim)
        self.log_std = nn.Linear(256, action_dim)
        self.max_action = max_action

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        mean = self.mean(x)
        log_std = self.log_std(x).clamp(-20, 2)
        return mean, log_std

    def sample(self, state):
        mean, log_std = self.forward(state)
        std = log_std.exp()
        normal = Normal(mean, std)
        # Reparameterized sample (rsample) so gradients can flow through the action
        action = normal.rsample()
        # Simplification: the log-probability is taken before clamping the action.
        # A full SAC implementation squashes the action with tanh and applies the
        # corresponding log-probability correction instead.
        log_prob = normal.log_prob(action).sum(dim=-1, keepdim=True)
        return action.clamp(-self.max_action, self.max_action), log_prob


class SACAgent:
    def __init__(self, state_dim, action_dim, max_action, gamma=0.99, alpha=0.2):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.gamma = gamma  # discount factor
        self.alpha = alpha  # fixed entropy temperature (SAC can also tune this automatically)
        # Simplified critic: a single soft Q-network with hard-copied target weights.
        # The full SAC algorithm uses twin Q-networks and Polyak-averaged targets.
        self.q_net = SoftQNetwork(state_dim, action_dim).to(self.device)
        self.target_q_net = SoftQNetwork(state_dim, action_dim).to(self.device)
        self.target_q_net.load_state_dict(self.q_net.state_dict())
        self.policy_net = PolicyNetwork(state_dim, action_dim, max_action).to(self.device)
        self.replay_buffer = ReplayBuffer(capacity=1000000)
        self.q_optimizer = optim.Adam(self.q_net.parameters(), lr=3e-4)
        self.policy_optimizer = optim.Adam(self.policy_net.parameters(), lr=3e-4)
        self.q_criterion = nn.MSELoss()

    def update_value_network(self, states, actions, rewards, next_states, dones):
        # Soft Bellman target: y = r + gamma * (1 - done) * [Q_target(s', a') - alpha * log pi(a'|s')]
        with torch.no_grad():
            next_actions, next_log_probs = self.policy_net.sample(next_states)
            next_q = self.target_q_net(next_states, next_actions)
            q_targets = rewards + self.gamma * (1 - dones) * (next_q - self.alpha * next_log_probs)
        q_values = self.q_net(states, actions)
        loss = self.q_criterion(q_values, q_targets)
        self.q_optimizer.zero_grad()
        loss.backward()
        self.q_optimizer.step()

    def update_policy_network(self, states):
        # Actor loss: minimize E[alpha * log pi(a|s) - Q(s, a)] over reparameterized actions
        actions, log_probs = self.policy_net.sample(states)
        q_values = self.q_net(states, actions)
        policy_loss = (self.alpha * log_probs - q_values).mean()
        self.policy_optimizer.zero_grad()
        policy_loss.backward()
        self.policy_optimizer.step()

    def update_target_network(self):
        # Hard target update; the full SAC algorithm typically uses Polyak (soft) averaging
        self.target_q_net.load_state_dict(self.q_net.state_dict())

    def train(self, env, num_episodes, batch_size, update_interval):
        episode_rewards = []
        for episode in range(num_episodes):
            state = env.reset()  # reset the environment at the start of every episode
            episode_reward = 0
            done = False
            while not done:
                action, _ = self.policy_net.sample(torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device))
                action = action.cpu().detach().numpy()[0]
                next_state, reward, done, _ = env.step(action)
                self.replay_buffer.push(state, action, reward, next_state, float(done))
                state = next_state
                episode_reward += reward
                if len(self.replay_buffer) > batch_size:
                    states, actions, rewards, next_states, dones = self.replay_buffer.sample(batch_size)
                    self.update_value_network(states.float().to(self.device),
                                              actions.float().to(self.device),
                                              rewards.float().unsqueeze(1).to(self.device),
                                              next_states.float().to(self.device),
                                              dones.float().unsqueeze(1).to(self.device))
                    if episode % update_interval == 0:
                        self.update_policy_network(states.float().to(self.device))
                        self.update_target_network()
            episode_rewards.append(episode_reward)
        return episode_rewards

    def test(self, env):
        state = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action, _ = self.policy_net.sample(torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device))
            state, reward, done, _ = env.step(action.cpu().detach().numpy()[0])
            episode_reward += reward
        return episode_reward


if __name__ == "__main__":
    env_name = "Pendulum-v0"
    env = gym.make(env_name)
    env.seed(0)
    torch.manual_seed(0)

    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]
    max_action = float(env.action_space.high[0])

    agent = SACAgent(state_dim, action_dim, max_action)

    num_episodes = 100
    batch_size = 128
    update_interval = 10

    episode_rewards = agent.train(env, num_episodes, batch_size, update_interval)
    test_reward = agent.test(env)

    print("Training rewards:", episode_rewards)
    print("Test reward:", test_reward)

        This code first defines a replay buffer (ReplayBuffer) class for storing and sampling experience. It then defines the soft Q-network (SoftQNetwork) and policy network (PolicyNetwork) classes, which estimate the soft Q-value of state-action pairs and the policy, respectively. Next comes the SACAgent class, which contains the methods that update the Q-network, the policy network, and the target Q-network. Finally, the training and testing methods are defined: the training method runs multiple episodes in the environment and returns the cumulative reward of each episode, while the testing method evaluates the performance of the trained policy in the environment.

        Before running the sample code, you need to install the OpenAI Gym and PyTorch libraries. Depending on your setup, parameters such as env_name, num_episodes, batch_size, and update_interval can be modified. In the output, "Training rewards" lists the cumulative reward of each training episode, and "Test reward" is the cumulative reward obtained by running the trained policy in the test environment.
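
        Note that Pendulum-v0 only exists in older Gym releases. If you are using a recent version of Gym (0.26 or later) or Gymnasium, the environment ID is Pendulum-v1 and the reset/step API is slightly different, so a small adaptation along the following lines would be needed (a sketch, not part of the original code):

import gymnasium as gym  # or a recent version of gym with the same API

env = gym.make("Pendulum-v1")
state, info = env.reset(seed=0)                                  # reset() now returns (obs, info)
action = env.action_space.sample()
next_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated                                   # combine the two termination flags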

Running results:

Training rewards: [-1653.2674326245028, -4.761833854079121, -5.92794045978663, -7.101895383837817, -8.203949829019429, -9.320596188422504, -10.398472530688595, -11.046385714744188, -10.069612464051666, -9.028488437838597, -7.656846467478978, -6.751302759291316, -5.892224950031628, -4.932040818022195, -4.404335946243107, -3.9543475318455914, -3.8004235924909593, -3.8954312087615484, -4.121609662371389, -4.645552707416158, -5.194625548020546, -6.270942803647476, -7.722571387132912, -9.49117141815922, -10.748767915311705, -11.837523420567333, -10.51287854951289, -8.911409206767225, -7.3159242910765805, -6.26554445728115, -5.318318410816599, -4.47352859150234, -3.705487907578077, -3.155346120863036, -2.6655070443703384, -2.5458468110930834, -2.7734881221702694, -3.1021955848735714, -3.9183340756372385, -4.677010046229791, -5.57281093988401, -6.5885638098856845, -7.691982718183524, -9.510764926014309, -10.809366474064687, -11.987368416541688, -10.57040679863866, -9.250008035195474, -7.908586504443504, -6.220578988348704, -4.8460643338024765, -4.060980241950622, -3.405435529895923, -2.767329044940599, -2.511189533487366, -2.4275672225189084, -2.454642944293755, -2.5937254351057217, -3.1160835151897968, -4.058114436352538, -5.445887904623622, -6.620130141605474, -7.949470581770992, -9.310201166829376, -11.434984365444118, -12.219258381790816, -10.891645129483637, -9.486480025372442, -8.059946018495705, -6.6809631024851495, -4.991482855801217, -3.7126215715421353, -3.031910380007442, -2.374267357519335, -2.0286805007142283, -2.0474943313467784, -2.4227627809752352, -3.191653624713721, -4.2051864164440875, -5.190187599031304, -6.332166895519481, -7.600904756549318, -8.942357396564006, -10.40428240474939, -11.714269430490143, -10.826518362820941, -9.66884676107395, -8.464936630889763, -6.899476506182678, -5.903640338789183, -4.751731696723347, -4.017007527711459, -3.5796759436048413, -3.328303909157216, -3.4151482609755326, -3.8343294110510615, -4.676829653734708, -5.442567944257961, -6.859903604078736, -8.648312542545764]
Test reward: -1207.2040024164842
 

7. Summary

        This article has introduced the SAC (Soft Actor-Critic) algorithm in reinforcement learning in detail, including its development history, algorithm formulas, principles, functions, and sample code. The SAC algorithm is a reinforcement learning algorithm based on policy optimization and value function learning; by introducing entropy regularization and a soft value function, the agent can better explore unknown states and optimize its policy. The sample code shows how the SAC algorithm is applied to the inverted pendulum (Pendulum) problem, gradually improving the agent's control performance by training it to interact with the environment. The SAC algorithm offers good performance and stability on continuous control tasks and can be expected to be applied to more complex reinforcement learning tasks in the future.

 
