[Reinforcement Learning] A Commonly Used Algorithm: PPO

 


I will continue to share content on machine learning, deep learning, and Python, along with daily bug fixes and practical Windows & Linux tips.

If you find an error in the article, please point it out and I will correct it promptly. For anything else, you can send me a private message or an email: [email protected]

        Reinforcement learning, as a branch of machine learning, aims to let an agent learn an optimal behavior policy through interaction with its environment. In recent years, reinforcement learning has achieved important breakthroughs in many fields, and the Proximal Policy Optimization (PPO) algorithm is one of the most important policy optimization algorithms among them.

This article will explain in detail one of the commonly used algorithms for reinforcement learning, "PPO".


Table of contents

1. Introduction

2. History

3. Algorithm Formula Explanation

        1. Objective function

        2. Surrogate objective function

        3. Update steps

4. Algorithm principle

5. Algorithm features

6. Example code

7. Summary


1. Introduction

        Reinforcement learning is a machine learning method in which an agent learns an optimal behavior policy through interaction with its environment. Compared with supervised and unsupervised learning, reinforcement learning is characterized by delayed rewards and a trial-and-error mechanism: the agent influences the environment by choosing actions, receives rewards from the environment as feedback, and aims to learn the policy that maximizes its long-term return.

        The PPO algorithm belongs to the family of policy optimization algorithms and was proposed by OpenAI in 2017. Compared with other policy optimization algorithms, PPO achieves higher sample efficiency and better convergence. It also performs well in distributed training and with large-scale models, so it is widely used in fields such as robot control, autonomous driving, and games.

2. History

        Before introducing the PPO algorithm, it helps to understand some related algorithms. PPO is an improvement on TRPO (Trust Region Policy Optimization), which was proposed by Schulman et al. in 2015. TRPO introduces a constraint that keeps each policy update from changing the policy too much, thereby ensuring the stability of learning. However, TRPO's computational cost is high, which limits its range of application.

        To address the computational cost of TRPO, Schulman et al. proposed the PPO algorithm in 2017. PPO replaces TRPO's KL-divergence (relative entropy) constraint with a clipped probability ratio. In this way, the computational complexity is greatly reduced, making PPO more efficient in practical applications.

3. Algorithm Formula Explanation

        1. Objective function

        The goal of the PPO algorithm is to maximize the expected return. Let the state be s, the action be a, the policy be π(a|s), the value function be V(s), and the reward at time step t be r_t. The objective is to maximize the expected discounted return, which gives the following objective function:

J(θ) = E_{π_θ}[ ∑_{t=0}^{∞} γ^t r_t ]

        Here, θ denotes the policy parameters, γ ∈ [0, 1) is the discount factor, and r_t is the reward received at time step t.
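        For reference (a standard result rather than something specific to this article), the policy gradient theorem expresses the gradient of this objective as an expectation that can be estimated from sampled trajectories:

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a_t|s_t) · A(s_t, a_t) ]

        Here A(s_t, a_t) is the advantage function introduced in the next subsection; PPO's surrogate objective below is a way of following this gradient while keeping each update close to the policy that collected the data.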

        2. Surrogate objective function

        Since directly optimizing this objective requires complex probability calculations, PPO optimizes an approximate (surrogate) objective instead. It introduces the probability ratio between the new and old policies, ratio(θ) = π_θ(a|s) / π_θ_old(a|s). The objective function can then be transformed into:

J_surrogate(θ) = E[ min( ratio(θ)·A(s, a), clip(ratio(θ), 1-ε, 1+ε)·A(s, a) ) ]

        Here, A(s, a) = Q(s, a) − V(s) is the advantage function, ratio(θ) = π_θ(a|s) / π_θ_old(a|s) is the probability ratio between the new and old policies, and ε is the clipping range.
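        To make the formula concrete, here is a minimal NumPy sketch of the clipped surrogate objective for a small batch of transitions. The function name and the toy numbers are illustrative only and are not part of the article's later example:

import numpy as np

def clipped_surrogate(new_probs, old_probs, advantages, eps=0.2):
    # ratio(θ) = π(a|s) / π_old(a|s), computed element-wise over the batch
    ratios = new_probs / old_probs
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - eps, 1 + eps) * advantages
    # element-wise minimum of the two terms, averaged over the batch
    return np.mean(np.minimum(unclipped, clipped))

# three sampled (state, action) pairs with made-up probabilities and advantages
new_probs = np.array([0.5, 0.2, 0.9])
old_probs = np.array([0.4, 0.25, 0.5])
advantages = np.array([1.0, -0.5, 2.0])
print(clipped_surrogate(new_probs, old_probs, advantages))

        In practice this surrogate is maximized, so an implementation typically minimizes its negative, exactly as the train_actor function in the example code below does.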

        3. Update steps

        The PPO algorithm trains the agent by alternating policy evaluation and policy improvement. In each iteration, a batch of experience is first collected with the current policy and then used to compute and apply the update. The specific update steps are as follows (a schematic sketch of this loop follows the list):

  • Collect experience data with the current policy;
  • Compute the gradient of the clipped surrogate objective and optimize the policy function;
  • Update the value function.
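        The sketch below only illustrates the order of these steps. The helper functions are dummy stand-ins that neither interact with a real environment nor update real networks, and all names here are hypothetical:

import numpy as np

def collect_rollout(num_steps=8):
    # stand-in for interacting with the environment under the current policy
    states = np.random.randn(num_steps, 4).astype(np.float32)
    actions = np.random.randint(0, 2, size=num_steps)
    rewards = np.random.rand(num_steps).astype(np.float32)
    return states, actions, rewards

def update_policy(states, actions, advantages):
    pass  # stand-in for one gradient step on the clipped surrogate objective

def update_value_function(states, returns):
    pass  # stand-in for one regression step of V(s) toward the returns

def ppo_iteration(gamma=0.99, epochs=4):
    # 1. collect experience data with the current policy
    states, actions, rewards = collect_rollout()
    # 2. compute discounted returns and (here, crudely centered) advantages
    returns = np.array([sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
                        for t in range(len(rewards))], dtype=np.float32)
    advantages = returns - returns.mean()
    # 3. several epochs of policy and value updates on the same batch of data
    for _ in range(epochs):
        update_policy(states, actions, advantages)
        update_value_function(states, returns)

ppo_iteration()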

4. Algorithm principle

        The core idea of the PPO algorithm is proximal policy optimization: in each iteration, the policy is optimized using a large amount of sampled data, while the size of the policy change is limited so that excessively large policy updates are avoided.

        The algorithm alternates between two phases: sampling and optimization. In the sampling phase, it collects training data by interacting with the environment; in the optimization phase, it uses the collected data and the gradient of the objective function to update the network parameters.

        The basic idea is to use an importance sampling ratio to control the size of each policy update. At every update, the algorithm computes the ratio between the new policy and the old policy and, by clipping this ratio, limits how far the policy can move. This clipping term is what gives PPO its improved training stability and efficiency.
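        As a tiny numerical illustration (toy numbers, not taken from the article): suppose the new policy assigns probability 0.6 to an action the old policy assigned 0.3, and that action's advantage is 1.5. The raw ratio is 2.0, but with ε = 0.2 the clipped term caps the contribution at 1.2 × 1.5, so with a positive advantage there is no benefit in pushing the ratio any further:

eps = 0.2
old_prob, new_prob = 0.3, 0.6
advantage = 1.5

ratio = new_prob / old_prob                          # 2.0
clipped_ratio = min(max(ratio, 1 - eps), 1 + eps)    # 1.2
objective_term = min(ratio * advantage, clipped_ratio * advantage)
print(ratio, clipped_ratio, objective_term)          # 2.0 1.2 ~1.8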

5. Algorithm features

        The PPO algorithm has the following features:

  1. Policy-based optimization: PPO improves the agent's performance in the environment by directly optimizing the policy, leading to better decisions and behavior.
  2. Efficient and stable: by limiting the range of each policy update and avoiding excessively large steps, PPO improves the stability and efficiency of training.
  3. Wide applicability: PPO is suitable for continuous action spaces and high-dimensional state spaces, and can be applied in many fields such as robot control and game AI.

6. Example code

        Below is a simple example showing how the PPO algorithm can be used to solve the CartPole reinforcement learning task.

        First, install the necessary dependencies (the example below uses the classic gym API, so a gym version earlier than 0.26 is assumed):

pip install tensorflow
pip install "gym<0.26"

 

        Next, write the code for the PPO algorithm: 

# -*- coding: utf-8 -*-
import tensorflow as tf
import gym
import numpy as np

env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]  # 4 observation variables for CartPole
action_dim = env.action_space.n             # 2 discrete actions (push left / push right)
hidden_dim = 32
lr = 0.001

# Actor network: maps a state to a probability distribution over actions
actor_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(hidden_dim, activation='relu', input_shape=(state_dim,)),
    tf.keras.layers.Dense(hidden_dim, activation='relu'),
    tf.keras.layers.Dense(action_dim, activation='softmax')
])

# Critic network: maps a state to a scalar state-value estimate V(s)
critic_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(hidden_dim, activation='relu', input_shape=(state_dim,)),
    tf.keras.layers.Dense(hidden_dim, activation='relu'),
    tf.keras.layers.Dense(1)
])

actor_optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
critic_optimizer = tf.keras.optimizers.Adam(learning_rate=lr)

def choose_action(state):
    # the actor outputs softmax action probabilities; sample an action from them
    probs = actor_model.predict(state[np.newaxis, :], verbose=0)[0]
    probs = probs / probs.sum()  # re-normalize to guard against float32 rounding in np.random.choice
    action = np.random.choice(range(action_dim), p=probs)
    return action

def compute_return(rewards, gamma):
    # discounted return G_t = r_t + γ * G_{t+1}, computed backwards through the episode
    returns = np.zeros_like(rewards)
    G = 0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

def compute_advantage(states, rewards, values, gamma, lamda):
    # Generalized Advantage Estimation (GAE): advantages are discounted sums of TD errors
    returns = compute_return(rewards, gamma)
    values = np.append(values, 0)  # flatten and bootstrap with 0 after the final step
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD errors δ_t
    advantages = np.zeros_like(rewards)
    A = 0
    for t in reversed(range(len(rewards))):
        A = deltas[t] + gamma * lamda * A
        advantages[t] = A
    return returns, advantages

def train_actor(states, actions, advantages, old_probs, eps):
    with tf.GradientTape() as tape:
        probs_new = actor_model(states, training=True)  # current action probabilities (softmax output)
        # probability of the action actually taken, under the current policy
        probs_taken = tf.reduce_sum(tf.one_hot(actions, action_dim) * probs_new, axis=1)
        # importance sampling ratio π(a|s) / π_old(a|s)
        ratios = tf.exp(tf.math.log(probs_taken) - tf.math.log(old_probs))
        surrogate_obj1 = ratios * advantages
        surrogate_obj2 = tf.clip_by_value(ratios, 1 - eps, 1 + eps) * advantages
        surrogate_obj = tf.minimum(surrogate_obj1, surrogate_obj2)
        loss = -tf.reduce_mean(surrogate_obj)  # maximize the clipped surrogate = minimize its negative
    grads = tape.gradient(loss, actor_model.trainable_variables)
    actor_optimizer.apply_gradients(zip(grads, actor_model.trainable_variables))

def train_critic(states, returns):
    with tf.GradientTape() as tape:
        values = critic_model(states, training=True)
        mse = tf.keras.losses.MeanSquaredError()
        loss = mse(returns, tf.squeeze(values))  # regress V(s) toward the observed returns
    grads = tape.gradient(loss, critic_model.trainable_variables)
    critic_optimizer.apply_gradients(zip(grads, critic_model.trainable_variables))

gamma = 0.99                 # discount factor
lamda = 0.95                 # GAE parameter λ
eps = 0.2                    # clipping range ε
max_episodes = 200
max_steps_per_episode = 1000

for episode in range(max_episodes):
    state = env.reset()
    done = False
    episode_reward = 0
    states, actions, rewards, values, old_probs = [], [], [], [], []

    for step in range(max_steps_per_episode):
        action = choose_action(state)
        next_state, reward, done, _ = env.step(action)

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        values.append(critic_model.predict(state[np.newaxis, :], verbose=0)[0])            # V(s) estimate
        old_probs.append(actor_model.predict(state[np.newaxis, :], verbose=0)[0][action])  # π_old(a|s)

        episode_reward += reward
        state = next_state

        if done:
            break

    states = np.array(states)
    actions = np.array(actions)
    rewards = np.array(rewards)
    values = np.array(values)
    old_probs = np.array(old_probs)

    returns, advantages = compute_advantage(states, rewards, values, gamma, lamda)
    returns = returns.astype('float32')
    advantages = advantages.astype('float32')

    # one actor update and one critic update per collected episode (a simplification of the full PPO procedure)
    train_actor(states, actions, advantages, old_probs, eps)
    train_critic(states, returns)

    print(f"Episode {episode+1}: Reward = {episode_reward}")

env.close()

        Running the script produces output similar to the following:

Episode 1: Reward = 14.0
Episode 2: Reward = 13.0
Episode 3: Reward = 9.0
...
Episode 198: Reward = 500.0
Episode 199: Reward = 500.0
Episode 200: Reward = 500.0
 

        This sample code uses the PPO algorithm to train an actor model and a critic model, collecting training data by interacting with the environment and using it to update the model parameters. In the CartPole task, the reward gradually increases and eventually plateaus at the maximum reward of 500.

7. Summary

        This article has introduced the PPO algorithm in reinforcement learning in detail, covering its background, development history, formulas, principle, features, sample code, and running results. PPO is a policy-based optimization algorithm that improves the policy by maximizing a clipped surrogate objective, and it is efficient, stable, and widely applicable. Through the sample code, readers can see a concrete implementation and how to use it. I hope this article deepens readers' understanding of the PPO algorithm and helps them apply it to practical problems.
