[Reinforcement Learning] One of the Commonly Used Algorithms: DQN

 

Author's homepage: https://blog.csdn.net/Code_and516?type=blog. A blogger focusing on algorithms, Python, computer vision, image processing, deep learning, PyTorch, neural networks, and OpenCV.

I will continue to share machine learning, deep learning, and Python-related content, daily bug fixes, and practical Windows & Linux tips.

If you find an error in the article, please point it out, and I will correct it in time. If you have other needs, you can private message me or send me an email: [email protected] 

        Reinforcement learning is a branch of machine learning in which an agent learns how to make optimal decisions through continuous trial and error while interacting with the environment. The Deep Q-Network (DQN) algorithm is one of the classic algorithms in reinforcement learning; it combines deep learning with the Q-learning algorithm in order to learn and solve a broad range of tasks.

This article will explain in detail one of the commonly used algorithms for reinforcement learning, "DQN".


  

Table of contents

1. Introduction

2. History

3. Algorithm formula

        1. Q-learning algorithm formula:

        2. Deep neural network:

        3. DQN algorithm formula:

4. Algorithm principle

5. Algorithm function

6. Example code

7. Summary


1. Introduction

        The DQN algorithm was one of the first widely used algorithms to bring deep learning into reinforcement learning. It was proposed by DeepMind's research team in 2013. By combining a deep neural network with the classic reinforcement learning algorithm Q-learning, it can handle high-dimensional and continuous state spaces and has the ability to learn and plan.

2. History

        Before the DQN algorithm was proposed, the classical approach in reinforcement learning was mainly the table-based Q-learning algorithm. These algorithms perform well on simple, low-dimensional problems, but the storage and computational cost of the tabular representation grows exponentially with the dimensionality of the state and action spaces. To solve this problem, researchers began to explore function approximation, that is, using parameterized functions instead of tables.

        After that, a series of algorithms applying deep learning to reinforcement learning were gradually developed, and the DQN algorithm is one of them. It was proposed by Mnih et al. at DeepMind in 2013 and was among the first algorithms to successfully combine deep learning with reinforcement learning. The DQN algorithm introduces techniques such as experience replay and a fixed Q target network, which greatly improve the performance of deep neural networks in reinforcement learning. Subsequently, the DQN algorithm surpassed human players on a number of Atari games, which attracted extensive attention and follow-up research.

  1. Q-learning: Q-learning is a classic algorithm in reinforcement learning, proposed by Watkins in 1989. It uses a Q table to store the value of each state-action pair and learns the optimal policy through continuous updating and exploration. However, the Q-learning algorithm does not scale to large state spaces.

  2. Deep Q-Network (DQN): The DQN algorithm was proposed by the DeepMind team in 2013. It solves the problem of a large state space by using a deep neural network to approximate the value of the Q function. The algorithm adopts two key technologies: experience replay and fixed-Q target network.

  3. Experience replay: Experience replay is one of the core ideas of the DQN algorithm. Its basic principle is to store the agent's experience in a replay memory, then randomly sample from it and use those experiences to update the model. The advantage of this is that it breaks the correlation between samples and improves the stability and convergence speed of the model (a minimal sketch of this mechanism, together with the target network from item 4, follows this list).

  4. Fixed Q target network: The DQN algorithm uses two neural networks: the main network (online network), which is used to select actions and update the model, and the target network, which is used to compute the target Q value. The parameters of the target network are kept fixed for a period of time, which reduces fluctuations in the target and improves the stability of the model.
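To make these two key techniques concrete, here is a minimal sketch (an illustration written for this explanation, not DeepMind's original code) of a replay memory and an online/target network pair in Keras. The class name, buffer capacity, layer sizes, and sync interval are all assumed placeholders:

import random
from collections import deque

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s', done) tuples with uniform random sampling."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


def build_q_net(n_states=4, n_actions=2):
    # A small MLP Q-network; the sizes are placeholders (4-dim state, 2 actions, as in CartPole)
    return Sequential([
        Dense(32, activation='relu', input_shape=(n_states,)),
        Dense(n_actions, activation='linear'),
    ])


online_net = build_q_net()                        # the main network, trained at every update step
target_net = build_q_net()                        # the target network, refreshed only periodically
target_net.set_weights(online_net.get_weights())  # start from identical weights

SYNC_EVERY = 1000  # assumed sync interval, in environment steps

def maybe_sync_target(step):
    # Copy the online weights into the otherwise frozen target network every SYNC_EVERY steps
    if step % SYNC_EVERY == 0:
        target_net.set_weights(online_net.get_weights())

In a full agent, transitions would be pushed into ReplayMemory at every step, minibatches drawn from it for training, and maybe_sync_target called with a global step counter.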

3. Algorithm formula

        The core of the DQN algorithm is the combination of the Q-learning algorithm and a deep neural network.

        1. Q-learning algorithm formula:

        The Q-learning algorithm learns the optimal policy by continuously updating the Q value. Its update formula is as follows:

        Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]

        Among them, s_t represents the current state, a_t represents the selected action, r_t represents the immediate reward, s_{t+1} represents the next state, α is the learning rate, and γ is the discount factor. 
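As an illustrative example (an assumption added for this explanation, not part of the original article), the tabular form of this update can be written directly in NumPy; the state/action counts and hyperparameters below are placeholders:

import numpy as np

n_states, n_actions = 16, 4    # assumed sizes for a small discrete task
alpha, gamma = 0.1, 0.99       # learning rate and discount factor

Q = np.zeros((n_states, n_actions))  # the Q table: one entry per (state, action) pair

def q_learning_update(s_t, a_t, r_t, s_next):
    # Move Q(s_t, a_t) toward the TD target r_t + gamma * max_a Q(s_{t+1}, a)
    td_target = r_t + gamma * np.max(Q[s_next])
    Q[s_t, a_t] += alpha * (td_target - Q[s_t, a_t])

The DQN algorithm replaces this table Q with a neural network, as described next.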

        2. Deep neural network:

        The DQN algorithm uses a deep neural network to fit the Q function. The input is the state s, and the output is the Q value of each action. Commonly used network structures are the multi-layer perceptron (MLP) and the convolutional neural network (CNN), whose parameters are optimized through training. The output size of the network equals the dimension of the action space.
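For image-based tasks such as Atari games, the MLP used in the CartPole example later in this article would be replaced by a convolutional network. A hedged sketch of such a CNN Q-network is shown below; the layer sizes roughly follow the published DQN architecture but should be treated as assumptions here:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Flatten, Dense

def create_cnn_q_model(n_actions, input_shape=(84, 84, 4)):
    # Input: a stack of preprocessed game frames; output: one Q value per action
    model = Sequential([
        Conv2D(32, 8, strides=4, activation='relu', input_shape=input_shape),
        Conv2D(64, 4, strides=2, activation='relu'),
        Conv2D(64, 3, strides=1, activation='relu'),
        Flatten(),
        Dense(512, activation='relu'),
        Dense(n_actions, activation='linear'),  # output size equals the action-space dimension
    ])
    model.compile(loss='mse', optimizer='adam')
    return model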

        3. DQN algorithm formula:

        The DQN algorithm updates the model by minimizing the mean squared error loss of the Q function. Its loss function is as follows:

        L(θ) = E[ ( r_t + γ max_a Q(s_{t+1}, a; θ⁻) - Q(s_t, a_t; θ) )^2 ]

        Among them, θ is the parameter of the main network, θ⁻ is the parameter of the target network, and Q(s_{t+1}, a; θ⁻) represents the output of the target network. 
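In code, minimizing this loss amounts to regressing the online network's Q(s_t, a_t) toward the TD target computed with the frozen target network. A minimal sketch follows; the function and variable names are assumptions, and online_net / target_net stand for compiled Keras models with an MSE loss:

import numpy as np

GAMMA = 0.99  # assumed discount factor

def dqn_train_step(online_net, target_net, states, actions, rewards, next_states, dones):
    # dones is a 0/1 float array; terminal transitions keep only the immediate reward
    next_q = target_net.predict_on_batch(next_states)
    targets = rewards + GAMMA * np.max(next_q, axis=1) * (1.0 - dones)

    # Regress only the Q value of the action actually taken toward its target (MSE loss)
    q_values = np.array(online_net.predict_on_batch(states))
    q_values[np.arange(len(actions)), actions] = targets
    online_net.train_on_batch(states, q_values)

The CartPole example in section 6 performs essentially this computation, but with a single network playing both roles.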

4. Algorithm principle

        The principle of the DQN algorithm is to use a deep neural network to approximate the Q function, which makes it possible to handle high-dimensional and continuous state spaces. Its core idea is to learn the optimal policy by continuously updating the parameters of the neural network so that the output Q value approaches the true Q value.

The DQN algorithm works as follows (a schematic sketch of the whole loop is given after the list):

  1. Initialization: Initialize the parameters of the main network and the target network.

  2. Select an action: According to the current state s, use the ε-greedy strategy to select an action a.

  3. Perform the action and observe the reward: take action a, interact with the environment, and observe the next state s' and the immediate reward r.

  4. Store experience: store (s, a, r, s') in the experience replay memory.

  5. Randomly sample from the experience replay memory: randomly sample a batch of experiences from the memory.

  6. Calculate the target Q value: use the target network to compute the target Q value, that is, y = r + γ max_a Q(s', a; θ⁻) (or y = r if s' is terminal).

  7. Update the main network: update the model parameters according to the loss function L(θ).

  8. Update the target network: periodically copy the parameters of the main network into the target network.

  9. Repeat steps 2-8 until the termination condition is met.
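Putting steps 1-9 together, the training loop has roughly the following shape. This is a schematic sketch with assumed hyperparameters; unlike the runnable CartPole example in section 6 below, it keeps a separate target network (step 8), which that simpler example omits:

import random
from collections import deque

import gym
import numpy as np
from tensorflow.keras.models import Sequential, clone_model
from tensorflow.keras.layers import Dense

GAMMA, EPSILON, SYNC_EVERY, BATCH = 0.99, 0.1, 500, 32  # assumed hyperparameters

env = gym.make('CartPole-v0')  # old gym API, as in the example code below
n_states, n_actions = env.observation_space.shape[0], env.action_space.n

# Step 1: initialize the main (online) network and a frozen copy as the target network
online = Sequential([Dense(32, activation='relu', input_shape=(n_states,)),
                     Dense(n_actions, activation='linear')])
online.compile(loss='mse', optimizer='adam')
target = clone_model(online)
target.set_weights(online.get_weights())

memory, step = deque(maxlen=10000), 0
for episode in range(100):
    s = np.reshape(env.reset(), [1, n_states])
    done = False
    while not done:
        # Step 2: epsilon-greedy action selection
        if random.random() < EPSILON:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(online.predict(s, verbose=0)[0]))
        # Step 3: act and observe the next state and reward
        s2, r, done, _ = env.step(a)
        s2 = np.reshape(s2, [1, n_states])
        # Step 4: store the transition
        memory.append((s, a, r, s2, done))
        s, step = s2, step + 1
        if len(memory) >= BATCH:
            # Step 5: sample a random minibatch
            batch = random.sample(memory, BATCH)
            states = np.concatenate([b[0] for b in batch])
            next_states = np.concatenate([b[3] for b in batch])
            rewards = np.array([b[2] for b in batch], dtype=np.float32)
            dones = np.array([b[4] for b in batch], dtype=np.float32)
            actions = np.array([b[1] for b in batch])
            # Step 6: targets come from the target network, not the online network
            y = rewards + GAMMA * np.max(target.predict(next_states, verbose=0), axis=1) * (1 - dones)
            # Step 7: fit the online network toward the targets for the chosen actions
            q = online.predict(states, verbose=0)
            q[np.arange(BATCH), actions] = y
            online.train_on_batch(states, q)
            # Step 8: periodically copy the online weights into the target network
            if step % SYNC_EVERY == 0:
                target.set_weights(online.get_weights())
env.close()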

5. Algorithm function

        The DQN algorithm has the following functions:

  1. Dealing with high-dimensional, continuous state spaces: Through the approximation capabilities of deep neural networks, problems in high-dimensional, continuous state spaces can be handled.

  2. Learning and planning ability: Through interaction with the environment and continuous trial and error, the DQN algorithm can learn the optimal strategy and have a certain planning ability.

  3. High stability and convergence speed: The DQN algorithm improves the stability and convergence speed of the model through techniques such as experience replay and a fixed Q target network.

6. Example code

        The following is sample code that uses the DQN algorithm to solve the classic CartPole problem (for simplicity, this example uses a single network rather than a separate target network):

# -*- coding: utf-8 -*-
import random

import gym
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

env = gym.make('CartPole-v0')
n_actions = env.action_space.n
n_states = env.observation_space.shape[0]

def create_dqn_model():
    # A small MLP that maps a state vector to one Q value per action
    model = Sequential()
    model.add(Dense(32, input_shape=(n_states,), activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(n_actions, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))
    return model

def choose_action(state, epsilon):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily on the predicted Q values
    if np.random.rand() < epsilon:
        return np.random.choice(n_actions)
    else:
        q_values = model.predict(state)
        return np.argmax(q_values[0])

def train_dqn():
    epsilon = 1.0
    epsilon_min = 0.01
    epsilon_decay = 0.995
    batch_size = 32
    replay_memory = []
    for episode in range(500):
        state = env.reset()
        state = np.reshape(state, [1, n_states])
        done = False
        steps = 0

        while not done:
            env.render()
            action = choose_action(state, epsilon)
            next_state, reward, done, _ = env.step(action)
            next_state = np.reshape(next_state, [1, n_states])
            replay_memory.append((state, action, reward, next_state, done))
            state = next_state
            steps += 1

            if done:
                print("Episode: %d, Steps: %d" % (episode, steps))
                break
            # Train once enough transitions have been collected
            if len(replay_memory) > batch_size:
                # np.random.choice cannot sample a list of tuples, so use random.sample for the minibatch
                minibatch = random.sample(replay_memory, batch_size)
                states_mb = np.concatenate([mb[0] for mb in minibatch])
                actions_mb = np.array([mb[1] for mb in minibatch])
                rewards_mb = np.array([mb[2] for mb in minibatch])
                next_states_mb = np.concatenate([mb[3] for mb in minibatch])
                dones_mb = np.array([mb[4] for mb in minibatch])

                # TD target: r + gamma * max_a Q(s', a); terminal transitions keep only the reward
                targets = rewards_mb + 0.99 * np.amax(model.predict_on_batch(next_states_mb), axis=1) * (1 - dones_mb)
                targets_full = model.predict_on_batch(states_mb)
                # Only the Q value of the action actually taken is moved toward the target
                targets_full[np.arange(batch_size), actions_mb] = targets

                model.fit(states_mb, targets_full, epochs=1, verbose=0)

            if epsilon > epsilon_min:
                epsilon *= epsilon_decay

    env.close()

if __name__ == '__main__':

    model = create_dqn_model()

    train_dqn()

        Running result: 

Episode: 0, Steps: 14
Episode: 1, Steps: 26
Episode: 2, Steps: 16
Episode: 3, Steps: 12
Episode: 4, Steps: 12
...
Episode: 498, Steps: 160
Episode: 499, Steps: 200

 

        By running the above code, you can see the performance of the DQN algorithm on the CartPole problem. After multiple episodes of training, the algorithm can keep the pole balanced for longer and eventually reaches a higher score. 

7. Summary

        This article has explained the DQN algorithm in detail, including its development history, formulas and principles, capabilities, and sample code with usage. By combining deep learning with the Q-learning algorithm, DQN can handle high-dimensional and continuous state spaces and has the ability to learn and plan. From the running results of the sample code, we can see that the DQN algorithm achieves good results on the CartPole problem. However, the DQN algorithm also has limitations, such as training instability and sample correlation. Future research can further improve the algorithm and apply it to a wider range of task domains.

 

 

 

Original article: https://blog.csdn.net/Code_and516/article/details/131449240