Deep reinforcement learning - DQN algorithm principle

1. What is the DQN algorithm

DQN, or Deep Q-Network, is the Q-Learning algorithm implemented with deep learning.

For a review of Q-Learning, see the earlier article: Reinforcement Learning: Q-Learning Algorithm Principle.
The Q-Learning algorithm maintains a Q-table, which stores the reward obtained by taking action a in each state s, i.e., the action-value function Q(s, a). This algorithm has a major limitation: in many real-world reinforcement learning tasks the state space is continuous, so there are infinitely many states, and the value function can no longer be stored as a table.
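For reference, here is a minimal sketch of the tabular case (the states, actions, learning rate, and discount factor below are made-up illustrations, not values from the original article):

```python
from collections import defaultdict

Q = defaultdict(float)      # Q-table: maps (state, action) -> estimated action value
alpha, gamma = 0.1, 0.99    # learning rate and discount factor (illustrative)

def q_learning_update(s, a, r, s_next, actions):
    """One tabular Q-learning update; infeasible when the state space is continuous."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

q_learning_update(s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
```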

To solve this problem, we can use a function Q(s, a; w) to approximate the action-value function Q(s, a); this is called value function approximation. We use a neural network to represent this function Q(s, a; w), called the Q-network (Deep Q-Network), where w denotes the trainable parameters of the network.
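As a concrete illustration, here is a minimal sketch of such a Q-network in PyTorch (the state dimension, number of actions, and layer sizes are arbitrary assumptions for the example):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a; w): takes a state, outputs one Q value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per discrete action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a 4-dimensional state and 2 discrete actions.
q_net = QNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.randn(1, 4))         # shape (1, 2): Q(s, a; w) for each action
```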

2. DQN training process

The input of the neural network is the state s, and the output is a score (a Q value) for every action a.
(Figure: the Q-network maps a state to one score per action. Source: Shusen Wang, Deep Reinforcement Learning course)

Training the neural network is an optimization problem: we express the difference between the network output and the target (label) value as a loss function, and the goal is to minimize this loss by updating the network parameters with gradient descent via backpropagation.

So what is the label value/target value of the Q network?
It is the TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; w)$
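For intuition, a quick numeric example with made-up values: suppose $r_t = 1$, $\gamma = 0.9$, and the Q-network's estimates for the actions at $s_{t+1}$ are $2.0$, $3.0$, and $2.5$. Then
$y_t = 1 + 0.9 \cdot \max\{2.0, 3.0, 2.5\} = 1 + 0.9 \cdot 3.0 = 3.7$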

We first introduce the most basic DQN algorithm, and then add techniques such as experience replay and the target network.
The specific procedure is as follows (a code sketch of one update step follows the list):

1. Initialize the network; input the state $s_t$ and output the Q values of all actions in $s_t$;

2. Using a policy (e.g., $\varepsilon$-greedy), select an action $a_t$; feed $a_t$ into the environment and obtain the new state $s_{t+1}$ and the reward $r_t$;

3. Compute the TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; w)$;

4. Compute the loss: $L = \frac{1}{2}[y_t - Q(s_t, a_t; w)]^2$;

5. Update the parameters w so that $Q(s_t, a_t; w)$ moves as close as possible to $y_t$; this can be treated as a regression problem and solved with gradient descent;

6. These steps produce a quadruple (transition) $(s_t, a_t, r_t, s_{t+1})$, which is discarded after use;

7. Observe the new state and repeat the update.
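A minimal sketch of steps 1-5 in PyTorch; the environment interaction is faked with random tensors, and the network shape, learning rate, and other hyperparameters are illustrative assumptions:

```python
import random
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s, a; w)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma, epsilon = 0.99, 0.1

# Fake one environment step (in practice s_{t+1} and r_t come from the environment).
s_t = torch.randn(1, 4)
with torch.no_grad():
    q_t = q_net(s_t)                             # Q values of all actions in s_t
if random.random() < epsilon:                    # epsilon-greedy action selection
    a_t = random.randrange(q_t.shape[1])
else:
    a_t = q_t.argmax(dim=1).item()
r_t = 1.0                                        # reward returned by the environment
s_next = torch.randn(1, 4)                       # next state returned by the environment

# TD target: y_t = r_t + gamma * max_a Q(s_{t+1}, a; w)
with torch.no_grad():
    y_t = r_t + gamma * q_net(s_next).max()

# Loss: L = 1/2 * (y_t - Q(s_t, a_t; w))^2, then one gradient-descent step on w.
loss = 0.5 * (y_t - q_net(s_t)[0, a_t]) ** 2
optimizer.zero_grad()
loss.backward()
optimizer.step()
```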

(Figure source: [Zhihu, Zhang Sijun] https://zhuanlan.zhihu.com/p/110620815)

3. Experience Replay

Before introducing experience replay, let's look at the shortcomings of the original DQN algorithm:
1. A transition $(s_t, a_t, r_t, s_{t+1})$ is discarded after a single use, which wastes experience;
2. Transitions were consumed in the order they were collected, and consecutive transitions are highly correlated, which is harmful to learning the Q-network.

Experience replay can overcome the above two shortcomings:
1. It breaks up the sequence and removes the correlation, so that the data are closer to independent and identically distributed, which reduces the variance of parameter updates and speeds up convergence.
2. Experience can be reused, so data utilization is high, which is especially valuable when data are expensive to collect.
In reinforcement learning, the most time-consuming step is usually interaction with the environment, while training the network is comparatively fast (especially on a GPU). Using a replay buffer reduces the number of interactions with the environment: the experience does not have to come from one particular policy, and transitions collected by past policies can be stored in the buffer and reused many times.

Experience replay maintains a replay buffer that stores the n most recent transitions, called the experience.
Some policy $\pi$ interacts with the environment, collects many transitions, and puts them into the replay buffer; the transitions in the buffer may therefore come from different policies.
The replay buffer only discards the oldest data once it is full.
(Figure source: Shusen Wang, Deep Reinforcement Learning course)

Each time, a batch of transitions is drawn at random from the buffer to train the network: the stochastic gradients are computed on these samples and their average is used to update the Q-network parameters w. A minimal buffer sketch follows.
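A minimal sketch of such a replay buffer, assuming a fixed capacity and uniform random sampling (the capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions (s_t, a_t, r_t, s_{t+1}) and samples them uniformly at random."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are dropped when full

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)   # random draw breaks up correlation
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```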

Improvements to experience replay:
Prioritized Experience Replay: the difference is that transitions are sampled non-uniformly (for example, in proportion to the magnitude of their TD error) rather than uniformly. We will not go into the details here.

4. Target Network

Why use a target network?
I quote the description of https://blog.csdn.net/weixin_46133643/article/details/121845874 :

When we train the network, the action-value estimate depends on the weights w: when the weights change, the estimate of the action value changes as well. During learning, the action values are then chasing a target that is itself moving, which easily leads to instability.

I have seen similar descriptions in several books. Although intuitively this kind of training does seem unstable, I do not fully understand the concrete manifestations of this instability or the rigorous logic behind it.

The target-network part of Shusen Wang's course video is relatively clear; it also covers the overestimation problem.

1. Bootstrapping

Let's first introduce the concept of bootstrapping:
Literally, "bootstrapping" alludes to the story of Baron Munchausen, who supposedly lifted himself out of trouble by pulling on his own bootstraps, i.e., raising yourself by your own straps.
In reinforcement learning, bootstrapping means updating the estimate for the current state using a subsequent estimate.
The TD target we compute is $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; w)$.
Here $r_t$ is obtained from an actual observation, while $\max_a Q(s_{t+1}, a; w)$ is an estimate made by the Q-network at $s_{t+1}$.
So part of $y_t$ is itself an estimate produced by the Q-network, and we use $y_t$ to update that same Q-network; this is bootstrapping.

When we compute the TD target, we also maximize over the Q values: $\max_a Q(s_{t+1}, a; w)$.
Both this maximization and the bootstrapping described above cause overestimation. Using a target network avoids bootstrapping to a certain extent and mitigates the overestimation problem; the detailed analysis is not covered here.

2. Target network:

The target network was proposed in Mnih et al., "Human-level control through deep reinforcement learning", Nature, 2015: https://www.nature.com/articles/nature14236/

We use a second network, called the target network, $Q(s, a; w^-)$. Its structure is identical to that of the original network $Q(s, a; w)$, but its parameters are different: $w^- \neq w$. The original network is called the evaluation network.

The two networks play different roles: the evaluation network $Q(s, a; w)$ controls the agent and collects experience, while the target network $Q(s, a; w^-)$ is used to compute the TD target: $y_t = r_t + \gamma \cdot \max_a Q(s_{t+1}, a; w^-)$

During the update, only the evaluation network $Q(s, a; w)$ is updated, while the weights $w^-$ of the target network remain unchanged. After a certain number of updates, the latest weights of the evaluation network are copied into the target network for the next round of updates, so that the target network is also refreshed over time. Introducing the target network makes learning more stable, because the TD target stays relatively fixed during the period in which the target network does not change, as the sketch below shows.
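A minimal sketch of this two-network setup in PyTorch (the network shape and the synchronization interval are arbitrary assumptions):

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # evaluation network Q(s, a; w)
target_net = copy.deepcopy(q_net)                                      # target network Q(s, a; w^-)
gamma, sync_every = 0.99, 100

def td_target(r: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
    # The TD target is computed with the frozen target network, not the evaluation network.
    with torch.no_grad():
        return r + gamma * target_net(s_next).max(dim=1).values

for step in range(1, 1001):
    # ... sample a batch, compute the loss against td_target(...), and update q_net here ...
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())   # periodically copy w into w^-
```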

(Figure source: cnblogs, jsfantasy, "Reinforcement Learning 7: Deep Q-Learning (DQN) formula derivation")

5. Double DQN

Introducing the target network alleviates overestimation to some extent, but the maximization operation is still there, so overestimation can remain severe. Double DQN alleviates the overestimation problem further (though it does not eliminate it completely).

The improvement made by Double DQN is actually very simple:

We use the original (evaluation) network $Q(s, a; w)$ to select the action that maximizes the Q value at $s_{t+1}$, denoted $a^* = \arg\max_a Q(s_{t+1}, a; w)$, and then use the target network together with this $a^*$ to compute the target value:
$y_t = r_t + \gamma \cdot Q(s_{t+1}, a^*; w^-)$
Since
$Q(s_{t+1}, a^*; w^-) \leq \max_a Q(s_{t+1}, a; w^-)$

the overestimation introduced by the maximization is further mitigated, as the sketch below illustrates.
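A minimal sketch of the Double DQN target computation, again assuming an evaluation network and a target network with the shapes used in the earlier sketches:

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # Q(s, a; w)
target_net = copy.deepcopy(q_net)                                      # Q(s, a; w^-)
gamma = 0.99

def double_dqn_target(r: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # select a* with the evaluation network
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluate a* with the target network
        return r + gamma * q_next                                 # y_t = r_t + gamma * Q(s_{t+1}, a*; w^-)

# Example with a batch of 3 fake transitions.
y = double_dqn_target(torch.tensor([1.0, 0.0, 1.0]), torch.randn(3, 4))
```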

(Figure source: Shusen Wang, Deep Reinforcement Learning course)

6. Summary

pseudocode:

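The original pseudocode figure is not reproduced here. As a substitute, the following is a minimal Python sketch of the full training loop combining everything above (experience replay plus a target network); the fake environment, network shape, and all hyperparameters are illustrative assumptions, not the settings of the original paper:

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn

state_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = copy.deepcopy(q_net)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)
gamma, epsilon, batch_size, sync_every = 0.99, 0.1, 32, 100

def env_step(state, action):
    """Stand-in for a real environment; returns a random next state and reward."""
    return torch.randn(state_dim), random.random()

state = torch.randn(state_dim)
for step in range(1, 501):
    # 1. Select an action with an epsilon-greedy policy on the evaluation network.
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        with torch.no_grad():
            action = q_net(state.unsqueeze(0)).argmax(dim=1).item()

    # 2. Interact with the environment and store the transition in the replay buffer.
    next_state, reward = env_step(state, action)
    buffer.append((state, action, reward, next_state))
    state = next_state

    # 3. Sample a random minibatch and take one gradient step on the evaluation network.
    if len(buffer) >= batch_size:
        s, a, r, s_next = zip(*random.sample(buffer, batch_size))
        s, s_next = torch.stack(s), torch.stack(s_next)
        a, r = torch.tensor(a), torch.tensor(r)
        with torch.no_grad():
            y = r + gamma * target_net(s_next).max(dim=1).values    # TD target from the target network
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s_t, a_t; w)
        loss = 0.5 * ((y - q_sa) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # 4. Periodically copy the evaluation network's weights w into the target network's w^-.
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```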

Overall, the deep Q-network is very similar to Q-learning in its target value and in how the value is updated. The main differences are: the deep Q-network combines Q-learning with deep learning and uses a deep network to approximate the action-value function, whereas Q-learning stores it in a table; and the deep Q-network trains with experience replay, sampling randomly from historical data, whereas Q-learning learns directly from the most recent transition.


Reference:
[1] https://www.bilibili.com/video/BV1rv41167yx?p=10&vd_source=a433a250e74c87c3235dea6a203f8a29
[2] Wang Qi. Reinforcement Learning Tutorial [M]
[3] https://zhuanlan.zhihu.com/p/110620815
Figures in this article are from Baidu PaddlePaddle AI Studio, Shusen Wang's Deep Reinforcement Learning course, and other sources.
