Brief description of DQN (deep Q-network) algorithm

This article gives a brief introduction to the DQN (deep Q-network) algorithm in deep reinforcement learning, based on notes from Li Hongyi's machine learning course.

Links to Li Hongyi's course videos on Bilibili:
Li Hongyi, deep reinforcement learning, Q-learning, basic idea
Li Hongyi, deep reinforcement learning, Q-learning, advanced tips
Li Hongyi, deep reinforcement learning, Q-learning, continuous action

Related Notes:
Policy Gradient Algorithm Brief
Proximal Policy Optimization Algorithm Brief
Actor-Critic Algorithm Brief


1. Basic concepts


DQN is a value-based method rather than a policy-based method. What is learned is not a policy but a critic. The critic does not take actions itself; it evaluates how good or bad actions are.

1.1 State value function


One critic is the state value function $V^{\pi}(s)$: when interacting with the environment using policy $\pi$, it is the expected cumulative reward obtained from state $s$ until the end of the episode. Different policies yield different rewards even from the same state, so $V^{\pi}(s)$ depends on $\pi$.

In addition, since it is impossible to enumerate all states, $V^{\pi}(s)$ is implemented as a network, and training it is a regression problem.

Specifically, there are two different methods for estimating the state value function: the Monte Carlo (MC) approach and the temporal-difference (TD) approach.

The Monte Carlo approach lets the actor interact with the environment; the training objective is to make $V^{\pi}(s)$ for each visited state $s$ as close as possible to the cumulative reward $G_s$ observed from that state onward.
(Figure: MC-based approach)

However, the Monte Carlo approach has to play each game to the end before the network can be updated, and some games take a long time, so the temporal-difference approach is often used instead. It does not need to play the game to the end; it makes the difference between the value function at adjacent states, $V^{\pi}(s_t) - V^{\pi}(s_{t+1})$, as close as possible to the reward $r_t$ received in between:
(Figure: TD approach)
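To make the two objectives concrete, here is a minimal PyTorch sketch (not from the lecture) of how the MC and TD regression targets for $V^{\pi}(s)$ could be formed; the network size, state dimension, and discount factor are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# A small value network V(s); the 4-dimensional state is an arbitrary assumption.
v_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

def mc_loss(states, returns):
    """MC: regress V(s_t) toward the observed cumulative reward G_t."""
    v = v_net(states).squeeze(-1)
    return nn.functional.mse_loss(v, returns)

def td_loss(states, rewards, next_states, gamma=0.99):
    """TD: push V(s_t) toward r_t + gamma * V(s_{t+1})."""
    v = v_net(states).squeeze(-1)
    with torch.no_grad():                      # treat V(s_{t+1}) as a fixed target
        v_next = v_net(next_states).squeeze(-1)
    return nn.functional.mse_loss(v, rewards + gamma * v_next)
```

The MC target needs the full return $G_s$, which is only known after the episode ends, while the TD target only needs a single transition.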

Since the game itself may be random, the reward is a random variable and its variance affects how well the algorithm works. The Monte Carlo approach uses the cumulative reward $G_s$, whose variance is large; the temporal-difference approach uses a single-step reward $r_t$, whose variance is much smaller, but it has the problem that $V^{\pi}(s_{t+1})$ may be inaccurate, which can hurt the learning result. In practice the temporal-difference approach is much more common than the Monte Carlo approach.
(Figure: MC vs. TD 1)

In addition, the two methods may produce different estimates from the same data. For example:
(Figure: MC vs. TD 2)
In the example above, if $s_a$ really influences $s_b$, i.e. it is no coincidence that $s_b$ yields no reward whenever it is seen after $s_a$, then the Monte Carlo estimate is the reasonable one; if $s_a$ appearing before $s_b$ was just a coincidence, then the temporal-difference estimate is the reasonable one.


1.2 State-action value function (Q function)


Another critic is the state-action value function, also called the Q function, $Q^{\pi}(s, a)$: the expected cumulative reward obtained by taking action $a$ in state $s$ and then following policy $\pi$ afterwards.

Note that policy $\pi$ would not necessarily choose action $a$ in state $s$; the Q function forces action $a$ to be taken first and then lets $\pi$ continue, which is exactly what $Q^{\pi}(s, a)$ means. There are two common ways to implement the Q function as a network:
(Figure: two ways of writing the Q function)

As long as you have the Q function, you can do reinforcement learning. The flow chart is as follows:
(Figure: flow chart)

For any state $s$, the new policy is:

$$\pi^{\prime}(s) = \arg \max_a Q^{\pi}(s, a)$$

Therefore, there is no separate network for the policy $\pi^{\prime}$; $\pi^{\prime}$ is derived directly from the Q function.

The following shows why the policy $\pi^{\prime}$ derived from the Q function is better than $\pi$.

Here "better" means that for every state the state value function is at least as large. The derivation is as follows:

$$V^{\pi}(s) = Q^{\pi}(s, \pi(s)) \leq \max_a Q^{\pi}(s, a) = Q^{\pi}(s, \pi^{\prime}(s))$$

$$
\begin{aligned}
V^{\pi}(s) \leq Q^{\pi}(s, \pi^{\prime}(s)) &= E[r_t + V^{\pi}(s_{t+1}) \mid s_t = s, a_t = \pi^{\prime}(s_t)] \\
&\leq E[r_t + Q^{\pi}(s_{t+1}, \pi^{\prime}(s_{t+1})) \mid s_t = s, a_t = \pi^{\prime}(s_t)] \\
&= E[r_t + r_{t+1} + V^{\pi}(s_{t+2}) \mid \ldots] \\
&\leq E[r_t + r_{t+1} + Q^{\pi}(s_{t+2}, \pi^{\prime}(s_{t+2})) \mid \ldots] \\
&= \cdots \leq V^{\pi^{\prime}}(s)
\end{aligned}
$$
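As a minimal sketch of how $\pi^{\prime}$ is obtained in code from a learned Q network (the network and the state/action dimensions are hypothetical):

```python
import torch
import torch.nn as nn

n_actions = 3                                   # assumed number of discrete actions
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, n_actions))

def pi_prime(state):
    """pi'(s) = argmax_a Q(s, a); no separate policy network is needed."""
    with torch.no_grad():
        q_values = q_net(state)                 # one Q value per action
    return int(q_values.argmax())

print(pi_prime(torch.zeros(4)))
```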


1.3 Training techniques


Here are a few techniques that will definitely be used in DQN.


1.3.1 Target network


The first trick is the target network.

According to the Q function:
$$Q^{\pi}(s_t, a_t) = r_t + Q^{\pi}(s_{t+1}, \pi(s_{t+1}))$$

Among them, the left side of the equal sign is the output of the network, and the right side is the target, but because the target contains the Q function, the target is always changing, which will bring difficulties to the training.

The solution is to fix one of the two Q networks (usually the one on the right-hand side of the equation, called the target network) and minimize the mean squared error between the model output and the target. After the Q network on the left-hand side has been updated several times, its parameters are copied into the target network, and the iteration continues. As shown below:
(Figure: target network)
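A minimal sketch of the target-network trick (the network sizes, learning rate, and update period are arbitrary assumptions):

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
target_net = copy.deepcopy(q_net)               # frozen copy used only for targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def update(states, actions, rewards, next_states, gamma=0.99):
    """One regression step: only q_net moves, the target stays fixed."""
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Called every C updates: copy q_net into the target network."""
    target_net.load_state_dict(q_net.state_dict())
```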

1.3.2 Exploration


The second technique is exploration.

If some actions have never been tried in a given state, and the action that happens to be tried receives a positive reward, the agent will keep choosing that same action whenever the state occurs again and will never explore the others:
(Figure: exploration, background)

This problem is the exploration-exploitation dilemma.

There are two common solutions: $\epsilon$-greedy and Boltzmann exploration.

The $\epsilon$-greedy method is as follows:
(Figure: epsilon greedy)
The formula for Boltzmann exploration is:

$$P(a \mid s) = \frac{e^{Q(s, a)}}{\sum_a e^{Q(s, a)}}$$

This method is a bit like the policy gradient: a probability distribution over actions is defined from the Q function, and the larger the Q value, the higher the probability of taking that action. Because of the exponential, no action has zero probability, so actions with small Q values can still be tried occasionally.
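A small sketch of the two exploration schemes, assuming `q_values` is the vector of Q values for the current state (the $\epsilon$ value is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values):
    """Sample an action with probability exp(Q(s, a)) / sum_a exp(Q(s, a))."""
    q = np.asarray(q_values, dtype=np.float64)
    q -= q.max()                                # for numerical stability
    probs = np.exp(q) / np.exp(q).sum()
    return int(rng.choice(len(q_values), p=probs))

print(epsilon_greedy([0.1, 0.5, 0.2]), boltzmann([0.1, 0.5, 0.2]))
```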

1.3.3 Experience replay


The third technique is experience replay, as shown in the figure below:

(Figure: experience replay)

Experience replay has two benefits:

First, in practical reinforcement learning, the most time-consuming step is usually interacting with the environment, while training the network is comparatively fast (especially on a GPU). A replay buffer reduces the number of interactions with the environment, because the experience used for training does not all have to come from the current policy: experience collected by past policies can be reused many times from the buffer, so sample efficiency is much better.

Second, when training the network, we hope that the more diverse the data in a batch, the better. If the data are of the same nature, the performance will be poor. If the experience in the data cache area comes from different strategies, it is easy to satisfy diversity.

Note that the experience in the buffer may have been collected by a policy $\pi$ different from the current one, but this does not matter. The reason is that each sampled experience is a single transition at one state, not a whole trajectory, so it is not affected by the off-policy issue.
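A minimal replay buffer sketch (the capacity and batch size are arbitrary assumptions); each stored entry is a single transition, matching the point above:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)    # oldest experience is dropped first

    def push(self, state, action, reward, next_state, done):
        # Store one transition, not a whole trajectory.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```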


1.4 Algorithm process


(Figure: algorithm process)
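Putting the pieces above together, here is one possible compact DQN training loop. This is a sketch rather than the exact procedure in the figure; it assumes the Gymnasium CartPole environment is available, and all hyperparameters are arbitrary.

```python
import copy
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
n_obs, n_act = env.observation_space.shape[0], env.action_space.n

q_net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act))
target_net = copy.deepcopy(q_net)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=10_000)
gamma, epsilon, batch_size = 0.99, 0.1, 32

state, _ = env.reset()
for step in range(5_000):
    # Epsilon-greedy interaction with the environment.
    if random.random() < epsilon:
        action = int(env.action_space.sample())
    else:
        with torch.no_grad():
            action = int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
    next_state, reward, terminated, truncated, _ = env.step(action)
    buffer.append((state, action, reward, next_state, float(terminated)))
    state = next_state if not (terminated or truncated) else env.reset()[0]

    # One regression step against the fixed target network.
    if len(buffer) >= batch_size:
        batch = random.sample(buffer, batch_size)
        s, a, r, s2, d = (torch.as_tensor(np.array(x), dtype=torch.float32)
                          for x in zip(*batch))
        q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + gamma * (1 - d) * target_net(s2).max(dim=1).values
        loss = nn.functional.mse_loss(q, target)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    if step % 100 == 0:                          # periodically sync the target network
        target_net.load_state_dict(q_net.state_dict())
```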


2. Advanced tips


2.1 double DQN


In the original DQN algorithm, because of estimation error in the network, the max operator tends to repeatedly pick actions whose values are overestimated, so the Q value is often overestimated.

To address this problem, two networks are used at the same time: the network $Q$ that is being updated is used to select the action, and the fixed network $Q^\prime$ (the target network) is used to evaluate the Q value. This is double DQN:

$$Q(s_t, a_t) = r_t + Q^\prime(s_{t+1}, \arg \max_a Q(s_{t+1}, a))$$

(Figure: double DQN)
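The only change double DQN needs is in how the target is computed; a sketch, assuming `q_net` and `target_net` are defined as in the loop above:

```python
import torch

def double_dqn_target(q_net, target_net, rewards, next_states, gamma=0.99):
    """Q selects the action, the fixed Q' evaluates it."""
    with torch.no_grad():
        best = q_net(next_states).argmax(dim=1, keepdim=True)          # argmax_a Q(s', a)
        q_prime = target_net(next_states).gather(1, best).squeeze(1)   # Q'(s', argmax)
    return rewards + gamma * q_prime
```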

References:
Hado V. Hasselt, “Double Q-learning”, NIPS 2010
Hado V. Hasselt, Arthur Guez, David Silver, “Deep Reinforcement Learning with Double Q-learning”, AAAI 2016


2.2 dueling DQN


The only difference between dueling DQN and the original DQN is the network architecture:

(Figure: dueling DQN I)

The benefit of this architecture is that sometimes the Q values of several actions can be improved at once simply by updating $V(s)$, without touching $A(s, a)$.
To encourage the network to update $V(s)$ rather than $A(s, a)$, some constraints are placed on $A(s, a)$, for example forcing its mean over actions to be zero.
(Figure: dueling DQN II)
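A sketch of the dueling architecture, with the zero-mean constraint on $A(s, a)$ applied when assembling $Q = V + A$ (the layer sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, n_obs=4, n_act=3):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)            # V(s): one scalar per state
        self.advantage = nn.Linear(64, n_act)    # A(s, a): one value per action

    def forward(self, x):
        h = self.features(x)
        v = self.value(h)
        a = self.advantage(h)
        # Subtracting the mean forces A to sum to zero over actions, so shared
        # changes to all actions' Q values have to go through V(s).
        return v + a - a.mean(dim=-1, keepdim=True)

print(DuelingQNet()(torch.zeros(1, 4)).shape)    # torch.Size([1, 3])
```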

References:
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas, “Dueling Network Architectures for Deep Reinforcement Learning”, arXiv preprint, 2015


2.3 Prioritized replay


Transitions for which the gap between the network output and the target (the TD error) is large are given a higher probability of being sampled; this is their priority.

In fact, prioritized replay does not only change the sampling process: because the sampling distribution is changed, the way the parameters are updated has to change as well. So it does not merely change the distribution of the sampled data, it also changes the training process.

(Figure: prioritized replay)
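A minimal sketch of proportional prioritized sampling with importance-sampling weights (a plain list instead of the sum-tree used in the paper; `alpha` and `beta` follow the paper's notation, but the values here are arbitrary):

```python
import numpy as np

class PrioritizedBuffer:
    def __init__(self, capacity=10_000, alpha=0.6):
        self.capacity, self.alpha = capacity, alpha
        self.data, self.priorities = [], []

    def push(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size=32, beta=0.4):
        p = np.array(self.priorities); p = p / p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        # Importance-sampling weights correct the parameter update for the
        # non-uniform sampling, which is the change to training mentioned above.
        weights = (len(self.data) * p[idx]) ** (-beta)
        weights = weights / weights.max()
        return [self.data[i] for i in idx], idx, weights
```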

References:
Tom Schaul, John Quan, Ioannis Antonoglou, David Silver, "Prioritized Experience Replay", ICLR 2016


2.4 Multi-step sampling (multi-step)


Through continuous multi-step sampling, a balance can be achieved between Monte Carlo-based methods and temporal difference-based methods:
$$Q(s_t, a_t) = \sum_{t^\prime=t}^{t+N} r_{t^\prime} + \hat{Q}(s_{t+N+1}, a_{t+N+1})$$

(Figure: multi-step)
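A small sketch of building the N-step target from a stored segment of N+1 rewards; a discount factor is included here even though the formula above omits it (set `gamma=1.0` to match the formula exactly):

```python
def n_step_target(rewards, q_hat_bootstrap, gamma=1.0):
    """rewards = [r_t, ..., r_{t+N}]; q_hat_bootstrap = Q_hat(s_{t+N+1}, a_{t+N+1})."""
    target, discount = 0.0, 1.0
    for r in rewards:
        target += discount * r
        discount *= gamma
    return target + discount * q_hat_bootstrap

# Example with N = 2 (three rewards) and a bootstrapped value of 2.0.
print(n_step_target([1.0, 0.0, 1.0], 2.0))       # 4.0 when gamma = 1.0
```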


2.5 Noisy network (noisy net)


As mentioned earlier, $\epsilon$-greedy exploration amounts to adding noise in the action space. A better method, called noisy net, adds noise in the parameter space instead: at the beginning of every episode, Gaussian noise is added to each parameter of the Q network, and the original Q function becomes $\tilde{Q}$:

$$a = \arg \max_a \tilde{Q}(s, a)$$

Note that the noise does not change within an episode.

OpenAI and DeepMind proposed essentially the same method at the same time, and both papers appeared at ICLR 2018. The difference is that OpenAI's method directly adds Gaussian noise to each parameter, while DeepMind's method is more elaborate: the noise is controlled by a set of learnable parameters, so the network can decide the scale of the noise by itself.

The advantage of adding noise in the parameter space is that within an episode the same state always leads to the same action, i.e. the exploration is state-dependent.
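A minimal sketch in the spirit of the OpenAI variant: a fixed-scale Gaussian perturbation is added to the weights of a linear layer and resampled once per episode (the scale `sigma` is an arbitrary assumption; in DeepMind's version the scale would itself be learned):

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose weights are perturbed by Gaussian noise."""
    def __init__(self, in_features, out_features, sigma=0.1):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.sigma = sigma
        self.reset_noise()

    def reset_noise(self):
        # Call once at the start of each episode; within an episode the noise
        # is fixed, so the same state always yields the same action.
        self.w_noise = torch.randn_like(self.linear.weight) * self.sigma
        self.b_noise = torch.randn_like(self.linear.bias) * self.sigma

    def forward(self, x):
        return nn.functional.linear(x, self.linear.weight + self.w_noise,
                                    self.linear.bias + self.b_noise)

layer = NoisyLinear(4, 3)
print(layer(torch.zeros(4)))
```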

References:
Parameter Space Noise for Exploration, OpenAI
Noisy Networks for Exploration, DeepMind


2.6 Distributional Q-function


Since the Q function is the expected value of the cumulative reward, and the same expectation can correspond to very different distributions, it makes more sense to model the distribution directly.

(Figure: distributional Q-function I)
The specific method is to restrict the possible return values to a certain range and have the network output the probability of each value in that range:
(Figure: distributional Q-function II)
Although in practice the action with the largest mean is still selected, other criteria are possible: for example, when two actions have the same mean, the less risky one (the one with the smaller variance) can be chosen.
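A sketch of only the action-selection part of a distributional Q network: the network outputs a probability distribution over a fixed grid of return values for each action, and the mean of each distribution plays the role of the Q value. The sizes and value range are arbitrary assumptions, and the training step (fitting the target distribution) is not shown.

```python
import torch
import torch.nn as nn

n_obs, n_act, n_atoms = 4, 3, 51
v_min, v_max = -10.0, 10.0
atoms = torch.linspace(v_min, v_max, n_atoms)    # fixed grid of possible returns

net = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU(), nn.Linear(64, n_act * n_atoms))

def act(state):
    logits = net(state).view(n_act, n_atoms)
    probs = logits.softmax(dim=-1)               # one distribution per action
    q_means = (probs * atoms).sum(dim=-1)        # expected return of each action
    return int(q_means.argmax())

print(act(torch.zeros(n_obs)))
```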

This method is not easy to implement.


2.7 rainbow


The last trick is called rainbow: imagine that each of the methods above corresponds to a colour; combining all of them gives a rainbow.
(Strictly speaking, only six tricks were listed above.)

The effect of rainbow can be compared against each individual method, where the median score is used to reduce the influence of fluctuations:
(Note: the comparison includes A3C, but no plain multi-step method.)

(Figure: rainbow I)

You can also remove one method at a time from rainbow to see how much each contributes:

(Figure: rainbow II)

It can be seen that removing double DQN has little effect. The explanation in the paper is that when the distributional Q-function is used, the Q values are usually not overestimated; if anything, they may be underestimated because the support of the distribution is bounded.

References:
Rainbow: Combining Improvements in Deep Reinforcement Learning


3. Continuous actions


DQN actually has some problems. The biggest problem is that it is not easy to deal with continuous actions.

As mentioned earlier, every time an action is selected, it must be the one with the largest Q value:

$$a = \arg \max_a Q(s, a)$$

When the action is continuous, this maximization cannot be done by simple enumeration.

There are four solutions:

The first scheme is to sample $N$ possible actions $\{ a_1, a_2, \ldots, a_N \}$, feed them into the Q function, and take the one with the largest Q value.

The second scheme treats the maximization as an optimization problem and solves it by gradient ascent on the action, but this means running an optimization (essentially training) every time an action is selected, so the computation is very heavy.

The third scheme is to design the network architecture so that maximizing the Q function becomes easy:

(Figure: continuous action)

Here the matrix $\Sigma(s)$ is required to be positive definite. Then the smaller $(a - \mu(s))^T \Sigma(s) (a - \mu(s))$ is, the larger the final Q value, so the maximizing action is simply $a = \mu(s)$. The paper describes how to guarantee that $\Sigma(s)$ is positive definite.
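A sketch of this kind of architecture (in the spirit of NAF, not necessarily the exact construction in the paper): the network outputs $\mu(s)$, $V(s)$, and a lower-triangular matrix from which a positive-definite $\Sigma(s)$ is built, so the maximizing action is simply $a = \mu(s)$. The sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class ContinuousQNet(nn.Module):
    def __init__(self, n_obs=4, n_act=2):
        super().__init__()
        self.n_act = n_act
        self.body = nn.Sequential(nn.Linear(n_obs, 64), nn.ReLU())
        self.mu = nn.Linear(64, n_act)              # mu(s): the maximizing action
        self.v = nn.Linear(64, 1)                   # V(s)
        self.l = nn.Linear(64, n_act * n_act)       # entries of a lower-triangular L

    def forward(self, s, a):
        h = self.body(s)
        mu, v = self.mu(h), self.v(h).squeeze(-1)
        L = torch.tril(self.l(h).view(-1, self.n_act, self.n_act))
        # Exponentiate the diagonal so Sigma = L L^T is strictly positive definite.
        diag = torch.diagonal(L, dim1=-2, dim2=-1)
        L = L - torch.diag_embed(diag) + torch.diag_embed(diag.exp())
        sigma = L @ L.transpose(-1, -2)
        diff = (a - mu).unsqueeze(-1)
        quad = (diff.transpose(-1, -2) @ sigma @ diff).squeeze(-1).squeeze(-1)
        return v - quad                             # Q(s, a); maximized at a = mu(s)

net = ContinuousQNet()
s, a = torch.zeros(1, 4), torch.zeros(1, 2)
print(net(s, a).shape)
```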

The fourth scheme is not to use DQN at all, but to combine a policy-based method such as PPO with the value-based DQN, which gives the actor-critic family of methods:
(Figure: actor-critic)

Source: blog.csdn.net/Zhang_0702_China/article/details/123423637