Reinforcement learning - DQN and its evolution (Double DQN, Dueling DQN)

1. DQN

1.1 Concept

      DQN has made three improvements compared to Q-Learning:

1. Introduce a neural network: as shown in the figure below, a network takes the state s as input and outputs the value Q(s,a) for each action, instead of looking the values up in a table.

2. Experience replay mechanism: transitions collected one after another while interacting with the environment are strongly correlated, whereas neural-network training assumes the samples are independent and identically distributed; put simply, successive training inputs should be independent of one another. Transitions are therefore stored in a buffer and drawn from it by random sampling.

3. Set up a separate target network: in the formula below, θ denotes the weights of the current (online) network and θ⁻ those of the separate target network; the target is computed with the target network, and the error between that target and the current network's estimate is used to continuously update θ (a minimal sketch of all three pieces follows this list).

L(\theta )=\left ( r_{t}+\gamma \max_{a}Q(s_{t+1},a;\theta ^{-})-Q(s_{t},a_{t};\theta ) \right )^{2}
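
A minimal PyTorch sketch of the three pieces above (the network size, hyper-parameters, and names such as QNet, buffer and train_step are illustrative assumptions, not taken from the original post):

import random
from collections import deque

import torch
import torch.nn as nn

# Q-network: maps a state vector to one Q-value per discrete action.
class QNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, s):
        return self.net(s)

state_dim, n_actions, gamma = 4, 2, 0.99
q_net = QNet(state_dim, n_actions)        # current (online) network, weights θ
target_net = QNet(state_dim, n_actions)   # separate target network, weights θ⁻
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Experience replay: transitions (s, a, r, s', done) are stored here and later
# drawn uniformly at random, breaking the correlation between consecutive samples.
buffer = deque(maxlen=10_000)

def train_step(batch_size=32):
    batch = random.sample(buffer, batch_size)               # random sampling
    s, a, r, s2, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    s2 = torch.tensor(s2, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s_t, a_t; θ)
    with torch.no_grad():
        # TD target computed with the frozen target network θ⁻
        y = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)                  # error drives the update of θ

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # θ⁻ is synchronised with θ every few hundred steps:
    # target_net.load_state_dict(q_net.state_dict())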

1.2 Evolution process

Q-Learning: the tabular form can only solve tasks with low-dimensional, discrete state and action spaces; controlling a robot in a high-dimensional continuous space is beyond it. A neural network is therefore used to approximate the value function as Q(s,a;θ), which gives DQN.

DQN: value-function approximation means approximating Q(s,a), with the ultimate goal of obtaining the optimal policy π(a|s). However, taking the maximum over estimated Q-values leads to overestimation, which is why Double DQN appeared.

Double DQN: use two networks, as in the formula below. The current network Q selects the action a that maximizes Q(s_{t+1},a), and the second network Q^{'} then assigns that chosen action its own value. Because the network that selects the action is no longer the one that evaluates it, an action that Q has overestimated will usually receive a lower value from Q^{'}, which mitigates the overestimation problem. The formula is as follows:

Q^{'}\left ( s_{t+1},\underset{a}{\arg\max }\,Q(s_{t+1},a) \right )
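
Sticking with the illustrative names from the DQN sketch above (q_net in the role of Q, target_net in the role of Q^{'}), the batched Double DQN target could be computed as:

import torch

def double_dqn_target(q_net, target_net, r, s2, done, gamma):
    """Double DQN TD target for a batch of transitions."""
    with torch.no_grad():
        # The current network Q selects the greedy action in s_{t+1} ...
        best_a = q_net(s2).argmax(dim=1, keepdim=True)
        # ... and the second network Q' only evaluates that chosen action.
        return r + gamma * (1 - done) * target_net(s2).gather(1, best_a).squeeze(1)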

      Going further, people began to ask how to pay more attention to the Q-values of the actions a that contribute most to the outcome, which is closer to how humans learn, and so the advantage function was introduced.

Dueling DQN: the concept of the advantage function is introduced here. For example, when driving on a highway with no cars in front or behind, we only need to pay attention to the state value V(s); but when another car appears, we must look at the value of each specific action, since different actions now lead to better or worse outcomes. This difference between actions is the action advantage.

      We know that the state value function V(s) is the weighted average of the action value functions Q(s,a). As shown in the figure below, in state s there are two action value functions, Q_{\pi }(s,a_{1}) and Q_{\pi }(s,a_{2}), which are weighted together into V_{\pi }(s); the corresponding advantage functions are A(s,a_{1})=Q_{\pi }(s,a_{1})-V_{\pi }(s) and A(s,a_{2})=Q_{\pi }(s,a_{2})-V_{\pi }(s).
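
As a concrete example with made-up numbers: if \pi chooses a_{1} with probability 0.6 and a_{2} with probability 0.4, and Q_{\pi }(s,a_{1})=10, Q_{\pi }(s,a_{2})=5, then

V_{\pi }(s)=0.6\cdot 10+0.4\cdot 5=8,\quad A(s,a_{1})=10-8=2,\quad A(s,a_{2})=5-8=-3.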

      As shown in the figure below, during training DQN and Double DQN output Q(s,a) directly, whereas Dueling DQN learns two streams, V(s) and A(s,a), which are finally combined to obtain Q(s,a).

     

      Looking further at the figure below, V(s) is taken to be the mean of the Q(s,a) values, and subtracting it from each Q(s,a) gives a different advantage function A(s,a)=Q(s,a)-V(s) for each action.
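
A minimal sketch of such a dueling head, again in PyTorch with illustrative layer sizes; when the two streams are recombined, the mean advantage is subtracted, which matches the averaging view above and keeps the V/A decomposition identifiable:

import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)               # V(s) stream
        self.advantage = nn.Linear(64, n_actions)   # A(s, a) stream

    def forward(self, s):
        h = self.feature(s)
        v = self.value(h)                           # shape (batch, 1)
        a = self.advantage(h)                       # shape (batch, n_actions)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

# usage: q_values = DuelingQNet(state_dim=4, n_actions=2)(torch.randn(32, 4))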


 
