Experience tuning DQN hyperparameters

Continued from the previous article's exercises on Deep Reinforcement Learning with Double Q-learning (H. van Hasselt et al., arXiv, 2015).

DQN has many hyperparameters, the main ones being: experience replay buffer size, neural network structure, the GAMMA value, the learning rate, the e_greedy value, and the positive and negative reward values. With so many hyperparameters it is hard to tune everything well, so each one needs to be understood and summarized carefully.

========================== Experience replay size ==========================
First, the size of the experience replay buffer. The agent stores its interactions with the environment in the replay buffer, which serves as the sample set for the neural network. My experience is that if the replay buffer is too small, some experience has to be discarded, and if the discarded experience happens to be important, training becomes unstable. So the larger the replay buffer, the better: if you are not sure which experience can safely be discarded, do not discard any. That said, simply throwing a large block of memory at the replay buffer does not solve everything. Which experience to store and which experience to sample for replay are important questions in their own right, but they cannot be discussed in detail here.

As you can see in my code, once the replay buffer is full, the oldest experience is overwritten with the newest (this is not good practice). If the replay buffer is small, the agent loses the experience of failing near the spawn point. As a result, shortly after the buffer fills up, the chance of the agent failing near the spawn point rises sharply, making training unstable.
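For concreteness, here is a minimal circular replay buffer in the same spirit (the capacity and state shape are placeholders, not the values from my code); note how store silently overwrites the oldest transition once the buffer is full:

```python
import numpy as np

class ReplayBuffer:
    """Minimal circular replay buffer sketch; sizes are illustrative."""

    def __init__(self, capacity=50000, state_dim=4):
        self.capacity = capacity
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.float32)
        self.index = 0   # next slot to write
        self.size = 0    # number of transitions currently stored

    def store(self, s, a, r, s_next, done):
        # Once full, the oldest transition is silently overwritten --
        # exactly the behaviour criticised in the text above.
        i = self.index
        self.states[i], self.actions[i], self.rewards[i] = s, a, r
        self.next_states[i], self.dones[i] = s_next, float(done)
        self.index = (self.index + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size=32):
        idx = np.random.randint(0, self.size, size=batch_size)
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.next_states[idx], self.dones[idx])
```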
As shown in the figure below, the output Q value reaches its maximum once the replay buffer is full; then, as old experience is discarded, the Q value oscillates and finally converges to a poor value.
[Figure: output Q value over training]

====================== Neural network structure ======================
I have not tried many different network structures. From the theory of empirical risk minimization, the appropriate complexity of the network is related to the number of samples. So when constructing the network we should consider the difficulty of the task, the size of the replay buffer, and e_greedy. The main point to note is that if e_greedy is set to be very exploratory, the experience in the replay buffer will be richer, which calls for a more complex network structure.
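As an illustrative starting point (not the network from the original code), a small fully connected Q-network in TensorFlow/Keras; the layer sizes are assumptions and should be scaled with task difficulty, replay size and e_greedy:

```python
import tensorflow as tf

def build_q_network(state_dim=4, n_actions=2, hidden_units=(64, 64), lr=0.001):
    """Small MLP that maps a state to one Q value per action."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(state_dim,)),
        tf.keras.layers.Dense(hidden_units[0], activation="relu"),
        tf.keras.layers.Dense(hidden_units[1], activation="relu"),
        tf.keras.layers.Dense(n_actions, activation="linear"),  # linear output: raw Q values
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss="mse")
    return model
```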

======================= GAMMA value =======================
With the other hyperparameters fixed, GAMMA values of 0.9, 0.95 and 0.99 were tested.
All three converge equally well, but with 0.9 convergence takes about 1500 episodes, with 0.95 about 2500 episodes, and with 0.99 about 6500 episodes, and the process is less stable.
My guess is that a higher GAMMA means we want the agent to pay more attention to the future, which is harder than focusing on the present, so training is slower and more difficult.
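For reference, GAMMA enters only through the one-step bootstrap target; a minimal NumPy sketch (shapes and names are mine, not taken from the original code):

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.9):
    """One-step Q-learning target r + gamma * max_a' Q(s', a');
    gamma = 0.9, 0.95, 0.99 are the values compared above."""
    # The bootstrap term is dropped for terminal transitions.
    return rewards + gamma * (1.0 - dones) * np.max(next_q_values, axis=1)
```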

============================== Learning rate ==============================
With the other hyperparameters fixed, learning rates a = 0.002, 0.001, 0.0005 and 0.0001 were tested. The quantity tracked is the episode length (steps per episode), which in each case rises quickly at first:
a = 0.002: rises to about 140 steps (the maximum is 200), then drops back, finally converging to about 40
a = 0.001: rises to about 140, finally converging to about 50
a = 0.0005: rises to about 180, converging to about 70
a = 0.0001: rises to 200 and does not drop

The trend is clear: with too high a learning rate, the early stage, when there are few samples, looks very good (although it is clearly overfitting), but as the number of samples grows and the learning task becomes more complex, too high a learning rate leads to poor convergence. As shown in the figure below, the cost drops only in the initial stage and oscillates afterwards:
[Figure: training cost curve]
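For context, here is a minimal sketch (assuming a TensorFlow/Keras setup; the names and the optimizer choice are mine, not from the original code) of where the learning rate enters a single DQN update:

```python
import tensorflow as tf

def train_step(q_net, optimizer, states, actions, targets):
    """One gradient step on the DQN regression loss; the learning rate is
    whatever `optimizer` was built with, e.g. Adam(learning_rate=0.001)."""
    with tf.GradientTape() as tape:
        q_all = q_net(states)                                # (batch, n_actions)
        q_taken = tf.gather(q_all, actions, batch_dims=1)    # Q(s, a) for the actions taken
        loss = tf.reduce_mean(tf.square(targets - q_taken))  # MSE toward the TD targets
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return loss
```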

============================= Positive and negative reward values =============================
The absolute value of the reward is limited to [-1, 1]. Because negative rewards are sparse, the negative reward is set to -1.
With the other hyperparameters fixed, positive rewards of 0.1, 0.01 and 0.001 were tested. With +r = 0.1 the result is very poor, with +r = 0.01 it is best, and with +r = 0.001 it comes second.

Ideally, I think a simple design such as -1 for negative rewards and +1 for positive rewards should work, because the reward only needs to distinguish the value of the actions. In practice, however, this is not the case. The figure below shows, for a positive reward of 0.1, the ratio of the average difference between the Q value of the good action and the Q value of the bad action to the magnitude of the output Q value. It is close to 10^-5 (the difference between 1 and 1.00001), which means a small amount of noise is enough to swamp it, so the network cannot reliably output the good action.
[Figure: relative Q-value gap with positive reward 0.1]

The Q values of the two actions in the figure above are too close because the negative reward appears relatively too small. So reduce the positive reward by an order of magnitude, to 0.01 (the negative reward stays at -1). As shown in the figure below, the relative gap is now on the order of 10^-2 (the difference between 1 and 1.01), and the actual result is much better.
[Figure: relative Q-value gap with positive reward 0.01]

So what happens if the positive reward is reduced further?
It is actually not good. When the positive reward is too weak, the average Q value becomes a fairly large negative number, and the differences between the output Q values are again not obvious.
As shown in the figure below, the relative gap is on the order of 10^-3 (the difference between 1 and 1.001).
[Figure: relative Q-value gap with positive reward 0.001]
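The relative gap plotted in these figures can be checked directly during training; a small sketch of the diagnostic (the naming is mine, not from the original code):

```python
import numpy as np

def relative_q_gap(q_values):
    """Average relative gap between the best and second-best Q value,
    i.e. the quantity described in the figures above.
    q_values: array of shape (batch, n_actions)."""
    sorted_q = np.sort(q_values, axis=1)
    gap = sorted_q[:, -1] - sorted_q[:, -2]       # best minus runner-up
    scale = np.abs(q_values).mean(axis=1) + 1e-8  # typical Q magnitude
    return float((gap / scale).mean())
```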

The lesson is that reward design is very important: the positive and negative rewards should be designed so that they balance against each other.
One design criterion is to make the mean Q value at convergence as close to 0 as possible, because that is when the differences between actions show up most clearly.
The second criterion is that the positive reward should be slightly larger, because the Q value first decreases and then gradually increases during training; the initial stage is dominated by the negative reward, and manually boosting the positive reward helps training.
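Putting those two criteria together, the reward scheme described above boils down to something like the following sketch (it assumes a survival-style task where the agent is rewarded each step it stays alive; the values are the ones tested above):

```python
POSITIVE_REWARD = 0.01   # best of the tested values 0.1 / 0.01 / 0.001
NEGATIVE_REWARD = -1.0   # sparse failure signal, clipped to [-1, 1]

def shape_reward(failed):
    """Shaped per-step reward: a small positive reward for surviving the step,
    a large negative reward when the episode ends in failure."""
    return NEGATIVE_REWARD if failed else POSITIVE_REWARD
```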

Source: blog.csdn.net/qq_32231743/article/details/72809101