(DDPG) Deep Deterministic Policy Gradient Tuning Experience

It took a week, but I finally got a working model running last night; it was really hard. I'm writing these notes down quickly before I forget.

Using the parameters from the paper directly, I could not get a working model, and in the end I modified essentially all of them. The figure below shows the paper's description of its hyperparameter configuration.
[Figure: hyperparameter configuration as described in the paper]

Going through the paper point by point:
1, "a base learning rate of 103 and 104 for the actor and critic respectively". The paper uses 103 To train the actor network with a learning rate of , use 104 The learning rate trains the critic network.
For training, my understanding is that the learning rate can be smaller, and that perception (interacting with the environment) and training do not need to happen in lockstep. In the paper's algorithm, the model is trained once after every environment step. In my opinion this is not necessary: perceiving the environment more makes the model more stable, while training too frequently just makes the model hover around its current optimum without having fully perceived the environment. So my approach is to use a small learning rate and train the model 25 times for every 200 environment steps, which speeds up overall learning. The learning rate I use is 10⁻³ (for models with different network structures, directly comparing learning rates is not very meaningful). A sketch of this schedule follows below.
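
Below is a minimal sketch of that collect-then-train schedule, assuming hypothetical `env`, `agent`, and `replay` objects with the obvious interfaces (none of these names come from the original post):

```python
STEPS_PER_ROUND = 200    # environment interactions ("perceptions") per round
UPDATES_PER_ROUND = 25   # gradient updates per round
NUM_ROUNDS = 1000        # arbitrary; not specified in the post
BATCH_SIZE = 32

state = env.reset()
for _ in range(NUM_ROUNDS):
    # 1) Perceive: interact with the environment and only store transitions.
    for _ in range(STEPS_PER_ROUND):
        action = agent.act(state, explore=True)
        next_state, reward, done, _ = env.step(action)
        replay.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

    # 2) Train: a short burst of updates on the accumulated experience.
    for _ in range(UPDATES_PER_ROUND):
        agent.train_batch(replay.sample(BATCH_SIZE))
```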

2. Next, the paper says it uses L2 regularization on the critic network with weight 0.01, a discount factor of 0.99, and a soft (TD) target-update parameter τ of 0.001.
I used L2 regularization on both networks with weight 0.006, a discount factor of 0.9, and a soft-update parameter τ of 0.1. Using a discount factor of 0.9 reduces the difficulty of training the model, though of course it also reduces what the model can achieve. Using τ = 0.1 for the target updates speeds up training, but of course this also reduces stability. These parameter values are the result of trade-offs. A sketch of the soft update is below.
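
As a minimal sketch of those two knobs, assuming `actor` and `critic` networks plus target copies already exist (PyTorch is my choice here, not something stated in the post; the L2 regularization is expressed through Adam's `weight_decay`):

```python
import torch

L2_WEIGHT = 0.006  # L2 regularization strength for both networks
GAMMA = 0.9        # discount factor; enters the critic target r + GAMMA * Q_target(s', a')
TAU = 0.1          # soft ("TD") target-update rate

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3, weight_decay=L2_WEIGHT)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3, weight_decay=L2_WEIGHT)

def soft_update(net, target_net, tau=TAU):
    """Move the target network a small step toward the online network."""
    for p, p_targ in zip(net.parameters(), target_net.parameters()):
        p_targ.data.mul_(1.0 - tau)
        p_targ.data.add_(tau * p.data)
```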

3. The paper says it uses a rectified non-linearity as the activation function (what is that? Isn't it just ReLU?).
I used the tanh activation function. The reason for not choosing ReLU is that its main advantage is fast convergence, a property this algorithm does not particularly care about. ReLU also has the drawback that some neurons can become permanently inactive; when using a large learning rate I have run into the situation where almost all neurons die and the output is constantly 0.

4. The paper uses a neural network with 2 hidden layers of 400 and 300 units.
I used the same structure but with fewer neurons, because my environment is relatively simple: 3 state dimensions and 1 action dimension. For the actor network I used hidden layers of 30 and 20 neurons; for the critic network, 50 and 40. A sketch of both networks follows.
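
A minimal sketch of those two networks in PyTorch (my choice of framework), with tanh activations as described above. Feeding the action into the critic's first layer and bounding the actor output with a final tanh are my assumptions, not details from the post:

```python
import torch.nn as nn

STATE_DIM, ACTION_DIM = 3, 1  # the simple environment described above

# Actor: maps a state to a single action.
actor = nn.Sequential(
    nn.Linear(STATE_DIM, 30), nn.Tanh(),
    nn.Linear(30, 20), nn.Tanh(),
    nn.Linear(20, ACTION_DIM), nn.Tanh(),  # final tanh keeps the action in [-1, 1]
)

# Critic: maps a (state, action) pair to a scalar Q-value.
critic = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 50), nn.Tanh(),
    nn.Linear(50, 40), nn.Tanh(),
    nn.Linear(40, 1),
)
```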

5. The paper uses a relatively elaborate parameter-initialization scheme, a batch size of 64, and a replay buffer of size 1,000,000.
My network is different, so the initialization scheme in the paper has no direct relevance; I used an N(0, 1/f_in) initialization, where f_in is the layer's fan-in. I used a batch size of 32, which trains faster but is less stable, and a replay buffer of 500,000 transitions, half the size used in the paper. Also, to speed up training, I use the full 500,000 replays only for the critic network, and only the most recent 100,000 for the actor. The reason: for a state s1, if an earlier version of the actor output a0 = 0.01 while the current actor outputs a1 = 0.5, then the actor does not need to improve its policy near a0 = 0.01, because it no longer outputs actions near 0.01, only actions near 0.5. So the actor should only learn from experience close to the current policy, while the critic needs experience from a wide range of the environment; the critic has to know more than the actor. A sketch of the initialization and the two buffers is below.
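
A minimal sketch of both ideas, reusing the `actor` and `critic` from the sketch above. Reading the initialization as a Gaussian with standard deviation 1/f_in is my assumption, as are the buffer implementation details:

```python
import random
from collections import deque

import torch.nn as nn

# (a) Fan-in-scaled Gaussian initialization.
def init_fan_in(module):
    if isinstance(module, nn.Linear):
        fan_in = module.weight.size(1)  # number of inputs to this layer
        nn.init.normal_(module.weight, mean=0.0, std=1.0 / fan_in)  # exact scale is an assumption
        nn.init.zeros_(module.bias)

actor.apply(init_fan_in)
critic.apply(init_fan_in)

# (b) Two replay buffers: a large one for the critic and a smaller, more
# recent one for the actor, so the actor only sees experience close to the
# current policy.
critic_replay = deque(maxlen=500_000)
actor_replay = deque(maxlen=100_000)

def store(transition):
    critic_replay.append(transition)
    actor_replay.append(transition)

def sample(buffer, batch_size=32):
    return random.sample(buffer, batch_size)
```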

6. The paper uses a relatively complex process (an Ornstein-Uhlenbeck process) to add exploration noise to the actions output by the model.
I just used simple Gaussian noise, purely for simplicity; a sketch is below.
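
A minimal sketch of the Gaussian exploration noise; the noise scale and action bounds are assumed values, not taken from the post:

```python
import numpy as np

NOISE_STD = 0.1                      # assumed noise scale
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # assumed action bounds

def noisy_action(action):
    """Add Gaussian exploration noise to a deterministic action and clip it."""
    noise = np.random.normal(0.0, NOISE_STD, size=np.shape(action))
    return np.clip(action + noise, ACTION_LOW, ACTION_HIGH)
```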

7. To normalize states and actions, the paper uses batch normalization.
I didn't use it: first, because the gym environments are well behaved and the states and actions are easy to normalize by hand; second, because with a two-layer network the vanishing-gradient problem is not significant; and third, for training speed (batch norm slows training by roughly 30%). A sketch of the manual normalization is below.
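
A minimal sketch of normalizing by hand instead of using batch normalization, assuming `env` is a gym environment whose observation and action spaces are bounded `Box` spaces:

```python
import numpy as np

obs_low, obs_high = env.observation_space.low, env.observation_space.high
act_low, act_high = env.action_space.low, env.action_space.high

def normalize_state(s):
    """Scale each observation dimension into [-1, 1]."""
    return 2.0 * (np.asarray(s) - obs_low) / (obs_high - obs_low) - 1.0

def scale_action(a):
    """Map the actor's tanh output in [-1, 1] back to the environment's range."""
    return act_low + 0.5 * (np.asarray(a) + 1.0) * (act_high - act_low)
```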

Origin: blog.csdn.net/qq_32231743/article/details/73770055