Neural network model optimization: training optimization

Batch training

Split the training data into many batches and feed them to the model batch by batch.

Advantages

  1. Faster training;
  2. Introduces randomness into the training process

The most important benefit of this randomness is that it helps avoid getting stuck in local optima. Dividing the data into many batches means each batch starts from a fairly random position, like scattering beans a little everywhere, so training is less likely to get trapped in one small pit.
(Figure: the red parts represent the batch data; almost every local optimum (green) gets stepped on, and the global optimum can then be found.)
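
As a minimal sketch of the idea (assuming NumPy arrays and a hypothetical train_step function, neither of which appears in the original), the batch-by-batch loop might look like this:

```python
import numpy as np

def iterate_minibatches(X, y, batch_size=64, shuffle=True):
    """Split (X, y) into mini-batches; shuffling each epoch adds the randomness."""
    indices = np.arange(len(X))
    if shuffle:
        np.random.shuffle(indices)  # random order -> each batch starts somewhere different
    for start in range(0, len(X), batch_size):
        idx = indices[start:start + batch_size]
        yield X[idx], y[idx]

# One epoch of batch-by-batch training (train_step is a placeholder for your update step):
# for X_batch, y_batch in iterate_minibatches(X_train, y_train, batch_size=64):
#     train_step(X_batch, y_batch)
```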

Gradient descent with momentum optimizer

Too much randomness has a downside: not every iteration moves in a good direction, which can make training take too long. To counter this, a parameter β can be introduced to add inertia to the updates.
(Figure: the momentum update formulas for v_w and v_b.)
The larger β is, the smaller (1 − β) is, so dW has less influence on v_w; in other words, v_w has more inertia and is harder to change. The same holds for v_b.
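
A minimal sketch of one momentum update step, assuming the standard form v = β·v + (1 − β)·gradient (the function name and the learning rate lr are illustrative, not from the original):

```python
def momentum_update(W, b, dW, db, v_w, v_b, lr=0.01, beta=0.9):
    """One gradient-descent-with-momentum step.
    Larger beta -> smaller (1 - beta) -> dW changes v_w less, i.e. more inertia."""
    v_w = beta * v_w + (1 - beta) * dW
    v_b = beta * v_b + (1 - beta) * db
    W = W - lr * v_w   # parameters move along the smoothed (inertial) direction
    b = b - lr * v_b
    return W, b, v_w, v_b
```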

RMSProp optimizer

The first formula uses dW², which differs from the previous optimizer's dW.
When W is updated, the change is divided by the square root of S_w.
(Figure: the RMSProp update formulas for S_w and S_b.)
Compared with the previous purely inertial approach, which stumbles toward the global optimum, this optimizer moves toward it more steadily.
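
A minimal sketch of one RMSProp step, assuming the standard squared-gradient accumulation (the eps term and default values are illustrative):

```python
import numpy as np

def rmsprop_update(W, b, dW, db, s_w, s_b, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp step: accumulate squared gradients, then scale the update
    by the square root of that accumulator so steps stay steadier."""
    s_w = beta * s_w + (1 - beta) * dW ** 2
    s_b = beta * s_b + (1 - beta) * db ** 2
    W = W - lr * dW / (np.sqrt(s_w) + eps)  # divide by the root of S_w
    b = b - lr * db / (np.sqrt(s_b) + eps)
    return W, b, s_w, s_b
```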

Advantages

Relatively stable

Adam optimizer (adaptive moment estimation)

It can be understood simply as a combination of the previous two optimizers. It introduces the parameters β_1, β_2, ε, and t.
β_1 is often set to 0.9, and β_2 to 0.999.
ε is a very small number, often 1e-8 or 1e-9, to keep the division from blowing up.
t is the iteration count, which effectively controls the step size: as the formulas below show, the bias correction makes v_w and v_b take large steps at the start, moving quickly toward the global optimum, and then become more cautious near the end, approaching it slowly.

In addition, these parameter values are not fixed; they usually need careful tuning in practice.
(Figure: the Adam update formulas, including the bias correction using β_1, β_2, ε, and t.)
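
A minimal sketch of one Adam step for W (b is handled the same way), assuming the standard bias-corrected form; the defaults mirror the values mentioned above, and the function name is illustrative:

```python
import numpy as np

def adam_update(W, dW, v_w, s_w, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum-style first moment plus RMSProp-style second moment,
    with bias correction that depends on the iteration count t (starting at 1)."""
    v_w = beta1 * v_w + (1 - beta1) * dW        # first moment (inertia)
    s_w = beta2 * s_w + (1 - beta2) * dW ** 2   # second moment (per-parameter scaling)
    v_hat = v_w / (1 - beta1 ** t)              # bias correction -> larger steps early on
    s_hat = s_w / (1 - beta2 ** t)
    W = W - lr * v_hat / (np.sqrt(s_hat) + eps) # eps keeps the division from blowing up
    return W, v_w, s_w
```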

Advantages

It works well for large-scale data and also copes with problems such as high noise and sparse gradients.

Origin blog.csdn.net/weixin_44092088/article/details/112980837