What to do when the training loss is unsatisfactory - machine learning

How to get better training results

Tips for training.


Model bias

The model is too simple.

Solution: redesign your model to make it more flexible.

Optimization

Optimization can fail at a critical point (a point where the gradient is 0). Critical points include both local minima and saddle points.

Notes:

  • Each additional parameter adds one dimension to the error surface; in higher dimensions, a critical point is more likely to be a saddle point than a local minimum.

  • Empirically, local minima are not that common, so in many cases the critical point we are stuck at is a saddle point.

  • At a saddle point, there is still a direction along which the loss can be further reduced.

Hessian matrix optimization


Near a critical point, L(θ) can be approximated by a second-order Taylor expansion around a point θ'. At a critical point the gradient g is 0, so the first-order (green) term vanishes and the local behavior is determined by the Hessian H.
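The figure showed the standard second-order Taylor approximation of L around θ'; reconstructing it here (g is the gradient and H the Hessian at θ'):

```latex
L(\theta) \approx L(\theta')
  + (\theta - \theta')^{\mathsf T} g
  + \tfrac{1}{2} (\theta - \theta')^{\mathsf T} H (\theta - \theta')
```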


H can also tell us a direction in which to update the parameters.

If you encounter a saddle point, you can take an eigenvector u corresponding to a negative eigenvalue of the Hessian matrix. Setting θ - θ' = u, i.e., θ = θ' + u, reduces the value of L(θ).

However, this method requires computing the Hessian matrix, which involves second-order derivatives, and then finding its eigenvalues and eigenvectors. The amount of computation is very large, so this method is rarely used in practice.
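The following is a minimal sketch, not the lecture's code, of this idea on a toy loss with a saddle point at the origin, assuming PyTorch is available:

```python
import torch

def loss_fn(theta):
    # Toy loss L = w1^2 - w2^2, which has a saddle point at (0, 0).
    return theta[0] ** 2 - theta[1] ** 2

theta = torch.zeros(2)                      # start exactly at the saddle point
H = torch.autograd.functional.hessian(loss_fn, theta)
eigvals, eigvecs = torch.linalg.eigh(H)     # H is symmetric

if eigvals.min() < 0:                       # negative eigenvalue => saddle point
    u = eigvecs[:, eigvals.argmin()]        # eigenvector of the most negative eigenvalue
    theta = theta + 0.1 * u                 # a small step along u lowers the loss
    print(loss_fn(theta))                   # below the saddle's loss of 0
```

Even on this two-parameter toy problem the cost of forming H and decomposing it is visible; for millions of parameters it is prohibitive, which is why the first-order methods below are preferred.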

Small Batch

Small batches tend to give better results than large batches.

  • When gradient descent gets stuck on one batch, switching to the next batch may not be stuck at the same point, which helps avoid falling into a local minimum and being unable to escape. This is an advantage of small batches.
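A minimal sketch of mini-batch training with PyTorch's DataLoader; the data and model here are hypothetical placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 1000 examples, 10 features, one regression target.
X, y = torch.randn(1000, 10), torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for xb, yb in loader:                        # each step sees a different small batch
    loss = torch.nn.functional.mse_loss(model(xb), yb)
    opt.zero_grad()
    loss.backward()                          # the gradient differs from batch to batch,
    opt.step()                               # so a point that stalls one batch may not stall the next
```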

Gradient Descent + Momentum

The idea simulates inertia in physics: the direction of motion depends not only on the current gradient, but also on the direction of the previous step of motion.

Movement: m^t = λ m^(t-1) - η g^(t-1), i.e., the weighted movement of the last step minus the current (scaled) gradient.

  • In this way, the problem of falling into a local minimum and being unable to escape can be avoided to a certain extent.


Unrolling the iteration, each movement m^i can be written as a weighted sum of all the previous gradients g^0, g^1, ..., g^(i-1).

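A minimal sketch of this update rule on a one-dimensional toy loss (the values of λ and η are assumed hyperparameters, not ones from the lecture):

```python
import torch

theta = torch.tensor([3.0], requires_grad=True)
m, lam, eta = torch.zeros(1), 0.9, 0.1       # m^0 = 0

for _ in range(50):
    loss = (theta ** 2).sum()                # toy loss L = theta^2
    loss.backward()
    with torch.no_grad():
        m = lam * m - eta * theta.grad       # m^t = lam * m^(t-1) - eta * g^(t-1)
        theta += m                           # theta^t = theta^(t-1) + m^t
    theta.grad.zero_()
```

In practice, torch.optim.SGD(params, lr=eta, momentum=lam) provides an equivalent built-in update.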

Adaptive Learning Rate

When the loss is no longer decreasing, the gradient does not necessarily become really small.

Therefore, it is necessary to have a customized learning rate for each parameter.

Training may get stuck bouncing between the walls of a steep valley in the error surface. Since the learning rate is fixed, the steps never become small enough to descend into the valley, and the loss can no longer decrease.

So wouldn't the problem be solved by simply reducing the learning rate?

However, directly reducing the learning rate may let training enter the valley, but in the flatter region that follows, the learning rate is then too small for training to make progress.

It is therefore desirable that:

  • The learning rate is larger where the surface is flat.
  • The learning rate is smaller where the surface is steep.

Solution:

Divide the original learning rate by a σ that is different for each parameter and each time step.

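The per-parameter update rule the figure illustrated is:

```latex
\theta_i^{t+1} = \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t
```

where g_i^t is the gradient of the loss with respect to parameter θ_i at step t.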

There are several common ways to calculate σ:

  • Root Mean Square
  • RMSProp
  • Adam

Root Mean Square

σ is the root mean square of all the gradients from previous training steps (this is the σ used in Adagrad).

Each previous gradient contributes equally to σ.

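A reconstruction of the formula the figure showed:

```latex
\sigma_i^t = \sqrt{\frac{1}{t+1} \sum_{k=0}^{t} \left(g_i^k\right)^2}
```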

This method assumes that, for a given parameter, the appropriate learning rate stays roughly the same over time. In reality, even the same parameter may need a learning rate that changes over time. This drawback can be avoided by using the RMSProp method.

RMSProp

σ is strongly affected by recent gradients and only weakly affected by older gradients.

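A reconstruction of the recursive formula the figures showed, with a hyperparameter 0 < α < 1 controlling how quickly old gradients are forgotten:

```latex
\sigma_i^0 = \left|g_i^0\right|, \qquad
\sigma_i^t = \sqrt{\alpha \left(\sigma_i^{t-1}\right)^2 + (1 - \alpha) \left(g_i^t\right)^2}
```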

Adam

RMSProp + Momentum

Adam is the most commonly used optimization method, and it has a well-tested implementation in PyTorch.

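Using it is a one-liner (the model here is a hypothetical placeholder; the betas shown are PyTorch's defaults):

```python
import torch

model = torch.nn.Linear(10, 1)    # hypothetical model
# betas[0] plays the role of the momentum decay, betas[1] the RMSProp decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```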

Learning Rate Scheduling

When training moves through a flat region for a long time, the accumulated σ becomes very small, which makes a single step very large. Once such an oversized step occurs, the parameters can suddenly shoot off in a wrong direction before being pulled back.


Learning Rate Scheduling can be used to make the learning rate change over time, which eliminates this problem.


There are two methods of Learning Rate Scheduling:

  • Learning Rate Decay
  • Warm Up : The Warm Up method often appears in classic papers. Since σ is a statistic estimated from the data, it is not accurate at the beginning of training. So at the start the learning rate is kept relatively small, and it is increased once the σ statistics have become more reliable. See the sketch below.

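A minimal sketch of warm up followed by decay, using PyTorch's LambdaLR (the step counts are assumed values, not ones from the lecture):

```python
import torch

model = torch.nn.Linear(10, 1)                       # hypothetical model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_steps, total_steps = 100, 1000

def schedule(step):
    # Ramp the learning rate up linearly, then decay it linearly to zero.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=schedule)

for step in range(total_steps):
    # ... forward pass and loss.backward() would go here ...
    opt.step()
    sched.step()                                     # update the learning rate
```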

Loss

The choice of Loss function will affect the difficulty of Optimization.

For example: in classification problems, Cross-entropy works better than Mean Square Error (MSE), so it is more commonly used.


With Mean Square Error, there are regions where the loss is large but the gradient is very small, so training may stall there. With Cross-entropy, where the loss is large the gradient is also large, which makes training easier.
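A small numerical check of this (my own sketch): for a confidently wrong prediction, compare the gradient magnitudes of the two losses.

```python
import torch
import torch.nn.functional as F

# True class is 0, but the logits strongly favor class 2 (large loss).
logits = torch.tensor([[-5.0, 0.0, 5.0]], requires_grad=True)
target = torch.tensor([0])

ce = F.cross_entropy(logits, target)
ce.backward()
print(logits.grad.abs().max())    # ~1.0: a healthy gradient, training can progress

logits.grad = None
mse = F.mse_loss(F.softmax(logits, dim=1), F.one_hot(target, 3).float())
mse.backward()
print(logits.grad.abs().max())    # orders of magnitude smaller: training stalls
```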

Summary of Optimization


Overfitting

Works well on training data, but poorly on testing data.

More training data - the most effective solution

  • Collect new data.

  • Data augmentation : Based on your own understanding of the data, transform the existing data to create more training examples (e.g., flipping images left-right, zooming in and out); see the sketch below.
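For images, a typical augmentation pipeline looks like this (a sketch assuming torchvision is installed):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # flip left-right
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random zoom and crop
    transforms.ToTensor(),
])
```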

Reduce the flexibility of the model

  • Fewer parameters; sharing parameters

  • Fewer features

  • Early stopping

  • Regularization

  • Dropout
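A sketch combining three of the techniques above on a hypothetical model: fewer parameters, Dropout, and L2 regularization via weight decay.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 16),    # a deliberately small hidden layer (fewer parameters)
    nn.ReLU(),
    nn.Dropout(p=0.5),    # randomly zeroes activations during training
    nn.Linear(16, 1),
)
# weight_decay adds an L2 penalty on the weights (regularization).
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```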

Mismatch

  • Your training and testing data have different distributions.

  • Be aware of how data is generated.


Source: blog.csdn.net/qq_61539914/article/details/127550290