Study Check-in Series Day 3 - Li Hongyi Machine Learning (October)

Table of contents

1. Sources of error 

2. Model selection

3. Gradient Descent 

1. Learning rate

2. Stochastic Gradient Descent

3. Feature Scaling


1. Sources of error 

  • Bias: High bias means the model is underfitting, and the model itself needs to be adjusted (for example, by adding features or using a more complex model). In this case, simply adding more data will not improve accuracy.
  • Variance: High variance means the model is overfitting. The simplest remedies are to add more data, or to use regularization to alleviate the overfitting (a sketch of both failure modes follows this list).
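
A minimal sketch of both failure modes, assuming NumPy polynomial fits on synthetic data (the data-generating function and the candidate degrees are illustrative, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a cubic signal plus noise (illustrative choice).
x = np.linspace(-1, 1, 30)
y = x**3 - x + rng.normal(scale=0.1, size=x.shape)

x_test = np.linspace(-1, 1, 200)
y_test = x_test**3 - x_test          # noiseless ground truth for comparison

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                          # fit the model
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1: high bias (underfits, both errors stay high);
    # degree 9: high variance (low training error, worse test error)
    print(f"degree={degree}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```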

2. Model selection

Incorrect approach: simply split the data into two parts, a training set and a test set; train models on the training set, evaluate them on the test set, and take the model that performs best on that test set as final. On the real test set, the error will usually be larger. A likely reason: if the model is overfitting, it performs very well on the internal test set, but the data distribution of the real test set differs slightly from that of the internal test set, so the model cannot handle the real test set well and the error grows.

 Correct approach:

  • Cross-validation: divide the data set into a training set, a validation set, and a test set at a fixed ratio, keeping them mutually independent. Train the candidate models on the training set, select the one that performs best on the validation set, and then evaluate it on the internal and external test sets. The reported error will usually be higher, but the model-selection strategy is more objective.

 However, some will argue that the split of the validation set is itself somewhat subjective, which may bias the model selection. N-fold cross-validation addresses this; judging from the relevant literature, 5-fold or 10-fold cross-validation is especially common in medical image analysis. In N-fold cross-validation, the data set is divided into N groups: each group in turn serves as the validation set while the remaining N-1 groups form the training set, the model is evaluated on that validation set, and the mean of the N validation errors is used to choose the best model. This way the full data set takes part in evaluating the model, which is more objective and well worth using! A minimal sketch follows.
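
A minimal NumPy sketch of N-fold cross-validation for model selection, assuming N = 5 and polynomial regression models of different degrees as the candidates (the data and candidates are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 100)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=x.shape)

N = 5                                    # number of folds
folds = np.array_split(rng.permutation(len(x)), N)

def cv_error(degree):
    """Mean validation MSE over the N folds for one candidate model."""
    errors = []
    for i in range(N):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(N) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        errors.append(np.mean((pred - y[val_idx]) ** 2))
    return np.mean(errors)

degrees = range(1, 10)
best = min(degrees, key=cv_error)        # lowest mean validation error wins
print("selected polynomial degree:", best)
```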

3. Gradient Descent 

1. Learning rate

In practice, to set an appropriate learning rate, plot the loss against the number of parameter updates and compare curves like these (a numerical sketch follows the list):

  • Yellow curve: learning rate far too large; the loss blows up
  • Blue curve: learning rate too small; the loss decreases very slowly
  • Green curve: learning rate somewhat too large; the loss drops quickly at first but then gets stuck at a poor value, which can leave the model underfitting
  • Red curve: an appropriate learning rate; the loss decreases steadily to a low value
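
A minimal numerical sketch of these regimes, assuming plain gradient descent on the toy loss L(w) = w² (an illustrative choice; on a convex quadratic the "green" stuck-at-a-plateau behavior cannot occur, so only three regimes appear):

```python
def final_loss(lr, steps=50, w0=5.0):
    """Gradient descent on L(w) = w**2; return the loss after `steps` updates."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w              # dL/dw = 2w
    return w ** 2

for lr, label in [(1.1, "too large: loss explodes"),
                  (1e-4, "too small: loss barely moves"),
                  (0.3, "appropriate: loss converges")]:
    print(f"lr={lr:<8} final loss={final_loss(lr):.3e}  ({label})")
```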

Adaptive learning rate:

Use a different learning rate at different epochs: start with a larger learning rate, and reduce it as the number of epochs grows, so that the loss keeps approaching the optimal value.
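
One simple schedule of this kind is a 1/√t decay (a common textbook choice; here $t$ is the update index and $\eta$ the initial rate):

$$\eta^t = \frac{\eta}{\sqrt{t+1}}$$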

  • Adagrad: divide each parameter's learning rate by the root mean square of that parameter's previous derivatives (see the sketch below)
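
A minimal NumPy sketch matching that description: a decaying learning rate is divided, per parameter, by the root mean square of all past derivatives (the toy loss, initial rate, and step count are illustrative assumptions):

```python
import numpy as np

# Toy loss with very different curvature per dimension (illustrative):
# L(w) = w[0]**2 + 10 * w[1]**2
def grad(w):
    return np.array([2 * w[0], 20 * w[1]])

w = np.array([3.0, 3.0])
eta = 1.0
sq_grads = []                       # history of squared derivatives per parameter

for t in range(100):
    g = grad(w)
    sq_grads.append(g ** 2)
    eta_t = eta / np.sqrt(t + 1)                          # decaying learning rate
    sigma = np.sqrt(np.mean(sq_grads, axis=0)) + 1e-12    # RMS of past derivatives
    w -= eta_t / sigma * g          # parameter-wise adaptive step

print("w after 100 Adagrad steps:", w)   # both entries approach 0
```

Note how the steep dimension (large gradients) automatically gets a smaller effective step than the flat one, so both converge at a similar pace.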

2. Stochastic Gradient Descent

Instead of computing the gradient over all the data, randomly select a single sample and update the weights using only that sample's gradient. Updates become far more frequent, so the model converges quickly (a sketch follows).
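
A minimal NumPy sketch of SGD for linear regression, assuming synthetic data, a fixed learning rate, and a squared-error loss (all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=200)

w, lr = np.zeros(2), 0.05

# One randomly chosen sample per update, instead of the full data set.
for _ in range(1000):
    i = rng.integers(len(X))
    err = X[i] @ w - y[i]
    w -= lr * err * X[i]          # gradient of 0.5 * (X[i] @ w - y[i])**2

print("estimated weights:", w)    # should land near [2, -3]
```

Each update is cheap and noisy; over many updates the noise averages out, which is why SGD reaches a good region much faster than waiting for full-batch passes.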

3. Feature Scaling

This is similar to standardizing variables in regression analysis: independent variables measured on different scales need to be standardized before being included in the model.

 (Figure: left, loss contours before feature scaling; right, contours after feature scaling, which is recommended.)

Why scale features?

As the figure shows, the red arrows mark the direction of gradient descent. In the left plot, the updates move along the normal of the elongated contour lines rather than straight toward the lowest point, so the path takes detours, and different variables would need different learning rates. In the right plot, after scaling, gradient descent heads straight for the minimum and is more efficient.

How to scale features? The essence is normalization: for each feature dimension i, compute its mean m_i and standard deviation σ_i over the training samples and replace every value x_i with (x_i - m_i) / σ_i, so that each dimension has mean 0 and variance 1 (see the sketch below).
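
A minimal NumPy sketch of both the how and the why, assuming a synthetic linear-regression problem where one feature lives on a 10x larger scale (the data, learning rates, and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two features on very different scales (the 10x factor is illustrative).
X = np.column_stack([rng.normal(size=300), 10 * rng.normal(size=300)])
y = X @ np.array([1.0, 1.0])

# The "how": subtract each feature's mean, divide by its standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
y_centered = y - y.mean()          # center the target since X was centered

def gd_mse(X, y, lr, steps=100):
    """Full-batch gradient descent for linear least squares; return final MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        err = X @ w - y
        w -= lr * X.T @ err / len(y)
    return np.mean((X @ w - y) ** 2)

# The "why": unscaled, the large-scale feature forces a small learning rate,
# so the small-scale direction crawls (the detour along elongated contours).
print("unscaled MSE:", gd_mse(X, y, lr=0.015))
# Scaled, the contours are round and one large rate suits every direction.
print("scaled MSE:  ", gd_mse(X_scaled, y_centered, lr=0.9))
```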

Learning experience:

Gradient descent is the most fundamental part of deep learning algorithms. Although I have already trained AI models, the most theoretical, foundational parts of gradient descent are still a bit unclear to me, so I need to keep studying!


Reprinted from: blog.csdn.net/weixin_41698730/article/details/120792321