A brief introduction to regularization

Date: 2020-07-16

Author: The 18th president of the CYL
Tags: machine learning, regularization, L1, L2
  • What is regularization:

Intuitively, regularization means adding an extra term to the loss function. That term is usually built from the L1 norm or the L2 norm of the weights and is accordingly called the L1 or L2 regularization term. (Note: other forms of regularization exist as well.)

L1 regularization term: the sum of the absolute values of the elements of the weight vector w, multiplied by a coefficient:
$\alpha \lVert w \rVert_1 = \alpha \sum_i |w_i|$
L2 regularization term: the square root of the sum of the squares of the elements of the weight vector w, multiplied by a coefficient:
$\alpha \lVert w \rVert_2 = \alpha \sqrt{\sum_i w_i^2}$
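As a concrete sketch (pure numpy, with a made-up toy weight vector and an arbitrary coefficient), the two penalty terms can be computed as follows; note that in practice the L2 penalty is often taken in its squared form rather than with the square root:

```python
import numpy as np

w = np.array([0.5, -1.2, 0.0, 3.0])   # toy weight vector (made-up values)
alpha = 0.01                          # regularization coefficient (hyperparameter)

l1_penalty = alpha * np.sum(np.abs(w))        # alpha * ||w||_1
l2_penalty = alpha * np.sqrt(np.sum(w ** 2))  # alpha * ||w||_2
# The squared form alpha/2 * sum(w**2) is also very common in practice.

print(l1_penalty, l2_penalty)
```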

  • What L1 and L2 regularization are used for:

L1 regularization: produces a sparse weight matrix, i.e. a sparse model, which can be used for feature selection. (Question 1: why does a sparse matrix allow feature selection? Question 2: why is a sparse matrix produced at all?)
L2 regularization: helps keep the model from overfitting. (Question 3: why does it help prevent overfitting?)

  • Answering the questions

    • Question 1: the relationship between a sparse matrix and feature selection:

    A sparse matrix here means a sparse coefficient matrix, in other words a sparse weight matrix: one in which most of the weights are 0. Such a matrix says that most features contribute nothing, or very little, to the model, so the features (neurons) that actually do contribute can be screened out.
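A tiny hypothetical illustration of that screening (the weight values and feature names below are made up):

```python
import numpy as np

# Hypothetical sparse weights produced by an L1-regularized model.
w = np.array([0.0, 2.3, 0.0, 0.0, -0.7])
feature_names = ["f0", "f1", "f2", "f3", "f4"]

# Features with weight exactly 0 contribute nothing to the prediction,
# so only the remaining features need to be kept.
selected = [name for name, wi in zip(feature_names, w) if wi != 0]
print(selected)  # ['f1', 'f4']
```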

    • Question 2: why does L1 regularization produce a sparse matrix (i.e. why does adding the sum of the absolute values of the weights drive the useless weights to zero)?
    • Step 1: Write out the loss function with the L1 term:
      $J = J_0 + \alpha \sum_i |w_i|$, where $J_0$ is the original (unregularized) loss.
    • Step 2: Consider the case of only two weights $w_1, w_2$, and let $L = \alpha(|w_1| + |w_2|)$. Mathematically, minimizing the full expression turns into finding the minimum of $J_0$ under the constraint given by $L$. (This is where you frantically try to recall your calculus.)
    • Step 3: Picture both in the $(w_1, w_2)$ plane: the circles are the contour lines of $J_0$, and the diamond $|w_1| + |w_2| = \text{const}$ is $L$.
      (Figure: contours of J0 together with the L1 diamond)
    • Step 4: The minimum (the first point where a contour of $J_0$ touches the diamond; why that touching point is optimal is again a calculus question) almost always sits at a sharp corner of $L$. Those corners lie on the coordinate axes, which means one of the weights (one feature) is exactly 0.
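A quick empirical check of this geometric argument — a sketch using scikit-learn's Lasso (L1) and Ridge (L2) on synthetic data; the dataset and the alpha value are arbitrary choices:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data: 20 features, only 5 of them informative.
X, y = make_regression(n_samples=200, n_features=20,
                       n_informative=5, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1-regularized linear model
ridge = Ridge(alpha=1.0).fit(X, y)   # L2-regularized linear model

print("L1: weights exactly 0 ->", int(np.sum(lasso.coef_ == 0)))  # typically many
print("L2: weights exactly 0 ->", int(np.sum(ridge.coef_ == 0)))  # typically none
```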
  • Question 3: why does L2 regularization help prevent overfitting?

    The simplification steps are analogous, except that the L2 constraint region is a circle $w_1^2 + w_2^2 \le \text{const}$ instead of a diamond.
    (Figure: contours of J0 together with the L2 circle)
    Now the optimal solution usually lies off the coordinate axes, so the weights are unlikely to be exactly 0 (we lose the feature-selection property), but because of the L2 penalty the parameters come out relatively small, which makes overfitting less likely (the sketch below gives a concrete feel for this). (Imagine one parameter with a very large weight: a small change in the corresponding input would drastically change the model's output. Put another way, the model has "memorized" particular values, its generalization ability is terrible and it resists perturbations poorly; the visible symptom is good accuracy on the training set but poor accuracy on the test set, i.e. overfitting.) This raises Question 4: why does adding L2 regularization make the parameters of the optimal solution generally small?
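As a concrete illustration of "smaller weights", here is a sketch comparing plain least squares with Ridge (L2) on noisy synthetic data; all numbers are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Noisy data with many features relative to the sample size.
X, y = make_regression(n_samples=50, n_features=30, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)    # no regularization
ridge = Ridge(alpha=10.0).fit(X, y)   # with L2 regularization

# The L2-penalized weights are typically smaller in magnitude, so a small change
# in one input feature perturbs the output less than for the unregularized model.
print("largest |w| without L2:", np.abs(ols.coef_).max())
print("largest |w| with    L2:", np.abs(ridge.coef_).max())
```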

  • Question 4: why does adding L2 regularization make the parameters of the optimal solution generally smaller?

    • Gradient descent (as a quick review) moves the weights step by step in the negative direction of the gradient.
    • With the regularization term (λ is the regularization coefficient; here the L2 term in the loss is taken in its squared form $\frac{\lambda}{2}\sum_i w_i^2$, and η is the learning rate), the gradient descent update becomes: $w \leftarrow w - \eta\left(\frac{\partial J_0}{\partial w} + \lambda w\right) = (1 - \eta\lambda)\,w - \eta\frac{\partial J_0}{\partial w}$
    • Without the regularization term, the update is simply: $w \leftarrow w - \eta\frac{\partial J_0}{\partial w}$
    • Comparing the two, with the L2 term every gradient descent step first multiplies the weight by the factor $(1 - \eta\lambda)$, a number less than 1, so the weights are steadily shrunk.
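A one-step numerical sketch of that weight-decay effect (the learning rate, coefficient, weights and gradient below are all made up):

```python
import numpy as np

eta, lam = 0.1, 0.5                 # learning rate and L2 coefficient (made up)
w = np.array([2.0, -3.0])           # current weights
grad_J0 = np.array([0.4, -0.2])     # pretend gradient of the unregularized loss J0

w_plain = w - eta * grad_J0                    # step without regularization
w_decay = (1 - eta * lam) * w - eta * grad_J0  # step with the L2 term

print(w_plain)  # [ 1.96 -2.98]
print(w_decay)  # [ 1.86 -2.83]  -- the weights have been shrunk toward 0
```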

A few more concluding remarks about regularization:

Choosing the L1 regularization coefficient:
• The larger the coefficient, the sparser the weight matrix tends to become (see the sketch after this list).

Choosing the L2 regularization coefficient:
• The larger the coefficient, the faster the weights decay and the smaller the parameters become; if they become too small, the model will underfit.

Regularization is not limited to these two forms:
• There are other regularization operations such as Dropout (used in the AlexNet model), which will be explained later together with the original AlexNet paper.
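A rough way to see the point about the L1 coefficient — a sketch sweeping Lasso's alpha on synthetic data (the grid of values is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20,
                       n_informative=5, noise=5.0, random_state=0)

# Larger L1 coefficient -> more weights driven to exactly zero (sparser model).
for alpha in [0.01, 0.1, 1.0, 10.0]:
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    print(f"alpha={alpha}: {int(np.sum(coef == 0))} of {coef.size} weights are 0")
```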

Source: blog.csdn.net/cyl_csdn_1/article/details/108685706