Machine Learning Notes (4) Model Generalization, Overfitting and Underfitting, L1 Regularization, L2 Regularization

1. Overfitting and underfitting


  • Underfitting
    The trained model cannot fully express the relationships in the data.

  • Overfitting
    The trained model expresses too much of the noise in the data.

2. Learning curve

Learning curve: plot the number of training samples on the horizontal axis, and the model's average score (with its score interval) on the training samples and on the cross-validation samples on the vertical axis. The resulting curve is the learning curve.

  • Underfitting learning curve
    The training and test curves both finally stabilize at a large error value.
  • Best learning curve
    The training and test curves both finally stabilize at a small error value.
  • Overfitting learning curve
    A large gap remains between the positions where the training and test curves finally stabilize.
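The curves described above can be computed by hand. The following is a minimal sketch, assuming RMSE as the score and a straight-line model fitted to synthetic quadratic data (so the model underfits). The data, the helper names `fit_line`, `rmse` and `learning_curve`, and the choice of score are all illustrative assumptions, not taken from the original notes.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def rmse(model, xs, ys):
    """Root mean squared error of the line `model` on (xs, ys)."""
    a, b = model
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def learning_curve(x_train, y_train, x_test, y_test, sizes):
    """For each training-set size m, fit on the first m samples and
    record (m, training RMSE, test RMSE)."""
    curve = []
    for m in sizes:
        model = fit_line(x_train[:m], y_train[:m])
        curve.append((m, rmse(model, x_train[:m], y_train[:m]),
                      rmse(model, x_test, y_test)))
    return curve

# Synthetic quadratic data (an assumption): a straight line underfits it.
rng = random.Random(0)

def sample(n):
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [0.5 * x * x + x + 2 + rng.gauss(0, 1) for x in xs]
    return xs, ys

x_train, y_train = sample(80)
x_test, y_test = sample(40)
curve = learning_curve(x_train, y_train, x_test, y_test, range(2, 81))
for m, train_err, test_err in curve[::13]:
    print(f"m={m:2d}  train RMSE={train_err:.2f}  test RMSE={test_err:.2f}")
```

Plotting the second and third columns of `curve` against the first gives the two curves. With very few samples the training error is close to zero, and it rises as more samples are added.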

3. Data division

  • Training set
    The data set used to train the internal parameters of the model. The classifier adjusts itself directly according to the training set to obtain better classification results.

  • Validation set
    Used to check the state and convergence of the model during training. The validation set is usually used to tune hyperparameters: several sets of hyperparameters are compared by how their models perform on the validation set, and the best-performing set is chosen.
    The validation set can also be used to monitor whether the model is overfitting during training. Generally speaking, once performance on the validation set has stabilized, continued training keeps improving performance on the training set, while performance on the validation set stops improving and starts to fall. This usually indicates overfitting.

  • Test set
    The test set is used to evaluate the generalization ability of the model. That is, the validation set has been used to determine the hyperparameters and the training set to fit the parameters; finally, a data set the model has never seen is used to judge whether the model works.

  • The difference between the three
    Figuratively speaking, the training set is like a student's textbook: students master knowledge from its content. The validation set is like homework: it shows how well different students are learning and how fast they progress. The test set is like the final exam: its questions have never been seen before, so it tests the students' ability to generalize from what they have learned.

  • Why do we need a test set?
    The training set directly participates in fitting the model, so it obviously cannot be used to reflect the model's true ability. Otherwise the students who memorize the textbook by rote (overfitting) would get the best grades, which is obviously wrong. In the same way, since the validation set participates in manual tuning (of hyperparameters), it cannot be used for the final judgment of a model either, just as students who only drill the question bank cannot be regarded as good students. Therefore a final exam (the test set) is needed to examine the real ability of a student (the model).
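As an illustration of the three-way division described above, here is a minimal sketch that shuffles the samples once and cuts them into disjoint training, validation and test sets. The function name `train_val_test_split` and the 60/20/20 ratios are assumptions chosen for the example, not values from the notes.

```python
import random

def train_val_test_split(samples, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle once, then cut into disjoint training, validation and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]                 # touched only once, at the very end
    val = shuffled[n_test:n_test + n_val]    # used to compare hyperparameters
    train = shuffled[n_test + n_val:]        # used to fit the model parameters
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Fixing the seed keeps the test set identical across experiments, so that no test sample ever leaks into training or tuning.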

However, it is obviously unreasonable to judge the quality of the model by only one test, so the cross-validation method will be introduced next.


4. Cross Validation

Cross-validation selects a certain proportion of the data as training samples and keeps the remaining samples as hold-out samples. The regression equation is fitted on the training samples and then used to make predictions on the hold-out samples. Since the hold-out samples are not involved in choosing the model parameters, performance on them is a more accurate estimate of how the model will behave on new data.

  1. k-fold cross-validation
    Divide the training data set into k parts; each part in turn serves as the validation set while the model is trained on the other k-1 parts. The disadvantage is that k models have to be trained each time, so the whole procedure is roughly k times slower.
  2. Leave-one-out cross-validation (LOO-CV)
    If the training data set has m samples, divide it into m parts with one sample each. This is called Leave-One-Out Cross Validation.
    Advantages: it is not affected by the randomness of the split at all, and comes closest to the model's true performance.
    Disadvantages: the computational cost is the highest.
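The two schemes above differ only in k, which a short sketch makes visible. The helpers `k_fold_indices` and `cross_val_score` are hypothetical names written for this example. The toy "model" that predicts the mean of the training targets is an assumption, and the folds are taken in order without shuffling to keep the code short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_score(xs, ys, k, fit, score):
    """Average validation score over k folds; k models are trained in total."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    return sum(scores) / len(scores)

def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    """Mean squared error of the constant prediction `model`."""
    return sum((model - y) ** 2 for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x for x in xs]
print(cross_val_score(xs, ys, k=5, fit=fit_mean, score=mse))        # 5-fold
print(cross_val_score(xs, ys, k=len(xs), fit=fit_mean, score=mse))  # leave-one-out: k = m
```

Leave-one-out is the special case k = m, which is why it needs the most model fits.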

5. Bias-Variance Trade-off

  • Bias
    Describes the difference between the expected value of the prediction and the true value. The larger the bias, the further the predictions are from the real data, as shown in the second row of the figure below.
    Main cause of bias: incorrect assumptions about the problem itself. For example, using linear regression on non-linear data, which generally shows up as underfitting.
  • Variance
    Describes the range of variation of the predictions, that is, how far they are dispersed around their expected value. The larger the variance, the more scattered the predictions, as shown in the right column of the figure below. With high variance, a small perturbation of the data can greatly affect the model. The usual cause is a model that is too complex, such as high-order polynomial regression, and it generally shows up as overfitting.
    [Figure: bias versus variance; the second row shows high bias and the right column shows high variance]

Some algorithms are inherently high-variance, such as kNN. Nonparametric learning is usually high-variance because it makes no assumptions about the data. Other algorithms are inherently high-bias, such as linear regression. Parametric learning is usually high-bias because it makes very strong assumptions about the data.

Most algorithms have parameters that adjust bias and variance, such as k in kNN. Bias and variance usually pull against each other: decreasing the bias increases the variance, and decreasing the variance increases the bias. The main challenge of machine learning comes from variance. Common means of dealing with high variance:

  1. Reduce model complexity
  2. Reduce data dimensionality; denoise
  3. Increase the number of samples
  4. Use the validation set
  5. Model regularization
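To make the trade-off described above concrete, the following sketch estimates squared bias and variance for kNN regression with different k by refitting on many noisy training sets. Everything specific here is an assumption made for illustration: the true function (a sine), the noise level, the grid of training points, and the helper names `knn_predict` and `bias_variance`.

```python
import math
import random

def knn_predict(x_train, y_train, x, k):
    """kNN regression in 1-D: average the targets of the k nearest training points."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:k]
    return sum(y_train[i] for i in nearest) / k

def bias_variance(k, n_train=30, n_repeats=200, noise=0.3, seed=0):
    """Estimate squared bias and variance of kNN predictions by refitting
    on many independently drawn noisy training sets (true function: sin)."""
    rng = random.Random(seed)
    x_train = [2 * math.pi * i / (n_train - 1) for i in range(n_train)]
    x_test = [0.5 + 0.5 * j for j in range(11)]
    preds = [[] for _ in x_test]
    for _ in range(n_repeats):
        y_train = [math.sin(x) + rng.gauss(0, noise) for x in x_train]
        for j, x in enumerate(x_test):
            preds[j].append(knn_predict(x_train, y_train, x, k))
    bias_sq = variance = 0.0
    for j, x in enumerate(x_test):
        mean_pred = sum(preds[j]) / n_repeats
        bias_sq += (mean_pred - math.sin(x)) ** 2          # expected prediction vs truth
        variance += sum((p - mean_pred) ** 2 for p in preds[j]) / n_repeats
    return bias_sq / len(x_test), variance / len(x_test)

for k in (1, 5, 30):
    b, v = bias_variance(k)
    print(f"k={k:2d}  bias^2={b:.3f}  variance={v:.3f}")
```

With k = 1 the prediction follows every noisy point (low bias, high variance). With k equal to the number of training points the prediction is a constant (high bias, low variance).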

6. Model regularization

In simple terms, regularization is a measure taken to reduce test error. When we build a machine learning model, the ultimate goal is for the model to perform well on new data. When a more complex model such as a neural network is used to fit the data, it overfits easily (good performance on the training set, poor performance on the test set), which lowers the model's generalization ability. In that case we use regularization to reduce the complexity of the model. In linear regression, if the parameters $\theta$ are too large and there are too many features, overfitting occurs easily, as shown in the following figure:
[Figure: overfitting in linear regression]

6.1 Regularization

Ridge regression and Lasso regression were introduced to solve the problem of overfitting, which they do by adding a regularization term to the loss function. Ridge regression was first used to handle the case where there are more features than samples, and is now also used to add bias to the estimate in order to obtain a better estimate. Introducing $\lambda$ limits the sum of all $\theta_i^2$, and this penalty term shrinks unimportant parameters. In statistics this technique is called shrinkage. Like ridge regression, LASSO, another shrinkage method, also adds a regularization term to limit the coefficients.
To prevent overfitting ($\theta$ becoming too large), a complexity penalty, that is, a regularization term, is added after the objective function $J(\theta)$. The regularization term can use the $L_1$ norm (LASSO Regression), the $L_2$ norm (Ridge Regression), or a combination of the $L_1$ and $L_2$ norms (Elastic Net).

6.2 Ridge Regression

$$J(\theta,b)=J(\theta,b)+\lambda\frac{1}{2}\sum_{i=1}^{m}\theta_i^2$$

6.3 LASSO Regression

$$J(\theta,b)=J(\theta,b)+\lambda\sum_{i=1}^{m}|\theta_i|$$

6.4 L1 regularization, L2 regularization and Elastic Net

  • L1 & L2 norm

First, the definition of a norm. For a vector $x$, its $L^p$ norm is defined as:
$$\|x\|_p=\left(\sum_i|x_i|^p\right)^{\frac{1}{p}}$$
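The definition translates directly into code. This is a minimal sketch with a hypothetical helper `p_norm`, evaluated on an example vector chosen so the results are easy to check by hand.

```python
def p_norm(x, p):
    """L^p norm of a vector: (sum_i |x_i|^p)^(1/p), for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0]
print(p_norm(x, 1))  # 7.0  (sum of absolute values)
print(p_norm(x, 2))  # 5.0  (Euclidean length)
```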
Adding a "penalty term" on the coefficients after the objective function is a common form of regularization. Its purpose is to prevent the coefficients from becoming too large and making the model complicated. The objective function after adding the regularization term is:
$$J(\theta,b)=J(\theta,b)+\frac{\lambda}{2m}\Omega(\theta)$$
where $\frac{\lambda}{2m}$ is a constant, $m$ is the number of samples, and $\lambda$ is a hyperparameter that controls the degree of regularization.

  • L1 regularization (LASSO)

For $L^1$ regularization, the corresponding penalty term is the $L^1$ norm:
$$\Omega(\theta)=\|\theta\|_1=\sum_i|\theta_i|$$

  • L2 regularization (Ridge)

For $L^2$ regularization, the corresponding penalty term is the (squared) $L^2$ norm:
$$\Omega(\theta)=\|\theta\|_2^2=\sum_i\theta_i^2$$

  • Elastic Net

The corresponding penalty term combines the $L^1$ norm and the $L^2$ norm:
$$J(\theta,b)=J(\theta,b)+\lambda\left(\rho\sum_{j}^{m}|\theta_j|+(1-\rho)\sum_{j}^{m}\theta_j^2\right)$$
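The three penalties above can be written as small functions. This is only a sketch: the function names are hypothetical, `base_cost` stands for the unregularized objective $J(\theta,b)$ and is passed in as a plain number, and the example coefficients are made up.

```python
def l1_penalty(theta):
    """Omega(theta) = ||theta||_1 = sum_i |theta_i|"""
    return sum(abs(t) for t in theta)

def l2_penalty(theta):
    """Omega(theta) = ||theta||_2^2 = sum_i theta_i^2"""
    return sum(t * t for t in theta)

def regularized_cost(base_cost, theta, lam, m, penalty):
    """J + lambda / (2m) * Omega(theta), with m the number of samples."""
    return base_cost + lam / (2 * m) * penalty(theta)

def elastic_net_cost(base_cost, theta, lam, rho):
    """J + lambda * (rho * sum|theta_j| + (1 - rho) * sum theta_j^2)"""
    return base_cost + lam * (rho * l1_penalty(theta) + (1 - rho) * l2_penalty(theta))

theta = [0.5, -2.0, 0.0]
print(l1_penalty(theta))   # 2.5
print(l2_penalty(theta))   # 4.25
print(regularized_cost(1.0, theta, lam=2.0, m=4, penalty=l1_penalty))  # 1.625
print(elastic_net_cost(1.0, theta, lam=1.0, rho=0.5))                  # 4.375
```

Setting `rho=1` in `elastic_net_cost` leaves only the $L^1$ term and `rho=0` leaves only the $L^2$ term, so $\rho$ controls the mix between the two penalties.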

  • The difference between L1 regularization and L2 regularization

It can be seen from the formulas above that $L^1$ regularization regularizes by adding the sum of the absolute values of all feature coefficients to the original objective function, while $L^2$ regularization does so by adding the sum of squares of all feature coefficients.
Both limit the parameter size by adding a sum term, but with different effects: $L^1$ regularization is better suited to feature selection, while $L^2$ regularization is better suited to preventing model overfitting.
Let's start from the perspective of gradient descent and explore the difference between the two.
For convenience, assume the data has only two features, with coefficients $\theta_1$ and $\theta_2$. The objective function with $L^1$ regularization is:
$$J=J+\frac{\lambda}{2m}(|\theta_1|+|\theta_2|)$$
In each update of $\theta_1$:
$$\theta_1 := \theta_1-\alpha\,d\theta_1=\theta_1-\frac{\alpha\lambda}{2m}\operatorname{sign}(\theta_1)-\alpha\frac{\partial J}{\partial\theta_1}$$
$$\operatorname{sign}(x)=\begin{cases}1, & x>0\\ 0, & x=0\\ -1, & x<0\end{cases}$$
Here $\frac{\partial J}{\partial\theta_1}$ is the gradient of the original (unregularized) objective. If $\theta_1$ is positive, each update subtracts a constant; if $\theta_1$ is negative, each update adds a constant. So it is easy for a feature's coefficient to end up at 0. A feature coefficient of 0 means that the feature has no effect on the result, so $L^1$ regularization makes the features sparse and acts as feature selection.
Now consider the objective function with $L^2$ regularization:
$$J=J+\frac{\lambda}{2m}(\theta_1^2+\theta_2^2)$$
In each update of $\theta_1$:
$$\theta_1 := \theta_1-\alpha\,d\theta_1=\left(1-\frac{\alpha\lambda}{m}\right)\theta_1-\alpha\frac{\partial J}{\partial\theta_1}$$
This formula shows that each update scales the feature coefficient proportionally instead of subtracting a fixed value as $L^1$ regularization does, so the coefficients tend to become smaller without becoming 0. $L^2$ regularization therefore makes the model simpler and prevents overfitting, but does not perform feature selection. This is the role of, and the difference between, $L^1$ and $L^2$ regularization.
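The two update rules can be compared numerically. The sketch below isolates the effect of the penalty by setting the gradient of the original objective to 0, which is an assumption made purely for illustration. It also assumes the learning rate $\alpha$ multiplies the whole gradient. The values of $\alpha$, $\lambda$ and $m$ are chosen so that the $L^1$ step size divides the starting value exactly.

```python
def sign(x):
    """Returns 1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def l1_step(theta, alpha, lam, m, grad=0.0):
    """theta - alpha*lam/(2m)*sign(theta) - alpha*grad"""
    return theta - alpha * lam / (2 * m) * sign(theta) - alpha * grad

def l2_step(theta, alpha, lam, m, grad=0.0):
    """(1 - alpha*lam/m)*theta - alpha*grad"""
    return (1 - alpha * lam / m) * theta - alpha * grad

alpha, lam, m = 0.5, 0.5, 1     # alpha*lam/(2m) = 0.125, alpha*lam/m = 0.25
theta_l1 = theta_l2 = 1.0
for step in range(1, 9):
    theta_l1 = l1_step(theta_l1, alpha, lam, m)   # subtracts the constant 0.125
    theta_l2 = l2_step(theta_l2, alpha, lam, m)   # multiplies by 0.75
    print(f"step {step}: L1 theta={theta_l1:.3f}  L2 theta={theta_l2:.3f}")
```

After eight steps the $L^1$ coefficient is exactly 0 and stays there, while the $L^2$ coefficient has shrunk to about 0.1 but is still non-zero. With step sizes that do not divide the coefficient evenly, this plain sign-based update would overshoot and oscillate around 0 rather than land on it exactly.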

A simple understanding of regularization:
1. The purpose of regularization: to prevent overfitting
2. The essence of regularization: constrain (restrict) the parameters to be optimized

Regarding the first point: overfitting means that, given a batch of noisy data, a model fitted to the data may fit the noise as well. This is fatal. On the one hand it makes the model more complicated, and on the other hand the model generalizes poorly, so the overfitted model has poor accuracy when it is tested on new data.
Regarding the second point: the original solution space is the whole space, but regularization adds constraints that shrink the solution space, and under certain regularization methods the solution even becomes sparse.
[Figure: Lasso constraint region (left) and Ridge constraint region (right), with the contours of the objective function]
The left side of the figure above is Lasso regression, and the right side is Ridge regression. The point where the red ellipse touches the blue region is the optimal solution of the objective function. If the region is a circle, the ellipse can touch it at any point on the circumference but rarely exactly on a coordinate axis, so there is no sparsity. If the region is a diamond (rhombus) or polygon, the ellipse easily touches it at a corner on a coordinate axis, which results in sparse parameters. This explains why the $L_1$ norm produces sparsity and why lasso can perform feature selection. Although ridge regression cannot perform feature selection, it constrains the norm of $\theta$ so that its values stay relatively small, which greatly reduces overfitting.
Here $\beta_1, \beta_2$ are the parameters of the model, that is, the target parameters to be optimized. The blue region is the solution space. As mentioned above, the solution space has been "shrunk", and within it we look for the $\beta_1, \beta_2$ that minimize the objective function. Now look at the red contours. The coordinate axes have nothing to do with the features (data); this is purely a parameter coordinate system. Each contour contains infinitely many pairs $\beta_1, \beta_2$, and they share one property: the objective function takes the same value at all of them. The center of the contours is the actual optimal parameter, but because we have restricted the solution space, the optimal solution can only lie in the "shrunk" solution space.
Take two variables as an example to explain the geometric meaning of ridge regression:

  1. Without constraints. The model parameters $\beta_1, \beta_2$ have been normalized. The residual sum of squares (RSS) is a quadratic function of $\beta_1, \beta_2$, which geometrically is a paraboloid.
    [Figure: RSS as a paraboloid over the $\beta_1$, $\beta_2$ plane]
  2. Ridge regression. The constraint is $\beta_1^2+\beta_2^2\leq t$, which corresponds to a circle in the $\beta_1$, $\beta_2$ plane, that is, the cylinder in the figure below.
    [Figure: the RSS paraboloid intersected by the cylinder of the ridge constraint]
    It can be seen that the ridge regression solution lies at some distance from the original least-squares solution.
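The distance between the two solutions can be checked numerically for two variables. The sketch below uses the penalized form of ridge regression, minimizing RSS plus $\lambda(\beta_1^2+\beta_2^2)$, which corresponds to the constraint $\beta_1^2+\beta_2^2\leq t$ for some $t$ that depends on $\lambda$. The toy data, the absence of an intercept, and the helper names `solve_2x2` and `ridge_two_features` are all assumptions made for illustration.

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det

def ridge_two_features(x1, x2, y, lam):
    """Minimize RSS + lam * (beta1^2 + beta2^2) for a model without intercept
    by solving (X^T X + lam * I) beta = X^T y. lam = 0 gives least squares."""
    s11 = sum(a * a for a in x1) + lam
    s22 = sum(b * b for b in x2) + lam
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    return solve_2x2(s11, s12, s12, s22, t1, t2)

# Hypothetical toy data, roughly y = 2*x1 + 1*x2.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0, 1.0, 4.0, 3.0]
y = [4.1, 4.9, 10.2, 10.8]

for lam in (0.0, 1.0, 10.0):
    b1, b2 = ridge_two_features(x1, x2, y, lam)
    print(f"lambda={lam:4.1f}  beta=({b1:.3f}, {b2:.3f})  beta1^2+beta2^2={b1*b1 + b2*b2:.3f}")
```

As $\lambda$ grows, $\beta_1^2+\beta_2^2$ shrinks, so the ridge solution moves away from the least-squares solution (the $\lambda=0$ row) toward the origin.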



Origin blog.csdn.net/qq_45723275/article/details/123789042