Why does L2 regularization make the model simpler?

Model overfitting manifests as follows: the error on the training set is small, while the error on the test set is large. As discussed previously, the cause of overfitting is that the training data contains some noise, and in order to fit every sample point (noise included) as closely as possible, we tend to use an overly complex model; the trained model then ends up heavily influenced by the noisy data. For example, the true relationship in the data may be well described by a straight line, but because of a few noisy points, the trained model becomes a wiggly curve, which makes the model perform badly on the test set. We can therefore view this process as a bad training set leading to a bad generalization error. But judging purely from the form overfitting takes, a bad (noisy) test set can also lead to a bad generalization error. This article introduces how the most commonly used **regularization** method, $L_2$ regularization, addresses the problem from both of these perspectives.

1 $L_2$ regularization

Take linear regression as an example. Suppose we add an $L_2$ regularization term to the end of the linear regression objective function and see what changes:

$$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}+{\color{red}\lambda\sum_{j=1}^{n}w_{j}^{2}}\tag{1}$$

The red part in formula (1) is the newly added $L_2$ regularization term. What does it do? As introduced earlier, the smaller the error between the true values and the predicted values (i.e., the closer the loss is to 0), the better the model's predictions, and we achieve this by minimizing the objective function. From formula (1) we can see that, in order to minimize the objective $J$, the red term must also be driven toward zero. As a result, the optimized parameters $w_j$ all end up near 0, and we obtain a smoother prediction model. What is the advantage of this?
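To make the shrinkage effect concrete, here is a minimal one-dimensional illustration (not from the original article): suppose the unregularized objective is $(w-1)^{2}$, minimized at $w=1$. Adding the penalty $\lambda w^{2}$ gives

$$J(w)=(w-1)^{2}+\lambda w^{2},\qquad \frac{dJ}{dw}=2(w-1)+2\lambda w=0\;\Longrightarrow\;w^{*}=\frac{1}{1+\lambda},$$

so any $\lambda>0$ pulls the optimum from $1$ toward $0$, and the larger $\lambda$ is, the stronger the shrinkage.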

1.1 Bad test set leads to bad generalization error

A bad test set leading to a bad generalization error means the following: the training set itself does not contain much noise, but because the test set contains a lot of noise, the trained model does not generalize well on the test set and produces a large error. This situation can be seen as the model fitting the training data too precisely, which is a form of overfitting. How does $L_2$ regularization address this problem?

Suppose formula (2) is the prediction model trained with the objective function in (1):

$$\hat{y}=w_{1}x_{1}+w_{2}x_{2}+\cdots+w_{n}x_{n}+b\tag{2}$$

Now suppose that, in a new (noisy) input sample, some feature dimension has changed from the $x_j$ seen during training to $x_j+\Delta x_j$. The prediction then changes from the training-time output $\hat{y}$ to $\hat{y}+\Delta x_{j}w_{j}$, i.e., an error of $\Delta x_{j}w_{j}$ is introduced. However, because $w_j$ lies in a small neighborhood of 0, only a small error is produced in the end. And the closer $w_j$ is to 0, the smaller the resulting error, which means the model is more resistant to noise interference; this improves the generalization ability of the model to a certain extent [1].

From this we can see that adding a regularization term to the original objective function smooths the trained parameters, making the model less sensitive to noisy data and thereby alleviating overfitting.
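As a quick sanity check of this claim, here is a minimal sketch (the weights and noise scale are made up for illustration) comparing how much a linear model's prediction moves under the same input noise when its weights are large versus small:

```python
import numpy as np

# A minimal sketch: two linear models that differ only in weight magnitude.
rng = np.random.default_rng(0)

x = rng.normal(size=(1000, 5))               # clean test inputs
noise = rng.normal(scale=0.1, size=x.shape)  # additive feature noise

w_large = np.array([4.0, -3.0, 5.0, -2.0, 3.5])  # "unregularized" weights
w_small = w_large / 10                           # weights shrunk toward 0

# The prediction change caused purely by the noise is
# (x + noise) @ w - x @ w = noise @ w.
print(np.abs(noise @ w_large).mean())  # large weights amplify the noise
print(np.abs(noise @ w_small).mean())  # small weights suppress it (10x smaller)
```

The output perturbation is exactly $\Delta x \cdot w$, so shrinking every $w_j$ by a factor of 10 shrinks the noise-induced error by the same factor.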

1.2 Bad training set leads to bad generalization error

A bad training set leading to a bad generalization error means the following: because the training set contains some noise, we end up using an overly complex model in order to drive the objective function as low as possible during training, and the resulting model fails to generalize well on the test set. Unlike the previous case, this situation is entirely caused by the model itself being fit inaccurately, which is also the most common cause of overfitting. So how does $L_2$ regularization reduce sensitivity to noisy data during training? To ease later understanding, let us first get an intuitive picture, from the figure, of what regularization does to the objective function.

As shown in the figure, the red curves on the left and right are the original objective function, and the blue curves are the objective function after $L_2$ regularization. We can see that the extreme values of the red curves have all changed; that is, the extreme points at which those values are attained have all moved, and they are all closer to the origin. Next, let us look at a projection onto the contour lines:

The red curves in the figure are again the contour lines of the original objective function, and the blue curves are the contour lines of the objective function after regularization. We can see that the extreme point of the objective function has changed as well, moving from the original $\left(\frac{1}{2},\frac{1}{2}\right)$ to $\left(\frac{1}{14},\frac{1}{4}\right)$, closer to the origin ($w_1$ and $w_2$ become smaller). At this point we can observe that **regularization changes the extreme points of the original objective function and, at the same time, makes the parameters smaller.** It is precisely for this reason that regularization has the effect of alleviating overfitting. But why does this happen?

Take the objective function $J_{1}=\frac{1}{6}\left(w_{1}-\frac{1}{2}\right)^{2}+\left(w_{2}-\frac{1}{2}\right)^{2}$ as an example; its extreme value is attained at the extreme point $\left(\frac{1}{2},\frac{1}{2}\right)$, where the gradient of $J_{1}$ is $(0,0)$. Now apply the regularization term $R=w_{1}^{2}+w_{2}^{2}$. Since the gradient of $R$ points away from the origin ($R$ is a quadric surface), adding the regularization term to the objective function is effectively equivalent to applying a gradient directed away from the origin to the objective function [2]. In layman's terms, regularization exerts a pull away from the origin on the original objective function at its extreme points (one can even imagine it as applying a force). Consequently, for the regularized objective $J_{2}=J_{1}+R$, the extreme point of $J_{2}$ (the blue dot in the figure above) is closer to the origin than the extreme point of $J_{1}$ (the red dot in the figure above). And this is the essence of regularization. With the groundwork laid, let us get to the main topic:
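The shift of the extreme point can be verified numerically. Below is a small sketch using sympy (taking the objective and penalty exactly as written above) that solves for the stationary points of $J_1$ and $J_2=J_1+R$:

```python
import sympy as sp

w1, w2 = sp.symbols("w1 w2")

J1 = sp.Rational(1, 6) * (w1 - sp.Rational(1, 2))**2 + (w2 - sp.Rational(1, 2))**2
R = w1**2 + w2**2  # the L2 penalty on (w1, w2)
J2 = J1 + R

# Stationary points: solve grad J = 0.
print(sp.solve([sp.diff(J1, w1), sp.diff(J1, w2)], [w1, w2]))  # {w1: 1/2,  w2: 1/2}
print(sp.solve([sp.diff(J2, w1), sp.diff(J2, w2)], [w1, w2]))  # {w1: 1/14, w2: 1/4}
```

This confirms that the extreme point moves from $\left(\frac{1}{2},\frac{1}{2}\right)$ to $\left(\frac{1}{14},\frac{1}{4}\right)$, closer to the origin.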

1.3 Simple model

Through the above introduction, we now know how $L_2$ regularization simplifies the model. But some readers may still hold a doubt (or misconception): they believe that a model represented by a high-degree polynomial must be more complex than one represented by a low-degree polynomial, e.g., that a fifth-degree polynomial is necessarily more complex than a second-degree one. This is wrong [3]. Higher-degree terms define a larger model space, which contains both complex and simple models. We only need to push the weights at the positions of the high-degree terms close to 0, and the model is simplified.
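This point can be seen in code. The sketch below (the data-generating line, noise level, and penalty strength are arbitrary choices for illustration) fits a fifth-degree polynomial to nearly linear data, once by plain least squares and once with scikit-learn's Ridge, which adds an $L_2$ penalty:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20).reshape(-1, 1)
y = 0.8 * x.ravel() + rng.normal(scale=0.1, size=20)  # a line plus noise

# Degree-5 polynomial features: a larger model space than a straight line.
X = PolynomialFeatures(degree=5, include_bias=False).fit_transform(x)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # Ridge = L2-regularized least squares

print(plain.coef_)  # high-degree coefficients can be large: a wiggly curve
print(ridge.coef_)  # high-degree coefficients shrink toward 0: close to a line
```

With the penalty, the coefficients of the high-degree terms are pushed toward 0, so the fifth-degree model effectively collapses to the simple, nearly linear model the data actually follows.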

2 Parameter update

After regularization is applied to the objective function, the gradients of the parameters change. Fortunately, since the regularization term is simply added to the original objective, the gradient with respect to $w$ just gains the gradient of the added term, while the gradient with respect to the bias $b$ is unchanged. Below we summarize how linear regression and logistic regression change after adding $L_2$ regularization.

2.1 Linear regression

$$J(w,b)=\frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)^{2}+{\color{red}\lambda\sum_{j=1}^{n}w_{j}^{2}}$$

$$\frac{\partial J}{\partial w_{j}}=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)x_{j}^{(i)}+{\color{red}2\lambda w_{j}},\qquad \frac{\partial J}{\partial b}=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)$$

The part in red is the change after adding $L_2$ regularization.
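As a code sketch of these formulas (the function name and the use of NumPy are my own; the article itself shows no code):

```python
import numpy as np

def linreg_gradients(w, b, X, y, lam):
    """Gradients of the L2-regularized linear regression objective above."""
    m = X.shape[0]
    y_hat = X @ w + b                           # predictions
    dw = (X.T @ (y_hat - y)) / m + 2 * lam * w  # extra 2*lam*w from the penalty
    db = (y_hat - y).mean()                     # bias gradient: unchanged
    return dw, db
```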

 

2.2 Logistic regression

$$J(w,b)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{y}^{(i)}+\left(1-y^{(i)}\right)\log\left(1-\hat{y}^{(i)}\right)\right]+{\color{red}\lambda\sum_{j=1}^{n}w_{j}^{2}},\qquad \hat{y}^{(i)}=\sigma\left(w^{\top}x^{(i)}+b\right)$$

$$\frac{\partial J}{\partial w_{j}}=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)x_{j}^{(i)}+{\color{red}2\lambda w_{j}},\qquad \frac{\partial J}{\partial b}=\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)$$

The part in red is the change after adding $L_2$ regularization.
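A corresponding sketch for logistic regression; the only difference from the linear case is the sigmoid applied to the linear output (the function name is again my own):

```python
import numpy as np

def logreg_gradients(w, b, X, y, lam):
    """Gradients of the L2-regularized logistic regression objective above."""
    m = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    dw = (X.T @ (y_hat - y)) / m + 2 * lam * w  # same extra 2*lam*w term
    db = (y_hat - y).mean()                     # bias gradient: unchanged
    return dw, db
```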

2.3 Gradient descent

$$w_{j}:=w_{j}-\alpha\frac{\partial J}{\partial w_{j}}=w_{j}-\alpha\left(\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)x_{j}^{(i)}+{\color{red}2\lambda w_{j}}\right)={\color{red}\left(1-2\alpha\lambda\right)}w_{j}-\alpha\cdot\frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)}-y^{(i)}\right)x_{j}^{(i)}$$

The part in red is the change after adding $L_2$ regularization. As the last form shows, $L_2$ regularization multiplies the weight $w_j$ by a factor smaller than 1 before subtracting the unpenalized gradient; for this reason, $L_2$ regularization is also called weight decay [4].
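The algebra can be checked with a few lines of NumPy (the weight and gradient values below are made up):

```python
import numpy as np

# One gradient descent step with an L2 penalty equals first decaying w by
# (1 - 2*alpha*lam) and then applying the unpenalized gradient step.
alpha, lam = 0.1, 0.01
w = np.array([1.0, -2.0, 3.0])
grad = np.array([0.5, 0.1, -0.3])  # assumed gradient of the data term

step_a = w - alpha * (grad + 2 * lam * w)      # penalized gradient step
step_b = (1 - 2 * alpha * lam) * w - alpha * grad  # weight decay form

print(np.allclose(step_a, step_b))  # True: the two forms coincide
```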

3 Summary

In this article, the author first used examples to explain in detail how $L_2$ regularization alleviates model overfitting, and why $L_2$ regularization makes the model simpler. Next, the author described how the gradient update formulas change after regularization is added: the only new part is the gradient contributed by the regularization term. Finally, the author illustrated the effect of $L_2$ regularization with an example. Note that this article only covers the most frequently used $L_2$ regularization; other penalties, such as $L_1$ regularization, are left for readers to explore on their own. This is the end of this article; thanks for reading!
 
