Is your model overfitting again? Why not try L1 and L2 regularization


L1 regularization and L2 regularization were proposed to alleviate model overfitting, so before diving into them, let's first talk about what overfitting and underfitting are.

  • Overfitting
    In machine learning, training a model on data really means learning the distribution of the training data. Driven by the loss function, the model tries to remember every sample point in the training set (i.e., to fully learn the training distribution). Take linear regression as an example: it fits a function that passes as close as possible to every sample point, as shown in the figure. But as we all know, manually collected data sets are never perfect; they contain noise such as missing values and outliers. If the function fits this noise as well, the model's loss on the training data becomes very low, which looks great: it correctly identifies most of the training data. Because of the noise, however, the model has actually evolved in the wrong direction, so its generalization is weak; given a new data set it has never seen before, it may perform very poorly. This is overfitting: the model over-learns the training data during training, but the training data cannot always represent the overall data distribution, so the model performs poorly on other data and generalizes badly.
    [Figure: a fitted curve that passes through every training point, noise included]

  • Underfitting
    Underfitting, as the name suggests, is the opposite of overfitting. Overfitting means the model fits the training set too closely, fails to learn the real data distribution, and generalizes poorly. Underfitting means the model cannot even fit the training data well, let alone generalize. The cause is usually a problem with the data, or a model that is too simple. The figure makes this clear at a glance.
    [Figure: an underfitted model that fails to capture the trend of the training data]
    Now that we have covered overfitting and underfitting and their impact on model performance (the short code sketch right after this list contrasts the two on noisy data), is there a way to address them? Let's look at regularization.
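Here is a minimal sketch, not from the original post, that contrasts the two situations: plain least-squares polynomial fits of increasing degree on the same noisy 1-D data, comparing training and test error. It assumes NumPy and scikit-learn are available; all names and numbers are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)   # noisy training samples
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                               # noise-free truth

for degree in (1, 4, 15):   # underfit / reasonable / overfit
    coefs = np.polyfit(x_train, y_train, degree)       # ordinary least squares, no penalty
    train_mse = mean_squared_error(y_train, np.polyval(coefs, x_train))
    test_mse = mean_squared_error(y_test, np.polyval(coefs, x_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")

# Typically the degree-15 fit drives the training error toward zero while its test error
# blows up (overfitting), and the degree-1 fit is poor on both sets (underfitting).
```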

Regularization

Regularization is the general term for a class of machine-learning methods that add extra information to the original loss function in order to prevent overfitting and improve generalization. The most commonly used additions are L1 regularization and L2 regularization, also called the L1 norm and L2 norm.

L1 and L2 regularization are extra penalty terms added to the loss function; they place restrictions on some of the model's parameters so that the learned model does not become too complex. For the linear regression example above, linear regression + L1 regularization = Lasso regression, and linear regression + L2 regularization = Ridge regression. The loss functions are as follows:

  • Linear Regression + L1 Regularization = Lasso Regression

$$J(\omega)=\sum_{i=1}^{m}\left(y_i-\omega^{T}x_i\right)^{2}+\lambda\lVert\omega\rVert_{1}\tag{1}$$

  • Linear regression + L2 regularization = Ridge regression

$$J(\omega)=\sum_{i=1}^{m}\left(y_i-\omega^{T}x_i\right)^{2}+\lambda\lVert\omega\rVert_{2}^{2}\tag{2}$$
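As a concrete reading of formulas (1) and (2), here is a hedged NumPy sketch of the two penalized objectives; the function names are mine, and the scaling constants (e.g. 1/2m) that individual libraries prepend to the squared-error term are omitted.

```python
import numpy as np

def lasso_loss(X, y, w, lam):
    """Squared error + lambda * L1 norm of the weights, as in formula (1)."""
    residual = X @ w - y
    return residual @ residual + lam * np.sum(np.abs(w))

def ridge_loss(X, y, w, lam):
    """Squared error + lambda * squared L2 norm of the weights, as in formula (2)."""
    residual = X @ w - y
    return residual @ residual + lam * np.sum(w ** 2)
```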

In the formulas, ω denotes the model parameters. It can be seen that both regularizers place limits on the model parameters:

  • L1 regularization is the sum of the absolute values of the elements of the weight vector ω, usually written ||ω||1;
  • L2 regularization is the square root of the sum of the squares of the elements of ω, usually written ||ω||2 (note that the L2 term in the Ridge loss carries a square);
  • λ in the formulas controls how strongly the regularizer constrains the model; it is what we usually call a hyperparameter and must be tuned to a concrete value.
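For a quick sanity check of the two norms, here is a tiny illustrative computation (toy numbers, not from the post):

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 3.0])
l1 = np.sum(np.abs(w))          # ||w||_1 = 0.5 + 2.0 + 0.0 + 3.0 = 5.5
l2 = np.sqrt(np.sum(w ** 2))    # ||w||_2 = sqrt(0.25 + 4 + 0 + 9) ≈ 3.64
print(l1, l2)                   # Ridge actually penalizes l2 ** 2, i.e. 13.25
```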

As formulas (1) and (2) show, the training goal is to minimize the loss function. Adding the L1 or L2 penalty therefore pushes the model parameters to be as small as possible (close to 0), and in the L1 case sparse. Briefly, sparsity means that many model parameters are exactly zero and therefore play no role. In classification tasks the feature dimension of the data is often very high, perhaps 10K+, and not all of those features help with classification. We would like the model to automatically find the features that do help and ignore the unimportant ones (i.e., drive their weights close to, or exactly to, zero). After this combination of moves, the model attends only to the features with relatively large weights and ignores those with small weights, which amounts to automatic feature selection; this improves generalization and reduces the risk of overfitting, especially when the sample size is small, as the sketch below illustrates.
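A hedged sketch of this automatic feature selection effect, assuming scikit-learn's Lasso and Ridge estimators (their constructor argument alpha plays the role of λ); the data is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(42)
n_samples, n_features, n_informative = 100, 200, 5
X = rng.normal(size=(n_samples, n_features))
true_w = np.zeros(n_features)
true_w[:n_informative] = rng.normal(5, 1, n_informative)   # only 5 features matter
y = X @ true_w + rng.normal(0, 0.5, n_samples)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)
print("exact zeros in Lasso:", np.sum(lasso.coef_ == 0))   # typically most of the 200
print("exact zeros in Ridge:", np.sum(ridge.coef_ == 0))   # typically none
```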

L1 regularization

As mentioned earlier, regularization adds constraints to the loss function. Mathematically, this turns training into a constrained optimization problem, as in formula (3), which can then be handled with a Lagrangian function, as in formula (4).
$$\min_{\omega}\ \sum_{i=1}^{m}\left(y_i-\omega^{T}x_i\right)^{2}\quad\text{s.t.}\ \lVert\omega\rVert_{1}\le C\tag{3}$$
$$L(\omega,\lambda)=\sum_{i=1}^{m}\left(y_i-\omega^{T}x_i\right)^{2}+\lambda\left(\lVert\omega\rVert_{1}-C\right)\tag{4}$$
Suppose ω* and λ* are the optimal solution of the optimization problem above. By the KKT conditions we obtain formula (5).
$$\nabla_{\omega}\left[\sum_{i=1}^{m}\left(y_i-\omega_*^{T}x_i\right)^{2}+\lambda_*\lVert\omega_*\rVert_{1}\right]=0,\qquad\lambda_*\ge 0\tag{5}$$
We know that L1 regularization is the sum of the absolute values of the parameters. Consider the simplest two-dimensional case with only two parameters. The L1 term L1_norm then equals |ω₁| + |ω₂|. Drawing L1_norm in the coordinate system gives the figure below, together with the contour lines of the original loss function (without the regularization term).
[Figure: the diamond-shaped L1 constraint region and the contours of the original loss function]
From optimization theory, the optimum of the constrained problem is reached where a contour of the original loss function first touches the L1 constraint region. The L1 region has four corners, and the contour is very likely to touch it at one of those corners; at each corner one of the parameters (ω₁ or ω₂) is zero. Extending to higher-dimensional features, the L1 region has many corners and edges, and the protruding corners are the parts most likely to be touched first, and at those corners many parameters are zero. So after adding L1 regularization, many parameters become zero (sparse). This is why L1 regularization produces a sparse model and can further be used for feature selection; the sketch below shows the same effect in one dimension.
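The geometric story can also be seen in one dimension. For the scalar problem min_w ½(w − a)² + λ|w| the standard closed-form solution is soft thresholding, which sets small weights exactly to zero, while the corresponding L2 problem min_w ½(w − a)² + ½λw² only rescales a. This derivation is not in the original post; the sketch below just illustrates the well-known result.

```python
import numpy as np

def l1_solution(a, lam):
    """Soft thresholding: exact zero whenever |a| <= lam."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_solution(a, lam):
    """Pure shrinkage: smaller, but never exactly zero (unless a is zero)."""
    return a / (1.0 + lam)

a = np.array([-3.0, -0.4, 0.2, 2.5])   # unregularized least-squares weights
print(l1_solution(a, 1.0))   # [-2.  -0.   0.   1.5]  -> small weights killed to zero
print(l2_solution(a, 1.0))   # [-1.5 -0.2  0.1  1.25] -> all weights survive, just smaller
```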

L2 regularization

Having covered L1 regularization, L2 regularization only changes the form of the constraint. Since the L2 term is a sum of squares, its constraint region is a circle (a ball in higher dimensions), as shown in the figure. Because the corners are smoothed away compared with L1 regularization, the chance that the contour of the original loss function touches the region at a point where a parameter is exactly zero is much smaller, which is why L2 regularization does not produce sparsity.
[Figure: the circular L2 constraint region and the contours of the original loss function]
Why L2 regularization prevents overfitting

During fitting we usually prefer the weights to be as small as possible, ending up with a model in which all parameters are relatively small. A model with small parameter values is generally considered simpler, adapts to different data sets, and avoids overfitting to some extent. Intuitively, for a linear regression equation with large parameters, shifting the data even a little changes the result a lot; if the parameters are small enough, a small shift in the data has little effect. In more professional terms, the model has strong resistance to perturbations, as the small sketch below shows.
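A tiny numeric sketch of this anti-disturbance argument (the weight values are made up for illustration): the same small input shift moves the prediction of a large-weight model far more than that of a small-weight one.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
delta = np.array([0.1, 0.1, 0.1])            # a small shift of the input

w_large = np.array([50.0, -30.0, 20.0])      # hypothetical unregularized weights
w_small = np.array([0.5, -0.3, 0.2])         # hypothetical weights after L2 shrinkage

print(abs(w_large @ (x + delta) - w_large @ x))   # |w . delta| = 4.0
print(abs(w_small @ (x + delta) - w_small @ x))   # 0.04
```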

Summary

This article explained what overfitting and underfitting are, introduced regularization, covered L1 and L2 regularization in detail, and used formula derivations and figures to show why they sparsify the model and prevent overfitting. I hope it helps, and thank you for reading. If you have any ideas or questions, feel free to share them in the comments; if you found the post useful, a like is appreciated. Give someone a rose and the fragrance lingers on your hand~~

Source: blog.csdn.net/Just_do_myself/article/details/118614575