Ridge regression, LASSO regression, the elastic net, and the loss function in machine learning

Today we are going to talk about something purely technical. It is foundational material that later articles rely on, so we have to cover it now. In machine learning we can prevent overfitting through regularization. What is regularization? The common methods are ridge regression, LASSO regression, and the elastic net.

First, what is overfitting? Let's look at the picture below.

[Figure 1: underfitting (left), a good fit (middle), overfitting (right)]
The left panel is underfitting, the middle one is a good fit, and the right one is overfitting: it pays too much attention to local details, so the boundary becomes overly complicated. We should not fit for the sake of fitting. A model like the one on the right fits this particular data set very well but will not work on other data, so an overfitted model has no practical value. The picture below classifies data in the same way.
Figure 2:
[Figure 2: the same data classified with a simple boundary versus an overly complex boundary]
We can see how complicated the model on the right side of the picture above has become in order to increase the degree of fit. The purpose of regularization is to suppress this kind of complexity and so prevent overfitting, mainly by using a penalty function to compress the coefficients of the model. Before that, let's talk about the least squares method and the loss function.

The least squares method is widely used in statistics and machine learning.
There are 4 red points in the picture below and we want to fit their trend, so we need to find a line. Suppose we have found it: the blue line in the picture. The Y at a red dot is the actual value of Y, and the Y of the blue line at the same X is the predicted value of Y. The difference between the actual value and the predicted value is our error, also called the residual in statistics; it is the green part.

[Figure: four red data points, the fitted blue line, and the green residual segments]
So how do we find the most appropriate line to fit the trend of the 4 points? We look for the line that makes

(Y1(actual) − Y1(predicted)) + (Y2(actual) − Y2(predicted)) + (Y3(actual) − Y3(predicted)) + (Y4(actual) − Y4(predicted))

as small as possible, that is, the line for which the total length of all the green lines is smallest. Because an actual value minus a predicted value can be negative, we square each difference before adding them up. In statistics we usually write Y_i for the actual value and Ŷ_i for the predicted value, so the quantity to be minimized can be written as the sum of (Y_i − Ŷ_i)² over all the points:

Loss = Σ (Y_i − Ŷ_i)²
In machine learning, the function that we minimize over the parameters is called the loss function. In this example the sum-of-squares function is our loss function: it measures the total error, and the line that makes the sum of all the squared errors smallest is our best-fitting line.
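As a minimal sketch of this idea in R (the four points below are made-up values, not the ones in the figure), we can fit the line with lm() and compute the sum of squared errors ourselves:

# four illustrative points (made-up data, not the points in the figure)
x <- c(1, 2, 3, 4)
y <- c(1.1, 1.9, 3.2, 3.9)

fit <- lm(y ~ x)              # ordinary least squares line
res <- y - predict(fit)       # actual value minus predicted value (the green lines)
sum(res^2)                    # the sum of squared errors that the line minimizes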

The situation just now was relatively simple. When things get more complicated, as in the right-hand situation in Figure 2, overfitting can occur in the pursuit of minimizing the loss function. The common regularization methods we use to prevent overfitting are ridge regression, LASSO regression, and the elastic net. Let's start with ridge regression.
Ridge regression adds a penalty term, called the L2 norm, to the loss function above: the β coefficients are squared and summed, and the sum is then multiplied by a coefficient λ. Note that the penalty does not include the intercept term.

λ Σ β_j²   (the L2 norm penalty)
After adding the L2 norm, the loss function becomes:

Loss = Σ (Y_i − Ŷ_i)² + λ Σ β_j²
We can see that once the L2 norm is multiplied by λ, a model with many β coefficients produces a large accumulated sum of squared β, and λ controls how strongly that sum is compressed when the loss function is optimized. λ is a hyperparameter: it cannot be estimated from the data, and an optimal λ can only be found through repeated rounds of cross-validation. In our previous article "Teach you step by step how to use R language to do LASSO regression" we showed the following picture (it was generated by LASSO regression, but the principle is the same): λ keeps getting larger, the model is cross-validated at each value, and the best-performing λ is kept, while along the way the coefficients can be seen being compressed.

[Figure: cross-validation results as λ increases, with the coefficients being compressed]

[Figure: the circular ridge constraint region compressing the β1 and β2 coefficients]
From the figure above we can see that when λ = 0 the coefficients β1 and β2 are both 1.5. As λ increases and the solution is pulled into the circular constraint region, β1 and β2 both become smaller; at the point ridge regression selects, the β2 coefficient has been compressed to 0.6. Ridge regression therefore keeps the model from overfitting the training data.
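As a rough sketch of how a ridge fit could be run with the glmnet package (x is assumed to be a predictor matrix and y a response vector; the lambda value shown is purely illustrative):

library(glmnet)

ridge_fit <- glmnet(x, y, alpha = 0)           # alpha = 0 selects the ridge (L2) penalty
plot(ridge_fit, xvar = "lambda", label = TRUE) # coefficient paths shrinking as lambda grows
coef(ridge_fit, s = 0.1)                       # coefficients at one illustrative lambda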
Next, let’s talk about LASSO regression. What is the difference between it and ridge regression? Ridge regression uses the L2 norm, while LASSO uses the L1 norm. The formula is as follows:

λ Σ |β_j|   (the L1 norm penalty)
The ridge penalty accumulates the squares of the β coefficients, while the LASSO penalty accumulates their absolute values, so the LASSO loss function is:

Loss = Σ (Y_i − Ŷ_i)² + λ Σ |β_j|

So what is the practical difference between the two? Ridge regression compresses the coefficients but never compresses them all the way to 0, whereas LASSO regression can compress coefficients exactly to 0, which lets us use it to filter out variables.
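A matching sketch for LASSO, again assuming x and y already exist; note how some coefficients come out exactly 0:

library(glmnet)

lasso_fit <- glmnet(x, y, alpha = 1)   # alpha = 1 selects the LASSO (L1) penalty (the default)
coef(lasso_fit, s = 0.1)               # at a large enough lambda some coefficients are exactly 0,
                                       # which is what lets LASSO filter out variables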

[Figure: circular ridge constraint region versus diamond-shaped LASSO constraint region]
We can see that the constraint region of the ridge penalty is circular, while that of the LASSO penalty is diamond-shaped. At the same λ, ridge regression gives β2 a coefficient of 0.6, while LASSO regression gives β2 a coefficient of 0, meaning β2 has been removed from the model.

Finally, let's talk about what an elastic net is. The elastic net is a packaged combination of ridge regression and LASSO regression, and it introduces an extra hyperparameter α. Its loss function can be written as:

Loss = Σ (Y_i − Ŷ_i)² + λ [α Σ |β_j| + (1 − α) Σ β_j²]

When α is 0 the elastic net reduces to ridge regression, and when α is 1 it reduces to LASSO regression. It therefore combines the advantages of ridge regression and LASSO regression, which makes it quite popular.
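A sketch of an elastic net fit with glmnet; alpha = 0.5 here is just an illustrative mixing value between the ridge and LASSO extremes, and x and y are again assumed to exist:

library(glmnet)

enet_fit <- glmnet(x, y, alpha = 0.5)   # 0 < alpha < 1 mixes the L1 and L2 penalties
plot(enet_fit, xvar = "lambda")         # coefficient paths of the elastic net fit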
At this point the theory is basically done (talking about theory really is time-consuming). Now let's revisit my article "Teaching You Step by Step to Use R Language to Do LASSO Regression" together with the glmnet package.
Let’s take a look at the explanation of the parameters of the glmnet function in the glmnet package.

[Screenshot: documentation of the glmnet() arguments, including alpha]
The main parameter to look at is alpha: 1 is the lasso penalty and 0 is the ridge penalty. Doesn't that look exactly like our elastic net?

Next, let's take a look at the plot generated when cross-validating the LASSO fit.

cvfit=cv.glmnet(x,y)   # cross-validated fit over a sequence of lambda values
plot(cvfit)            # plot the cross-validated MSE against log(lambda)

[Figure: cross-validated MSE versus log(lambda) produced by plot(cvfit)]
We can see that the cv.glmnet function generates a plot of how the MSE changes with lambda.

cv.glmnet is a cross-validation function: it computes the MSE over the sequence of lambda values and finally arrives at an optimal one, which confirms what we said above.
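Once the cross-validation has run, the chosen lambda and its coefficients can be pulled out of cvfit like this (a sketch continuing the code above):

cvfit$lambda.min              # lambda giving the smallest cross-validated MSE
cvfit$lambda.1se              # largest lambda within one standard error of that minimum
coef(cvfit, s = "lambda.min") # coefficients of the model at the optimal lambda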

Finally, to summarize: today we gave a preliminary introduction to ridge regression, LASSO regression, the elastic net, and the loss function. The loss function is an important part of machine learning, and there is no way around talking about it. When we have time later we will introduce how to manually derive the results of logistic regression or linear regression in R. Once you understand the principle and how the regression produces its results, many problems in R become easy to solve, such as the following warnings: 1: glm.fit: algorithm did not converge 2: glm.fit: fitted probabilities numerically 0 or 1 occurred

These are all caused by the model not converging. Let’s talk about it when we have time.

Origin blog.csdn.net/dege857/article/details/132784205