Linear regression regularization (penalty) terms: principles, types, and Python implementation

Prerequisites for this blog:
The derivation of the least squares solution for linear regression and its from-scratch Python implementation
The principle of feature expansion for linear regression and its Python implementation

1 The meaning of regular terms

In linear regression, regularization is a technique for controlling model complexity: a term based on the size of the coefficients is added to the loss function, which penalizes overly complex models. The L1 or L2 regularization term is usually used, and the two take the following forms:

L1 regular term (Lasso):

$L_1 = \lambda \sum_{i=1}^{p} \left| w_i \right|$

L2 regular term (Ridge):

$L_2 = \lambda \sum_{i=1}^{p} w_i^2$

where $p$ is the number of coefficients, $w_i$ is the $i$-th coefficient, and $\lambda$ is the regularization parameter that controls the strength of regularization.

The L1 regularization term uses the sum of the absolute values of the coefficients as the penalty, which can drive some coefficients in the model to exactly 0, thereby achieving feature selection. This means that the L1 regularizer can pick out the most important features in the model and improve its generalization ability.

The L2 regularization term uses the sum of the squares of the coefficients as the penalty, which prevents the coefficients in the model from growing too large and thereby reduces overfitting. The L2 regularizer is widely used in many machine learning tasks.
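As a quick illustration (my addition, not from the original post), both penalties are straightforward to compute for a given coefficient vector. The short NumPy sketch below assumes an arbitrary example vector w and regularization strength lam:

import numpy as np

# hypothetical coefficient vector and regularization strength, for illustration only
w = np.array([0.5, -1.2, 0.0, 3.4])
lam = 0.1

l1_penalty = lam * np.sum(np.abs(w))   # L1: lambda * sum(|w_i|)
l2_penalty = lam * np.sum(w ** 2)      # L2: lambda * sum(w_i^2)
print(l1_penalty, l2_penalty)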

Taking the L2 regularization term as an example, before adding it the loss function is:

$J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2$

After adding the regularization term, the loss function becomes:

$J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$

where $h_w(x^{(i)})$ is the predicted value, $y^{(i)}$ is the actual value, $w_j$ is the weight of the $j$-th feature, and $\lambda$ is the regularization parameter that controls the strength of regularization.

This means that during model fitting, a balance is struck between the model's performance on the training set and the model's complexity. In practice, cross-validation is often used to select the optimal regularization parameter $\lambda$ so as to obtain the best performance and generalization ability. Regularization is a very effective technique that helps alleviate overfitting and improve the generalization ability of the model.
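It may also help to note (my addition) that the L2-penalized loss above has a closed-form minimizer, $w = (X^T X + \lambda I)^{-1} X^T y$, a standard ridge-regression result; the bias term is usually left out of the penalty and is ignored in this sketch. A minimal NumPy version under those assumptions:

import numpy as np

def ridge_closed_form(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^-1 X^T y (no bias handling here)
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# hypothetical toy data for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
print(ridge_closed_form(X, y, lam=1.0))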

2 The difference between L1 and L2 regularization terms

In addition to the algorithmic differences between L1 and L2 described above, another important difference is that the solution under the L1 penalty may not be unique, while the solution under the L2 penalty is unique. This is because the constraint region of the L1 penalty is a diamond whose corners lie on the coordinate axes, while the constraint region of the L2 penalty is a circle with a smooth boundary. See the figure below:
[Figure: the diamond-shaped L1 constraint region versus the circular L2 constraint region, intersected by the contours of the loss function]
At the same time, this picture also explains why L1 regularization can select the most important features in the model: the intersection point is more likely to land on a coordinate axis, and a point on a coordinate axis means that some $\theta$ becomes 0.
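One way to make this geometric picture precise (my supplementary note, not from the original post) is to write the two penalized problems in their equivalent constrained form, where the "diamond" and the "circle" are the feasible regions, with $t$ a bound corresponding to the chosen $\lambda$:

$\min_w \frac{1}{2m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{n} |w_j| \le t \quad \text{(L1, diamond)}$

$\min_w \frac{1}{2m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{n} w_j^2 \le t \quad \text{(L2, circle)}$

The solution lies where the contours of the squared-error loss first touch the feasible region; for the diamond this contact often happens at a corner, which sits on a coordinate axis and therefore zeroes out a coefficient.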

It is usually hard to say whether L1 or L2 regularization is better, which is why the Elastic Net regularization technique emerged: it combines the L1 and L2 penalties.

The regularization term of Elastic Net can be expressed as:

$L_{1,2} = \lambda_1 \sum_{i=1}^{p} \left| w_i \right| + \lambda_2 \sum_{i=1}^{p} w_i^2$

where $\lambda_1$ and $\lambda_2$ are the regularization parameters of the L1 and L2 terms respectively, $p$ is the number of coefficients, and $w_i$ is the $i$-th coefficient.

The main advantage of Elastic Net is that it can overcome the respective shortcomings of the L1 and L2 penalties while keeping their strengths. By adjusting the weights of the L1 and L2 terms, the trade-off between feature selection and coefficient shrinkage can be controlled, achieving better performance and generalization ability.

3 Python implementation of regularization

3.1 Lasso regularization

The Lasso regression model can be imported with from sklearn.linear_model import Lasso. The subsequent modeling steps are as follows:

  1. Import the required libraries and data (as in the previous blog, the Boston housing dataset is used here)
from sklearn.linear_model import Lasso
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target
  2. Create the Lasso model object
lasso = Lasso(alpha=1.0)
  3. Fit the model
lasso.fit(X, y)
  4. Access the coefficients of the model
print(lasso.coef_)
  5. Predict
y_pred = lasso.predict(X)
  6. Evaluate model performance
from sklearn.metrics import r2_score
print(r2_score(y, y_pred))

The output of this code is:
[Output: the Lasso coefficient vector, in which several entries are exactly 0, followed by the R² score]
As you can see, some of the $\theta$ values are 0: adding Lasso regularization successfully performs feature screening.
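To see the feature-screening effect more explicitly, one can list which features were zeroed out. A small follow-on sketch (my addition, reusing the lasso, X, and boston objects fitted above):

import numpy as np

zero_mask = lasso.coef_ == 0
print("features dropped by Lasso:", np.array(boston.feature_names)[zero_mask])
print("features kept by Lasso:  ", np.array(boston.feature_names)[~zero_mask])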

3.2 Ridge regularization

Ridge regression can be imported in Python with from sklearn.linear_model import Ridge. The subsequent model training and use are exactly the same as for Lasso, so they are not repeated here.
Also using the Boston dataset, the result of Ridge regularization is:
[Output: the Ridge coefficient vector, with no zero entries but very small values for unimportant features]
As you can see, this time no $\theta$ is exactly zero, but the $\theta$ values corresponding to unimportant features become very small. This is the effect mentioned earlier of preventing coefficients from growing too large, thereby reducing overfitting.
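For completeness, here is a minimal Ridge sketch on the same data (my addition; the workflow simply mirrors the Lasso steps above, and alpha=1.0 is just an example value):

from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

ridge = Ridge(alpha=1.0)   # alpha plays the role of lambda
ridge.fit(X, y)            # X, y: the Boston data loaded in section 3.1
print(ridge.coef_)         # coefficients are shrunk but not set to 0
print(r2_score(y, ridge.predict(X)))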

3.3 Elastic Net regularization

Elastic Net regression can be imported in Python with from sklearn.linear_model import ElasticNet. Compared with the previous two methods, ElasticNet takes one more parameter when the model object is created:

elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)

The l1_ratio parameter controls the weight between the L1 and L2 penalties and takes a value between 0 and 1: a value of 1 is equivalent to Lasso, and a value of 0 is equivalent to Ridge. Everything else is exactly the same as above.
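As a concrete usage sketch (my addition): in scikit-learn the ElasticNet objective combines the two penalties as alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2 on top of the mean-squared-error term, so alpha and l1_ratio together play the role of $\lambda_1$ and $\lambda_2$ above:

from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)  # equal weighting of the L1 and L2 parts
elastic_net.fit(X, y)                              # X, y: the Boston data loaded in section 3.1
print(elastic_net.coef_)                           # typically a mix of exact zeros (L1 effect) and shrunken values (L2 effect)
print(r2_score(y, elastic_net.predict(X)))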

4 Case examples

Below we use the Boston housing price dataset to compare the performance of the three regularization methods.
First, we import the data and apply a second-order polynomial expansion to it. The polynomial expansion is used here, first, to review the content of the previous blog, and second, to simulate the situation of having many useless features, which often occurs in real projects.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import cross_val_score, KFold

# Load the Boston housing dataset
boston = load_boston()
X = boston.data
y = boston.target

# Apply a 2nd-order polynomial expansion
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
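As a quick sanity check (my addition, assuming the standard Boston dataset with 506 samples and 13 features), the degree-2 expansion without a bias column blows the 13 original features up to 104, which is exactly the "too many features" situation we want to simulate:

print(X.shape)       # (506, 13)  original features
print(X_poly.shape)  # (506, 104) after the degree-2 expansion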

Next, define the LinearRegression, Lasso, Ridge, and ElasticNet models. The penalty coefficient of Ridge regression here is much higher than for the other two regularized models, because the blogger has already experimented with this dataset and found a suitable hyperparameter range. In actual projects, it is recommended to perform a grid search (GridSearch) based on the situation at hand.

# Define the LinearRegression, Lasso, Ridge, and ElasticNet models
lr = LinearRegression()
lasso = Lasso(alpha=1.0)
ridge = Ridge(alpha=100.0)
elastic_net = ElasticNet()

To evaluate the models more objectively, we use 5-fold cross-validation to measure their performance.

# Define 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Evaluate model performance with cross-validation
scores_lr = cross_val_score(lr, X_poly, y, scoring='r2', cv=cv)
scores_lasso = cross_val_score(lasso, X_poly, y, scoring='r2', cv=cv)
scores_ridge = cross_val_score(ridge, X_poly, y, scoring='r2', cv=cv)
scores_elastic_net = cross_val_score(elastic_net, X_poly, y, scoring='r2', cv=cv)

Finally, the score of each model is calculated and output, and the best model is selected.

# Compute the cross-validated R-squared scores
r2_lr = np.mean(scores_lr)
r2_lasso = np.mean(scores_lasso)
r2_ridge = np.mean(scores_ridge)
r2_elastic_net = np.mean(scores_elastic_net)

# Print the R-squared score of each model
print('Linear Regression R2:', r2_lr)
print('Lasso R2:', r2_lasso)
print('Ridge R2:', r2_ridge)
print('ElasticNet R2:', r2_elastic_net)

The overall output of this code is:
[Output: the cross-validated R² scores of LinearRegression, Lasso, Ridge, and ElasticNet; the four scores are close to one another]

It can be seen that the performance of the models does not differ much. This is mainly because the Boston dataset has enough data and is relatively well-behaved, so every model performs reasonably well; moreover, we did not tune the hyperparameters of each model.

Finally, a reminder: if you want to put a model to use, you should take the best model and its hyperparameters and retrain it. During cross-validation we obtained 5 fitted models and cannot say which of them is best, so it is best to retrain on the full dataset.
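As a sketch of how this final selection-and-retraining step might look (my addition; the param_grid values are illustrative, not tuned), one can grid-search the penalty strength with cross-validation and then refit the best configuration on all of the data:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

# illustrative grid of penalty strengths (hypothetical values)
param_grid = {'alpha': [0.1, 1.0, 10.0, 100.0, 1000.0]}

search = GridSearchCV(Ridge(), param_grid, scoring='r2', cv=cv)  # cv: the KFold defined above
search.fit(X_poly, y)

print(search.best_params_, search.best_score_)
best_model = search.best_estimator_  # with the default refit=True, this is already refit on all of X_poly, y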

Origin: blog.csdn.net/nkufang/article/details/129648037