Multiple Linear Regression - Lasso

Table of contents

1. Lasso and multicollinearity

2. The core role of Lasso: feature selection

3. Select the best regularization parameter value 


1. Lasso and multicollinearity

        The full name of Lasso is the Least absolute shrinkage and selection operator. Because the name is so long, it is usually shortened to Lasso. Like ridge regression, Lasso is an algorithm used to deal with multicollinearity, but Lasso penalizes the L1 norm of the coefficient vector w (the L1 norm is the sum of the absolute values of the coefficients) multiplied by the regularization coefficient \alpha, so the loss function of Lasso is:

min\left \| Xw-y \right \|_{2}^{2}+\alpha \left \| w \right \|_{1}

The derivation process of Lasso:

In ridge regression, the regularization coefficient \alpha adds a multiple of the identity matrix to the square matrix X^{T}X, which prevents the determinant of X^{T}X from being 0. The regularization term brought by the L1 norm, however, no longer contains w after differentiation, so it cannot have any impact on X^{T}X. In other words, Lasso cannot solve the problem of "exact correlation" between features: when we solve linear regression with least squares, if the system has no solution or raises a division-by-zero error, switching to Lasso will not fix anything.
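A minimal sketch of the "exact correlation" case (a hypothetical toy matrix, not the housing data used later): when one feature is an exact multiple of another, the determinant of X^{T}X is 0, and adding the L1 penalty does nothing to change X^{T}X.

import numpy as np

# Toy feature matrix: the second column is exactly twice the first
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
# The determinant of X^T X is 0, so (X^T X)^(-1) does not exist
print(np.linalg.det(X.T @ X))   # 0.0 (up to floating-point error)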

Ridge regression vs Lasso

Ridge regression can solve the problem where exact correlation between features makes the least squares method unusable; Lasso cannot.

 In reality, the "exact correlation" form of multicollinearity is rarely encountered; most multicollinearity is of the "highly correlated" kind. If we assume that the inverse of the square matrix X^{T}X exists, then:

w=(X^{T}X)^{-1}(X^{T}y-\frac{\alpha I}{2})
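A sketch of how this expression can be obtained, under the assumption (also made below) that every coefficient in w is positive, so that the derivative of the L1 term with respect to each w_{j} is simply \alpha (here I denotes a vector of ones, matching the notation above):

\frac{\partial}{\partial w}\left( \left \| Xw-y \right \|_{2}^{2}+\alpha \left \| w \right \|_{1} \right) = 2X^{T}(Xw-y)+\alpha I = 0
\Rightarrow X^{T}Xw = X^{T}y-\frac{\alpha I}{2}
\Rightarrow w = (X^{T}X)^{-1}\left( X^{T}y-\frac{\alpha I}{2} \right)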

By increasing \alpha, we add a negative term to the expression for w, which limits the size of the parameter estimates and prevents the coefficients w from being overestimated because of multicollinearity, which would make the model inaccurate. Lasso does not fundamentally solve the multicollinearity problem; it only limits its impact. Moreover, this is still under the assumption that all coefficients are positive. If the coefficients w cannot be assumed to be positive, the regularization parameter \alpha may have to be negative, and the more negative \alpha is, the stronger the restriction on collinearity.

A core difference between L1 and L2 regularization is their effect on the coefficients w: both compress the size of w, and the coefficients of features that contribute less to the label are smaller and more likely to be compressed. However, L2 regularization only pushes the coefficients as close to 0 as possible, while L1 regularization induces sparsity and compresses coefficients exactly to 0. This property makes Lasso the tool of choice for feature selection in linear models.

2. The core role of Lasso: feature selection

class sklearn.linear_model.Lasso(alpha=1.0, fit_intercept=True, normalize=False, precompute=False, copy_X=True, max_iter=1000, tol=0.001, warm_start=False, positive=False, random_state=None, selection='cyclic')

 In sklearn, the class Lasso is used to perform Lasso regression. Among its many parameters, the most important is alpha, which sets the regularization coefficient. Another useful parameter is positive: when it is True, the coefficients returned by Lasso are forced to be positive.
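A minimal sketch of the effect of positive=True (on hypothetical random toy data, not the housing data below): the true second coefficient is negative, but with positive=True Lasso is not allowed to return a negative value for it.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.rand(100) * 0.1

# positive=True forces every returned coefficient to be >= 0
lasso_pos = Lasso(alpha=0.1, positive=True).fit(X, y)
print(lasso_pos.coef_)   # no negative coefficients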

An example with the California housing data:

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge,LinearRegression,Lasso
from sklearn.model_selection import train_test_split as TTS
from sklearn.datasets import fetch_california_housing as fch
import matplotlib.pyplot as plt
housevalue=fch()
x=pd.DataFrame(housevalue.data)
y=housevalue.target
x.columns=["住户收入中位数","房屋使用年代中位数","平均房间数目","平均卧室数目","街区人口","平均入住率","街区的纬度",
         "街区的经度"]
xtrain,xtest,ytrain,ytest=TTS(x,y,test_size=0.3,random_state=420)
for i in [xtrain,xtest]:
    i.index=range(i.shape[0])
# Fit with linear regression
reg=LinearRegression().fit(xtrain,ytrain)
(reg.coef_*100).tolist()
[43.735893059684,
 1.0211268294493827,
 -10.780721617317681,
 62.643382753637766,
 5.2161253534695196e-05,
 -0.3348509646333501,
 -41.3095937894772,
 -42.62109536208474]
# Fit with ridge regression
Ridge_=Ridge(alpha=0).fit(xtrain,ytrain)
(Ridge_.coef_*100).tolist()
[43.73589305968356,
 1.0211268294493694,
 -10.780721617316962,
 62.6433827536353,
 5.2161253532548055e-05,
 -0.3348509646333529,
 -41.30959378947995,
 -42.62109536208777]
# Fit with Lasso
lasso_=Lasso(alpha=0).fit(xtrain,ytrain)
(lasso_.coef_*100).tolist()
<ipython-input-11-69dcd6f67a03>:2: UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator
  lasso_=Lasso(alpha=0).fit(xtrain,ytrain)
D:\Anaconda\Anaconda\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:648: UserWarning: Coordinate descent with no regularization may lead to unexpected results and is discouraged.
  model = cd_fast.enet_coordinate_descent(
D:\Anaconda\Anaconda\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:648: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.770e+03, tolerance: 1.917e+00 Linear regression models with null weight for the l1 regularization term are more efficiently fitted using one of the solvers implemented in sklearn.linear_model.Ridge/RidgeCV instead.
  model = cd_fast.enet_coordinate_descent(
[43.73589305968403,
 1.0211268294494058,
 -10.780721617317653,
 62.643382753637724,
 5.2161253532678864e-05,
 -0.33485096463335745,
 -41.30959378947717,
 -42.62109536208475]

The warnings say:

1. The regularization coefficient is 0, so the algorithm does not converge well. If you want the regularization coefficient to be 0, use linear regression instead.

2. Coordinate descent without a regularization term may lead to unexpected results and is discouraged.

3. The objective did not converge; you may want to increase the number of iterations. Fitting a model with a very small alpha may also cause precision problems.

Set alpha to 0.01:

# Fit with ridge regression
Ridge_=Ridge(alpha=0.01).fit(xtrain,ytrain)
(Ridge_.coef_*100).tolist()
[43.73575720621553,
 1.0211292318121377,
 -10.78046033625102,
 62.64202320775469,
 5.217068073227091e-05,
 -0.3348506517067568,
 -41.309571432294405,
 -42.621053889327314]
# Fit with Lasso
lasso_=Lasso(alpha=0.01).fit(xtrain,ytrain)
(lasso_.coef_*100).tolist()
[40.10568371834486,
 1.093629260786014,
 -3.7423763610244563,
 26.524037834897197,
 0.00035253685115039417,
 -0.32071293948878005,
 -40.064830473448424,
 -40.81754399163315]

Increase the regularization coefficient

# Fit with ridge regression
Ridge_=Ridge(alpha=10**4).fit(xtrain,ytrain)
(Ridge_.coef_*100).tolist()
[34.62081517607707,
 1.5196170869238759,
 0.3968610529209999,
 0.915181251035547,
 0.0021739238012248533,
 -0.34768660148101127,
 -14.73696347421548,
 -13.435576102527182]
# Fit with Lasso
lasso_=Lasso(alpha=10**4).fit(xtrain,ytrain)
(lasso_.coef_*100).tolist()
[0.0, 0.0, 0.0, -0.0, -0.0, -0.0, -0.0, -0.0]
# Fit with Lasso (alpha=1)
lasso_=Lasso(alpha=1).fit(xtrain,ytrain)
(lasso_.coef_*100).tolist()
[14.581141247629423,
 0.6209347344423876,
 0.0,
 -0.0,
 -0.00028065986329009983,
 -0.0,
 -0.0,
 -0.0]
plt.plot(range(1,9),(reg.coef_*100).tolist(),color="red",label="LR")
plt.plot(range(1,9),(Ridge_.coef_*100).tolist(),color="orange",label="Ridge")
plt.plot(range(1,9),(lasso_.coef_*100).tolist(),color="k",label="Lasso")
plt.plot(range(1,9),([0]*8),color="grey",linestyle="--")
plt.xlabel('w') # the horizontal axis indexes the coefficient for each feature
plt.legend()
plt.show()

It can be seen that, compared with ridge regression, the L1 regularization term of Lasso penalizes the coefficients much more heavily and compresses them all the way to 0, so it can be used for feature selection. For this reason, we usually let Lasso's regularization coefficient \alpha vary within a small range to find the best value.
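A quick way to summarize the feature-selection effect seen in the plot is to count the non-zero coefficients of the models fitted above (reg is the linear regression, lasso_ is the alpha=1 fit):

import numpy as np
# 8 non-zero coefficients for linear regression vs. 3 for Lasso(alpha=1),
# judging from the coefficient lists printed above
print((reg.coef_ != 0).sum())
print((lasso_.coef_ != 0).sum())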

3. Select the best regularization parameter value 

class sklearn.linear_model.LassoCV(eps=0.001, n_alphas=100, alphas=None, fit_intercept=True, normalize=False, precompute='auto', max_iter=1000, tol=0.0001, copy_X=True, cv='warn', verbose=False, n_jobs=None, positive=False, random_state=None, selection='cyclic')

The parameters of the cross-validated Lasso class look different from those of ridge regression. This is because Lasso is far more sensitive to the value of alpha. Since Lasso reacts so strongly to changes in the regularization coefficient, we tend to let \alpha vary within a very small range (not something like the interval from 0.01 to 0.02, which is still too large for Lasso), so an important concept, the "regularization path", is used to define how the regularization coefficient changes:

Regularization path
Suppose the feature matrix has n features, with feature vectors x_{1},x_{2},... x_{n}. For each value of \alpha, we obtain a set of parameters w corresponding to these features, containing n+1 values w_{0},w_{1},w_{2},... w_{n} (including the intercept). This parameter vector can be viewed as a point in parameter space. Different values of \alpha give different points, and the sequence formed by all of these points is called the regularization path.

We call the ratio of the smallest \alpha on this regularization path to the largest, \frac{\alpha_{min}}{\alpha_{max}}, the length of the regularization path (length of the path). In sklearn, by specifying the length of the regularization path (that is, limiting the ratio between the smallest and largest \alpha) and the number of \alpha values along the path, we can let sklearn generate the \alpha values automatically. This saves us from having to hand-build a list of very small \alpha values for the cross-validation class; the class LassoCV computes them itself.
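A rough sketch of what this means (assuming, as described above, that eps is the ratio \alpha_{min}/\alpha_{max} and n_alphas is the number of points placed on a log scale between them; alpha_max = 1.0 is a hypothetical value chosen only for illustration):

import numpy as np

alpha_max = 1.0      # hypothetical largest alpha on the path
eps = 0.001          # length of the path: alpha_min / alpha_max
n_alphas = 100       # number of alphas along the path
alphas = np.logspace(np.log10(alpha_max * eps), np.log10(alpha_max), n_alphas)
print(alphas.min(), alphas.max())   # 0.001 1.0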

Similar to ridge regression's cross-validation class, besides performing cross-validation LassoCV also builds a model by itself: it first finds the best regularization parameter and then fits a model under that parameter. Note that the model evaluation metric of LassoCV is the mean squared error, while the evaluation metric of ridge regression's cross-validation class can be set by the user and defaults to R^{2}.

Parameter    Meaning
eps          Length of the regularization path, default 0.001
n_alphas     Number of \alpha values along the regularization path, default 100
alphas       Tuple of regularization parameter values to be tested, default None. When not given, eps and n_alphas are used to generate the \alpha values for cross-validation automatically
cv           Number of cross-validation folds, default 3-fold cross-validation

Attribute    Meaning
alpha_       The best regularization parameter selected by cross-validation
alphas_      The regularization parameters used in cross-validation, generated automatically from the length of the regularization path and the number of \alpha values along it
mse_path_    All cross-validation results (mean squared errors)
coef_        The coefficients of the model built under the best regularization parameter

Practice code:

from sklearn.linear_model import LassoCV
import numpy as np
# Build our own range of alpha values for Lasso to choose from
alpharange=np.logspace(-10,-2,200,base=10)
# Generates values from 10**(-10) to 10**(-2), evenly spaced on a log scale (base 10)
# Fit the model
lasso_=LassoCV(alphas=alpharange
              ,cv=5
              ).fit(xtrain,ytrain)
# Inspect the best regularization coefficient selected by cross-validation
lasso_.alpha_
0.0020729217795953697
lasso_.mse_path_.mean(axis=1)  # returns the mean squared error for each alpha; in ridge regression the axis is axis=0

Ridge regression (RidgeCV) uses leave-one-out validation, so its cross-validation results contain one value per sample for each alpha. To average the cross-validation results for each alpha we therefore take the mean along axis=0 (across rows), with each column corresponding to an alpha.

LassoCV instead returns the result of each cross-validation fold for each alpha, so to find the mean cross-validation result for each alpha we take the mean along axis=1 (across columns), with each row corresponding to an alpha.
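To make the axis choice concrete, we can inspect the shape of mse_path_ for the LassoCV model fitted above (with the 200 alphas and 5 folds used there, the expected shapes are shown as comments):

print(lasso_.mse_path_.shape)               # (n_alphas, n_folds), here (200, 5)
print(lasso_.mse_path_.mean(axis=1).shape)  # one mean MSE per alpha, here (200,)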

# Coefficients of the model obtained under the best regularization coefficient
lasso_.coef_
array([ 4.29867301e-01,  1.03623683e-02, -9.32648616e-02,  5.51755252e-01,
        1.14732262e-06, -3.31941716e-03, -4.10451223e-01, -4.22410330e-01])
lasso_.score(xtest,ytest)
0.6038982670571434
# How does this compare with linear regression?
reg=LinearRegression().fit(xtrain,ytrain)
reg.score(xtest,ytest)
0.6084558760596188
# Use LassoCV's built-in regularization path length and number of alphas along the path to build the alpha range automatically
ls_=LassoCV(eps=0.00001
           ,n_alphas=300
           ,cv=5
           ).fit(xtrain,ytrain)
ls_.alpha_
0.0020954551690628557
ls_.score(xtest,ytest)
0.60389154238192
ls_.coef_
array([ 4.29785372e-01,  1.03639989e-02, -9.31060823e-02,  5.50940621e-01,
        1.15407943e-06, -3.31909776e-03, -4.10423420e-01, -4.22369926e-01])

Source: blog.csdn.net/weixin_60200880/article/details/128074532