Linear Model Optimization: Ridge Regression and Lasso Regression

Underfitting and Overfitting

The principle and implementation of the linear model LinearRegression were introduced earlier. In the code verification there, the wave dataset gave a training set score of 0.67 and a test set score of 0.66. The two scores are very close, but both are far from the perfect score of 1, which is a sign of underfitting. One way to address this is to add more effective features, as sketched below.
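For example, here is a minimal sketch of that idea, assuming X and y come from the same wave dataset loader used before: the original feature is expanded with polynomial terms via scikit-learn's PolynomialFeatures before fitting LinearRegression (the helper name and the degree are illustrative choices, not from the original article).

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression


def lr_with_more_features(X, y, degree=3):
    # Expand the original feature(s) with polynomial terms so the linear
    # model has more effective features to work with
    X_poly = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X_poly, y, random_state=0)

    lr = LinearRegression().fit(X_train, y_train)
    print('training set score: {:.2f}'.format(lr.score(X_train, y_train)))
    print('test set score: {:.2f}'.format(lr.score(X_test, y_test)))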

In addition to underfitting, there is another common situation: overfitting. Consider the following example, which applies LinearRegression to the bostons dataset:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


def lr_by_sklearn(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    lr = LinearRegression().fit(X_train, y_train)
    print('lr: training set score: {:.2f}'.format(lr.score(X_train, y_train)))
    print('lr: test set score: {:.2f}'.format(lr.score(X_test, y_test)))


if __name__ == '__main__':
    # bostons() is assumed to return the feature matrix X and the target vector y
    # of the Boston housing dataset used throughout this article
    X_arr, y_arr = bostons()
    lr_by_sklearn(X_arr, y_arr)

After running, the result is

lr: training set score: 0.95
lr: test set score: 0.61

The results show that the model performs well on the training set but poorly on the test set, which is overfitting. Since the goal of training a model is for it to perform well on the test set, that is, to generalize, overfitting is not the result we want.

Characteristics of Overfitting

To address overfitting, let's first look at what a linear model looks like when overfitting occurs. On inspection, the weight coefficients of an overfit linear model are often very large. This is because, in order to fit every data point well, the fitted function tends to fluctuate strongly: in some small intervals the function value changes drastically. Since the inputs themselves vary only over a limited range, the weight coefficients must become large to produce such sharp changes.

Still using the example above, we print the maximum and minimum values of the weight coefficients; for comparison, we also print the maximum and minimum values of X and y. The ranges of X and y are fairly small, but the weight coefficients are about two orders of magnitude larger.

In[3]: [max(lr.coef_), min(lr.coef_)]
Out[3]: [2980.7814393954595, -2239.8694355606854]
In[4]: [X.min(), X.max()]
Out[4]: [0.0, 1.0]
In[5]: [y.min(), y.max()]
Out[5]: [5.0, 50.0]

Ridge Regression and Lasso Regression

Regularization adds an explicit constraint (a penalty term) to the model's objective function to limit the size of the weight coefficients, thereby reducing overfitting to some extent. The two most common forms are Ridge regression and Lasso regression.
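As a small, self-contained sketch (not from the original article), the two penalty terms can be written directly in Python; lam stands for the regularization strength lambda:

import numpy as np


def ridge_penalty(w, lam):
    # L2 penalty used by Ridge regression: lambda * sum of squared weights
    return lam * np.sum(w ** 2)


def lasso_penalty(w, lam):
    # L1 penalty used by Lasso regression: lambda * sum of absolute weights
    return lam * np.sum(np.abs(w))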

Ridge regression

Ridge regression is a regularization method that adds an L2 penalty to the objective function of the linear model:

$$J = \sum_{i=1}^n \left( y_i - (w_i x_i + b_i) \right)^2 + \lambda \sum_{i=1}^n w_i^2$$

$$J = \sum_{i=1}^n \left[ \left( (y_i - b_i) - w_i x_i \right)^2 + \lambda w_i^2 \right]$$

Define the bracketed part of the summation as a new variable $f_i$ and merge $b_i$ into $y_i$:

$$f = (y - wx)^2 + \lambda w^2$$

where the subscript $i$ has been dropped for convenience. To find the optimal $w$, take the first-order derivative of $f$ with respect to $w$ and set it to 0:

$$f' = -2(y - wx)x + 2\lambda w = 0$$

which finally gives

$$w = \frac{xy}{x^2 + \lambda}$$

Clearly, introducing $\lambda$ reduces the value of $w$: the larger $\lambda$ is, the smaller $w$ becomes and the lower the risk of overfitting, but the bias of the model also increases. Here the model bias means the absolute difference between the true value and the predicted value:

$$|y - wx| = \left| y - \frac{x^2 y}{x^2 + \lambda} \right|$$

When $\lambda = 0$ the bias is 0; the larger $\lambda$ is, the larger the bias. In other words, $\lambda$ balances model bias against the risk of overfitting.
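To see this trade-off concretely, here is a small sketch with made-up numbers (x = 2, y = 3 are arbitrary, not from the article): as lambda grows, w shrinks and the bias |y - wx| grows.

# Single-feature ridge solution w = x*y / (x^2 + lambda) for one data point;
# larger lambda shrinks w and increases the bias |y - w*x|
x, y = 2.0, 3.0
for lam in [0.0, 0.5, 2.0, 10.0]:
    w = x * y / (x ** 2 + lam)
    print('lambda = {:5.1f}  w = {:.3f}  bias = {:.3f}'.format(lam, w, abs(y - w * x)))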

Lasso regression

Lasso regression adds an L1 penalty on the basis of the linear model:

$$J = \sum_{i=1}^n \left( y_i - (w_i x_i + b_i) \right)^2 + \lambda \sum_{i=1}^n |w_i|$$

Deriving the optimal solution of this objective is somewhat involved, so here we quote the result from the referenced article directly. Define

$$m_j = \sum_{i=1}^n x_{ij} \left( y_i - \sum_{k \neq j} w_k x_{ik} \right), \qquad n_j = \sum_{i=1}^n x_{ij}^2$$

Then the value of $w_j$ is

$$w_j = \begin{cases} (m_j - \frac{\lambda}{2}) / n_j, & \text{if } m_j > \frac{\lambda}{2} \\ 0, & \text{if } m_j \in [-\frac{\lambda}{2}, \frac{\lambda}{2}] \\ (m_j + \frac{\lambda}{2}) / n_j, & \text{if } m_j < -\frac{\lambda}{2} \end{cases}$$
We need not dwell on the meaning of every variable here. The interesting point is that some $w_j$ become exactly 0, which means the model has screened out a batch of unimportant features and set their weight coefficients to 0. This is one of the big differences between Ridge regression and Lasso regression: Ridge tends to make the $w$ values smaller, while Lasso sets some of them exactly to 0. A minimal sketch of this coordinate-wise update is given below.
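For concreteness, here is a small sketch (not from the quoted article) of one coordinate-descent update that applies the piecewise rule above; X, y, w, j, and lam are assumed inputs, and the intercept is ignored for brevity.

import numpy as np


def lasso_coordinate_update(X, y, w, j, lam):
    # m_j = sum_i x_ij * (y_i - sum_{k != j} w_k * x_ik)
    residual_without_j = y - X @ w + w[j] * X[:, j]
    m_j = X[:, j] @ residual_without_j
    # n_j = sum_i x_ij^2
    n_j = X[:, j] @ X[:, j]
    # Soft-thresholding: a small |m_j| gives an exactly-zero coefficient
    if m_j > lam / 2:
        return (m_j - lam / 2) / n_j
    elif m_j < -lam / 2:
        return (m_j + lam / 2) / n_j
    return 0.0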

Simulation example

The following code evaluates linear regression, Ridge regression, and Lasso regression on the bostons dataset, and prints the core metrics for comparison:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV, LassoCV
import numpy as np


def lr_by_sklearn(X, y):
    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Plain linear model (no regularization)
    lr = LinearRegression().fit(X_train, y_train)
    print('lr: training set score: {:.2f}'.format(lr.score(X_train, y_train)))
    print('lr: test set score: {:.2f}'.format(lr.score(X_test, y_test)))
    print('lr: min_coef: {:.2f}, max_coef:  {:.2f}'.format(min(lr.coef_), max(lr.coef_)))
    print('\n=================================\n')

    # Build a grid of lambda values and use RidgeCV cross-validation to find the best one
    Lambdas = np.logspace(-5, 2, 20)
    rdCV = RidgeCV(alphas=Lambdas, cv=5)
    rdCV.fit(X_train, y_train)
    print('Ridge_best_lambda: {}'.format(rdCV.alpha_))

    # Report the metrics of the best Ridge model
    best_rd = Ridge(alpha=rdCV.alpha_).fit(X_train, y_train)
    print('rd: training set score: {:.2f}'.format(best_rd.score(X_train, y_train)))
    print('rd: test set score: {:.2f}'.format(best_rd.score(X_test, y_test)))
    print('rd: number of features used: {}'.format(np.sum(best_rd.coef_ != 0)))
    print('rd: min_coef:  {:.2f}, max_coef:  {:.2f}'.format(min(best_rd.coef_), max(best_rd.coef_)))
    print('\n=================================\n')

    # Build a grid of lambda values and use LassoCV cross-validation to find the best one
    Lambdas = np.logspace(-5, 2, 20)
    lsCV = LassoCV(alphas=Lambdas, max_iter=1000000, cv=5)
    lsCV.fit(X_train, y_train)
    print('Lasso_best_lambda: {}'.format(lsCV.alpha_))

    # Report the metrics of the best Lasso model
    best_ls = Lasso(alpha=lsCV.alpha_, max_iter=1000000).fit(X_train, y_train)
    print('ls: training set score: {:.2f}'.format(best_ls.score(X_train, y_train)))
    print('ls: test set score: {:.2f}'.format(best_ls.score(X_test, y_test)))
    print('ls: number of features used: {}'.format(np.sum(best_ls.coef_ != 0)))
    print('ls: min_coef: {:.2f}, max_coef: {:.2f}'.format(min(best_ls.coef_), max(best_ls.coef_)))


if __name__ == '__main__':
    X_arr, y_arr = bostons()
    lr_by_sklearn(X_arr, y_arr)

The output is shown below. Because of the regularization, the training set scores of the Ridge model (rd) and the Lasso model (ls) are slightly lower than that of the plain linear model (lr), dropping from 0.95 to 0.93, but their test set scores improve markedly, from 0.61 to 0.76 or higher. The weight coefficients of rd and ls also shrink from the thousands down to the tens. As for the number of features used, rd keeps all 104 features, while ls uses only 60 of them.

lr: training set score: 0.95
lr: test set score: 0.61
lr: min_coef: -2239.87, max_coef:  2980.78
=================================
Ridge_best_lambda: 0.04832930238571752
rd: training set score: 0.93
rd: test set score: 0.76
rd: number of features used: 104
rd: min_coef:  -21.64, max_coef:  27.12
=================================
Lasso_best_lambda: 0.001623776739188721
ls: training set score: 0.93
ls: test set score: 0.77
ls: number of features used: 60
ls: min_coef: -29.34, max_coef: 47.46
