Lasso regression series two: the principle of Lasso regression/ridge regression

The principle of Lasso regression/ridge regression

When learning about the roles of and differences between L1 and L2 regularization, we always see a picture like this:

Taken from [Watermelon Book, Chapter 11 "Feature Selection and Sparse Learning", Section 11.4 "Embedded Selection and L1 Regularization"]
This picture visually explains the different constraint effects of L1 and L2 on the linear model.

At first, I didn't really understand why the picture is drawn this way. For example:

1. Will the L1-norm contour and the squared-error contour intersect on a coordinate axis?

2. Can Lasso regression only use the sum of squared errors as its loss, or can the loss be replaced by, for example, cross-entropy?

3. In addition to L1-norm and L2-norm, are there other regularization methods? What is the difference between them?
See my other blog post, Lasso regression series three: L0, L1, L2, L2,1 norms in machine learning.

Now that I have figured it out, I have combined several good blog posts from the Internet, sorted the material out, and am sharing it here. If there are any deficiencies or mistakes, please correct me.

Overview

A regression model that uses an L1 regularization term is called Lasso regression, and a regression model that uses an L2 regularization term is called Ridge regression.

Therefore, as long as an L1 regularization term is added to a regression problem, it can be called Lasso regression; it is not limited to the case where the sum of squared errors is used as the loss.

In this article, we first look at how the solution of a linear regression problem solved by least squares changes when an L1 norm (Lasso regression) or an L2 norm (Ridge regression) is added, so as to better understand how to use Lasso regression and Ridge regression.

Least Squares Estimation of Linear Models

When estimating parameters for linear models, the method of least squares can be used.

Described in mathematical language, the linear model can be expressed as:
$$y = X\beta + \epsilon, \qquad E(\epsilon) = 0, \quad Cov(\epsilon) = \sigma^2 I$$

where $y$ is the $n \times 1$ label vector, $X$ is the $n \times p$ feature matrix (corresponding to the data, with $n$ samples and $p$ features), $\beta$ is the $p \times 1$ unknown parameter vector to be estimated, and $\epsilon$ is the random error.

The least squares method is the basic way to estimate the parameter vector $\beta$. The idea is to make the error $\epsilon = y - X\beta$ as small as possible, that is, to make

$$Q(\beta) = ||\epsilon||^2 = ||y - X\beta||^2 = (y - X\beta)^T (y - X\beta)$$

as small as possible.

Since $Q(\beta)$ is convex, its local minimum is also its global minimum, so the value of $\beta$ at which the partial derivatives equal 0 makes the above expression reach its minimum, namely:

$$\hat\beta = (X^T X)^{-1} X^T y$$

Combining this with matrix theory: when $rank(X) = p$, $X^T X$ is invertible, $\beta$ has a unique solution, and $E(\hat\beta) = \beta$, so $\hat\beta$ is an unbiased estimate of $\beta$. When $rank(X) < p$, the matrix $X$ is not of full column rank, and we cannot obtain this estimate of $\beta$. There are generally two reasons for $rank(X) < p$: 1. the number of samples is smaller than the number of features; 2. even if there are enough samples, some variables (features) are linearly dependent. Lasso regression and Ridge regression are used to solve exactly this problem.
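As a concrete illustration, here is a minimal NumPy sketch of the closed-form least-squares estimate on made-up data; it also shows the rank-deficient case $n < p$, where $X^T X$ is singular and the formula breaks down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Full-rank case: n = 100 samples, p = 5 features, so rank(X) = p.
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.0, 0.7, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form least squares estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # close to beta_true

# Rank-deficient case: fewer samples than features (n < p),
# so X^T X is singular and the inverse in the formula does not exist.
X_bad = rng.normal(size=(3, 5))
XtX_bad = X_bad.T @ X_bad
print(np.linalg.matrix_rank(XtX_bad))  # 3 < 5: not invertible
```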

Lasso regression and Ridge regression

Lasso regression adds the 1-norm of the weights $\beta$ to the objective function (the definition of "norm" in machine learning differs from the one in mathematics; for the specific definition, see [https://xiongyiming.blog.csdn.net/article/details/81673491]), that is:

$$Q(\beta) = ||y - X\beta||^2_2 + \lambda ||\beta||_1 \iff \arg\min ||y - X\beta||^2 \quad s.t. \ \sum_j |\beta_j| \leq s$$
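To illustrate the equivalence between the penalty form and the constraint form, here is a hedged sketch using scikit-learn's `Lasso` (its `alpha` plays the role of $\lambda$, up to scikit-learn's own scaling of the squared-error term); the synthetic data and the `alpha` values are illustrative assumptions. As `alpha` grows, $||\beta||_1$ shrinks, which mirrors a smaller constraint value $s$.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
beta_true = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(7)])
y = X @ beta_true + 0.1 * rng.normal(size=200)

# Larger lambda (alpha) corresponds to a tighter constraint, so ||beta||_1 shrinks.
for alpha in [0.01, 0.1, 1.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha}: ||beta||_1 = {np.abs(coef).sum():.3f}")
```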

Ridge regression adds the squared 2-norm of the weights $\beta$ to the objective function, that is:

$$Q(\beta) = ||y - X\beta||^2_2 + \lambda ||\beta||_2^2 \iff \arg\min ||y - X\beta||^2 \quad s.t. \ \sum_j \beta_j^2 \leq s$$

Solving the above expression, we obtain the ridge estimate of $\beta$:

$$\hat\beta(\lambda) = (X^T X + \lambda I)^{-1} X^T y$$

The added $\lambda I$ ensures that $X^T X + \lambda I$ is full rank and invertible. Of course, $\hat\beta(\lambda)$ is a biased estimate.
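Below is a minimal NumPy sketch of this ridge solution on deliberately rank-deficient, made-up data; the value of $\lambda$ is an arbitrary illustrative choice. Even though $X^T X$ is singular, $X^T X + \lambda I$ is invertible and the estimate can be computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Deliberately rank-deficient design: n = 3 samples, p = 5 features,
# so X^T X is singular and plain least squares has no unique solution.
X = rng.normal(size=(3, 5))
y = rng.normal(size=3)
lam = 0.5  # illustrative regularization strength

# X^T X + lam * I has all eigenvalues >= lam > 0, so it is invertible.
beta_ridge = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y
print(beta_ridge)
```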

Why is Lasso regression easier to generate sparse solutions?

Let's look at this picture again:

[Figure: contours of the squared-error term together with the L1 and L2 norm constraint regions]

The squared-error contours are the level sets of $Q(\beta) = ||y - X\beta||^2$.

Lasso regression corresponds to the L1-norm contour and Ridge regression corresponds to the L2-norm contour; both use the regularization parameter $\lambda$ to adjust the degree to which the parameter $\beta$ is constrained.

Lasso regression easily produces sparse solutions because the L1-norm contour has non-differentiable corners that lie on the coordinate axes, and the probability that the contours of $Q(\beta) = ||y - X\beta||^2$ intersect the constraint region at these corners is much higher. In Ridge regression, the L2 norm is differentiable everywhere, so the probability that it intersects the contours of $Q(\beta) = ||y - X\beta||^2$ on a coordinate axis is much smaller.

In addition, for the L1 norm, the larger $\lambda$ is, the smaller the feasible range of $||\beta||_1$, and the greater the probability that the squared-error contours intersect the L1-norm contour on a coordinate axis, that is, the greater the probability that elements of $\beta$ become 0. Conversely, the smaller $\lambda$ is, the smaller the probability that elements of $\beta$ become 0.
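As a rough empirical check of this intuition, the following sketch fits scikit-learn's `Lasso` and `Ridge` on synthetic data in which only a few features are informative and counts how many coefficients end up at zero; the dataset, `alpha` values, and zero threshold are all illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]  # only the first 3 features are informative
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Lasso's coordinate-descent solver returns exact zeros; Ridge only shrinks.
print("Lasso coefficients equal to 0:", int(np.sum(lasso.coef_ == 0.0)))
print("Ridge coefficients below 1e-8:", int(np.sum(np.abs(ridge.coef_) < 1e-8)))
```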

References

L1, L2 regularization method

Lasso—principle and optimal solution
