The principles of Lasso regression and ridge regression
When learning the roles of and differences between L1 and L2 regularization, we always see a picture like this:
This picture visually explains the different constraining effects of L1 and L2 on a linear model.
At first, I didn't really understand why it is drawn this way. For example:
1. Will the L1-norm contour and the squared-error contour really intersect on a coordinate axis?
2. Can Lasso regression only use the sum of squared errors as its loss, or can the loss be replaced by, say, cross entropy?
3. Besides the L1 norm and L2 norm, are there other regularization methods, and what are the differences between them?
For the last question, see my other blog post, Lasso regression series three: L0, L1, L2, L2,1 norms in machine learning.
Now that I have figured these out, I have sorted out my notes, combined with several good blogs found online, and share them with you here. If there are any deficiencies or mistakes, please correct me.
Overview
A regression model using an L1 regularization term is called Lasso regression, and a regression model using an L2 regularization term is called Ridge regression.
Therefore, as long as an L1 regularization term is added to a regression problem, it can be called Lasso regression; it is not limited to the case where the sum of squared errors is the loss.
In this article, we will first see how the solution of a linear regression problem solved by least squares changes when the L1 norm (Lasso regression) or the L2 norm (ridge regression) is added, so as to better understand how to use Lasso regression and ridge regression.
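As a quick check of this point, here is a minimal scikit-learn sketch (the synthetic data and all parameter values are my own choices for illustration): an L1 penalty attached to a cross-entropy (logistic) loss still drives most coefficients to exactly zero, just like classic Lasso.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# only the first feature actually influences the labels
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

# L1-penalized logistic regression: cross-entropy loss + L1 penalty on the weights
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
print("nonzero coefficients:", np.count_nonzero(clf.coef_))
```

Smaller `C` means stronger regularization here (it is the inverse of the penalty strength), so most of the nine irrelevant coefficients come out exactly zero.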
Least Squares Estimation of Linear Models
When estimating parameters for linear models, the method of least squares can be used.
Described in mathematical language, the linear model can be expressed as:
$y = X\beta + \epsilon, \qquad E(\epsilon) = 0, \quad Cov(\epsilon) = \sigma^2 I$
where $y$ is the $n \times 1$ label vector, $X$ is the $n \times p$ feature matrix (corresponding to the data, $n$ is the number of samples and $p$ is the number of features), $\beta$ is the $p \times 1$ unknown parameter vector to be estimated, and $\epsilon$ is the random error.
The least squares method is the basic way to estimate the parameter vector $\beta$. The idea is to make the error $\epsilon = y - X\beta$ as small as possible, that is, to make
$Q(\beta) = ||\epsilon||^2 = ||y - X\beta||^2 = (y - X\beta)^T(y - X\beta)$
as small as possible.
Since $Q(\beta)$ is convex, any stationary point is a global minimum, so setting the partial derivatives to zero gives the $\beta$ that minimizes the above formula, namely:
$\hat\beta = (X^TX)^{-1}X^Ty$
Combining this with matrix theory: when $rank(X) = p$, $X^TX$ is invertible, $\hat\beta$ has a unique solution, and $E(\hat\beta) = \beta$, so $\hat\beta$ is an unbiased estimate of $\beta$. When $rank(X) < p$, the matrix $X$ is not of full column rank, $X^TX$ is singular, and this estimate of $\beta$ cannot be obtained. There are generally two reasons for $rank(X) < p$: 1. the number of samples is smaller than the number of features; 2. even if the number of samples is large, there is a linear relationship between variables (features). Lasso regression and Ridge regression are used to solve exactly this problem.
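To make this concrete, here is a small NumPy sketch (the dimensions and data are invented for illustration) that computes $\hat\beta = (X^TX)^{-1}X^Ty$ in the full-rank case, and then shows that $X^TX$ loses rank when there are fewer samples than features:

```python
import numpy as np

rng = np.random.default_rng(0)

# full-rank case: n > p with independent Gaussian features
n, p = 100, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + 0.01 * rng.normal(size=n)

# least-squares estimate: solve (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_hat, beta, atol=0.1))  # True: recovers beta closely

# rank-deficient case: fewer samples than features
X_bad = rng.normal(size=(3, 5))
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 3, so the 5x5 matrix X^T X is singular
```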
Lasso regression and Ridge regression
Lasso regression adds the 1-norm of the weights $\beta$ to the objective function (the definition of norm in machine learning differs slightly from that in mathematics; for the specific definitions see [https://xiongyiming.blog.csdn.net/article/details/81673491]), that is:
$Q(\beta) = ||y - X\beta||^2_2 + \lambda ||\beta||_1 \iff \arg\min ||y - X\beta||^2 \quad s.t. \quad \sum |\beta_j| \leq s$
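A minimal usage sketch with scikit-learn (synthetic data; note that sklearn's `Lasso` minimizes $\frac{1}{2n}||y - X\beta||^2_2 + \alpha||\beta||_1$, so its `alpha` plays the role of $\lambda$ up to scaling):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
# only the first two features carry signal
beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
print("nonzero coefficients:", np.count_nonzero(lasso.coef_))
```

The penalty zeroes out the eight pure-noise features while keeping (shrunken versions of) the two informative ones.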
Ridge regression adds the squared 2-norm of the weights $\beta$ to the objective function, that is:
$Q(\beta) = ||y - X\beta||^2_2 + \lambda ||\beta||_2^2 \iff \arg\min ||y - X\beta||^2 \quad s.t. \quad \sum \beta_j^2 \leq s$
Solving the above formula, we can obtain the ridge estimate of $\beta$:
$\hat\beta(\lambda) = (X^TX + \lambda I)^{-1}X^Ty$
Adding $\lambda I$ ensures that $X^TX + \lambda I$ is full rank and invertible. Of course, $\hat\beta(\lambda)$ is then a biased estimate of $\beta$.
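A quick NumPy check (invented dimensions) that the ridge system is solvable even when $n < p$, where ordinary least squares fails:

```python
import numpy as np

rng = np.random.default_rng(1)

# fewer samples than features: X^T X is singular, OLS has no unique solution
n, p = 20, 50
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

lam = 1.0
# ridge estimate: (X^T X + lambda I)^{-1} X^T y, well-defined for any lambda > 0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(beta_ridge.shape)  # (50,)
```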
Why does Lasso regression produce sparse solutions more easily?
Let's look at this picture again:
The squared-error contours are the equipotential lines corresponding to $Q(\beta) = ||y - X\beta||^2$.
Lasso regression corresponds to the L1-norm contour and Ridge regression corresponds to the L2-norm contour; both use the regularization parameter $\lambda$ to adjust the degree of constraint on the parameters $\beta$.
Lasso regression easily produces sparse solutions because the L1-norm ball has non-differentiable corners lying on the coordinate axes, and the probability that the contours of $Q(\beta) = ||y - X\beta||^2$ first touch the constraint region at one of these corners is much higher. In Ridge regression, the L2 norm is differentiable everywhere, so the probability that the contours of $Q(\beta) = ||y - X\beta||^2$ touch the constraint region exactly on a coordinate axis is much smaller.
In addition, for the L1 norm, the larger $\lambda$ is, the smaller the feasible region for $||\beta||_1$, and the greater the probability that the squared-error contours and the L1-norm contour intersect on a coordinate axis, that is, the greater the probability that elements of $\beta$ become 0. Conversely, the smaller $\lambda$ is, the smaller the probability that elements of $\beta$ become 0.
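The sparsity difference is easy to observe empirically. Here is a scikit-learn sketch (synthetic data, an arbitrary `alpha`) counting exact zeros in the fitted coefficients of both models:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [4.0, -3.0, 2.0]   # only 3 informative features
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# L1 drives irrelevant coefficients exactly to zero; L2 only shrinks them
print("Lasso exact zeros:", np.sum(np.abs(lasso.coef_) < 1e-8))
print("Ridge exact zeros:", np.sum(np.abs(ridge.coef_) < 1e-8))
```

Lasso zeroes out most of the 17 noise features, while Ridge leaves every coefficient small but nonzero.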