Machine Learning Whiteboard Derivation Series (3) Notes: Linear Regression, Least Squares, and Regularized Ridge Regression


0 Notes

These notes follow [Machine Learning] [Whiteboard Derivation Series] [Collection 1~23]. While studying, I work through the derivations on paper along with the uploader; this blog is a second written pass over those notes. Depending on my own learning needs, I may add extra material where necessary.

Note: these notes are mainly for my own future review, and I really do type every word and formula myself. When I run into a complicated formula, since I have not learned LaTeX, I upload a handwritten photo instead (the phone camera may not capture it perfectly clearly, but I will do my best to keep the content fully visible). For that reason I mark the blog as [original]. If you think this is inappropriate, you can send me a private message, and based on your reply I will decide whether to make the post visible only to you or take some other action. Thank you!

This post contains the notes for (Series 3); the corresponding videos are: [(Series 3) Linear Regression 1: Least Squares and Its Geometric Meaning], [(Series 3) Linear Regression 2: Least Squares, Probabilistic Perspective, Gaussian Noise, MLE], [(Series 3) Linear Regression 3: Regularization, Ridge Regression, Frequentist Perspective], and [(Series 3) Linear Regression 4: Regularization, Ridge Regression, Bayesian Perspective].

The text starts below.


1 Least squares method for linear regression model

Given a data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where each $x_i$ is a $p$-dimensional vector, i.e. $x_i \in \mathbb{R}^p$, and each $y_i$ is a one-dimensional real number, i.e. $y_i \in \mathbb{R}$, for $i = 1, 2, \dots, N$. Let the matrix $X = (x_1, x_2, \dots, x_N)^T$, so $X$ is an $N \times p$ matrix, and $Y = (y_1, y_2, \dots, y_N)^T$, so $Y$ is an $N \times 1$ matrix.

Let the linear regression function be $f(w) = w^T x$, where $w = (w_1, w_2, \dots, w_p)^T$. The squared loss function of $f(w)$ (the least squares method) is:

$$L(w) = \sum_{i=1}^{N} \left\| w^T x_i - y_i \right\|^2 = (w^T X^T - Y^T)(Xw - Y) = w^T X^T X w - 2\, w^T X^T Y + Y^T Y$$

Before continuing, recall the matrix derivative formulas needed for the next step:

$$\frac{\partial (w^T A w)}{\partial w} = (A + A^T)\, w, \qquad \frac{\partial (w^T b)}{\partial w} = b$$

Continuing the derivation:

$$\hat{w} = \arg\min_{w} L(w), \qquad \frac{\partial L(w)}{\partial w} = 2\, X^T X w - 2\, X^T Y = 0 \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T Y$$

So the least squares estimate of $w$ is the last line above: $\hat{w} = (X^T X)^{-1} X^T Y$.
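To make the closed-form result concrete, here is a minimal sketch (my own addition, not part of the original handwritten notes) that computes $\hat{w} = (X^T X)^{-1} X^T Y$ with NumPy on randomly generated toy data; the names `X`, `Y`, `w_hat` and all the numbers are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical toy data: N samples, each x_i a p-dimensional vector.
rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))                 # N x p design matrix, rows are x_i^T
w_true = np.array([1.5, -2.0, 0.5])         # "true" parameters used to generate labels
Y = X @ w_true + 0.1 * rng.normal(size=N)   # y_i = w^T x_i + small noise

# Closed-form least squares estimate: w_hat = (X^T X)^{-1} X^T Y.
# Solving the linear system is numerically preferable to forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_hat)  # should be close to w_true
```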


2 Geometric meaning

2.1 The geometric meaning of the square loss function

[Figure: the sample points plotted in the $x$–$y$ plane together with the fitted straight line $f(w) = w^T x$.]
In the figure above, the horizontal and vertical axes are $x$ and $y$ respectively, and the straight line is $f(w) = w^T x$. For ease of drawing and explanation, assume the data set has three points, $D = \{(x_1, y_1), (x_2, y_2), (x_3, y_3)\}$. Substituting $x_1$, $x_2$, and $x_3$ into $f(w) = w^T x$ gives the three predictions of the linear regression model: $w^T x_1$, $w^T x_2$, and $w^T x_3$.

The geometric meaning of the squared loss function is therefore the sum, over all samples, of the squares of [the difference between the actual value $y_i$ and the model's predicted value $w^T x_i$], i.e. the sum of squared vertical distances from the points to the line.

2.2 Use geometric meaning to find a linear regression model

Now take a different view: write the linear regression function as $f(\beta) = x^T \beta$, treating $x^T$ as the coefficient of the function. For ease of drawing and explanation, suppose $p = 2$, i.e. each $x_i$ is a 2-dimensional vector, $x_i \in \mathbb{R}^2$, $i = 1, 2, \dots, N$; the $i$-th sample is $x_i = (x_{i1}, x_{i2})$, $Y = (y_1, y_2, \dots, y_N)$, and the data set is $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$. We also require $N \gg p = 2$ (in practice the number of samples $N$ is always much larger than the sample dimension $p$; $p = 2$ is only set here for drawing), so the data, viewed as the columns of $X$, spans a two-dimensional subspace (in general a $p$-dimensional subspace, but here $p = 2$).
[Figure: the vector $Y$ lying outside the two-dimensional subspace spanned by the data, and its projection $X\beta$ onto that subspace, with $Y - X\beta$ perpendicular to the subspace.]
As shown in the figure above, $Y$ generally does not lie in the two-dimensional subspace formed by the data (if it did, the linear regression line would fit every data point exactly, which is rarely the case; and again, this should really be a $p$-dimensional subspace, only here $p = 2$). So what we want to compute is the projection of the vector $Y$ onto the two-dimensional subspace formed by the data, because the projection is the point of the subspace at the shortest distance from the vector $Y$.

Since the projection of $Y$ lies in the two-dimensional subspace formed by the data, it can be written as a linear combination of the data; let the projection of $Y$ be $X\beta$ (I had a question here about why it is not $x^T \beta$, which I did not figure out). The vector $Y - X\beta$ is then normal to the two-dimensional subspace, and since the data vectors lie in that subspace we have $Y - X\beta \perp X$. From the relation between perpendicularity and the dot product:

$$X^T (Y - X\beta) = 0 \;\Rightarrow\; X^T X \beta = X^T Y \;\Rightarrow\; \beta = (X^T X)^{-1} X^T Y$$

Notice anything? The $\hat{w}$ found in section [1 Least squares method for linear regression model] is exactly the same as the $\beta$ above?! Why?

Because the two models, $f(w) = w^T x$ and $f(\beta) = x^T \beta$, both compute a one-dimensional real number, and $(x^T \beta)^T = \beta^T x = w^T x$ when $\beta$ plays the role of $w$; so it is entirely normal that $w$ and $\beta$ come out exactly the same.
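As a quick numerical check of this geometric picture (my own sketch, with made-up data), the coefficients obtained from the orthogonality condition $X^T(Y - X\beta) = 0$ give a residual perpendicular to every column of $X$ and coincide with the least squares estimate:

```python
import numpy as np

# Sketch: verify that projecting Y onto the column space of X gives the same
# coefficients as least squares, and that the residual is orthogonal to the data.
rng = np.random.default_rng(1)
N, p = 50, 2
X = rng.normal(size=(N, p))
Y = rng.normal(size=N)        # Y generally does NOT lie in span(columns of X)

beta = np.linalg.solve(X.T @ X, X.T @ Y)   # from X^T (Y - X beta) = 0
projection = X @ beta                      # the point in the subspace closest to Y
residual = Y - projection

print(X.T @ residual)   # ~ 0: the residual is perpendicular to each column of X
print(beta)             # identical to the least squares estimate w_hat
```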


3 Look at the least squares method from the perspective of probability

Let's now discuss the least squares method from the perspective of probability.

A sample is represented by $x$, which is treated as a random variable, $y$ is the label of the sample, and all samples are independent and identically distributed. Suppose the noise is $\varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$ and $y = f(w) + \varepsilon = w^T x + \varepsilon$; then $y \mid x; w \sim N(w^T x, \sigma^2)$. Following the second picture in the section [3 Probability Density Function of Gaussian Distribution] of [Machine Learning Whiteboard Derivation Series (2) Notes: Gaussian Distribution and Probability], the probability density function of $y \mid x; w$ is:

$$p(y \mid x; w) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y - w^T x)^2}{2\sigma^2} \right)$$

Now we want $w_{\mathrm{MLE}}$. Let the log-likelihood function $\mathcal{L}(w)$ be:

$$\mathcal{L}(w) = \log P(Y \mid X; w) = \log \prod_{i=1}^{N} p(y_i \mid x_i; w) = \sum_{i=1}^{N} \left( \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(y_i - w^T x_i)^2}{2\sigma^2} \right)$$

Then $w_{\mathrm{MLE}}$ is:

$$w_{\mathrm{MLE}} = \arg\max_{w} \mathcal{L}(w) = \arg\max_{w} \sum_{i=1}^{N} \left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right) = \arg\min_{w} \sum_{i=1}^{N} (y_i - w^T x_i)^2$$

Look carefully: isn't the expression being minimized exactly the squared loss function?!

Look again at the condition assumed in this section: the noise is $\varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$. In other words, [least squares estimation (LSE) of the linear model parameters] is equivalent to [maximum likelihood estimation (MLE) of the linear model parameters with noise $\varepsilon \sim N(0, \sigma^2)$].
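A small numerical sketch of this equivalence (my own illustration, with made-up data and an arbitrary noise level): minimizing the negative Gaussian log-likelihood with `scipy.optimize.minimize` recovers the same $w$ as the closed-form least squares solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, p, sigma = 200, 3, 0.5
X = rng.normal(size=(N, p))
w_true = np.array([2.0, -1.0, 0.3])
Y = X @ w_true + sigma * rng.normal(size=N)   # y = w^T x + eps, eps ~ N(0, sigma^2)

def neg_log_likelihood(w):
    # Up to additive constants: sum_i (y_i - w^T x_i)^2 / (2 sigma^2)
    r = Y - X @ w
    return 0.5 * np.sum(r ** 2) / sigma ** 2

w_mle = minimize(neg_log_likelihood, x0=np.zeros(p)).x
w_lse = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_mle, w_lse)   # the two estimates agree (up to optimizer tolerance)
```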


4 Regularization method: Ridge regression

Regularization is one of the methods for dealing with model overfitting. From the section [1 Least squares method for linear regression model], the squared loss function of the linear regression model is:

$$L(w) = \sum_{i=1}^{N} (w^T x_i - y_i)^2 = (Xw - Y)^T (Xw - Y)$$

and the $\hat{w}$ that minimizes it is:

$$\hat{w} = (X^T X)^{-1} X^T Y$$

Now add a regularization term $\lambda p(w)$ to the loss function $L(w)$ and let the objective function be $J(w) = L(w) + \lambda p(w)$, where $\lambda$ is a hyperparameter (a hyperparameter is a parameter set before the learning process starts, while the values of the other parameters are obtained through training) and $p(w)$ is called the penalty. There are two common choices of $p(w)$:

(1) the L1-norm regularization term, LASSO (Least Absolute Shrinkage and Selection Operator): $p(w) = \|w\|_1$, the 1-norm of the parameter $w$;

(2) the L2-norm regularization term, ridge regression: $p(w) = \|w\|_2^2 = w^T w$, the squared 2-norm of the parameter $w$.

Only ridge regression is discussed below.

4.1 Frequentist perspective

Substituting the squared loss function and the L2 regularization term $p(w) = w^T w$ into $J(w)$, we get:

$$J(w) = \sum_{i=1}^{N} (w^T x_i - y_i)^2 + \lambda\, w^T w = (w^T X^T - Y^T)(Xw - Y) + \lambda\, w^T w$$

The squared loss term here is the same as in the first formula of section [1 Least squares method for linear regression model], so its expansion is the same and can be substituted directly:

$$J(w) = w^T X^T X w - 2\, w^T X^T Y + Y^T Y + \lambda\, w^T w = w^T (X^T X + \lambda I)\, w - 2\, w^T X^T Y + Y^T Y$$

Continue by minimizing $J(w)$ with respect to $w$:

$$\hat{w} = \arg\min_{w} J(w), \qquad \frac{\partial J(w)}{\partial w} = 2\,(X^T X + \lambda I)\, w - 2\, X^T Y = 0 \;\Rightarrow\; \hat{w} = (X^T X + \lambda I)^{-1} X^T Y$$

So the ridge regression estimate of $w$ is the last line above: $\hat{w} = (X^T X + \lambda I)^{-1} X^T Y$.
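Here is a minimal sketch of the ridge solution (my own example with made-up data; the value $\lambda = 1.0$ is an arbitrary choice), computed side by side with the plain least squares estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, lam = 100, 5, 1.0
X = rng.normal(size=(N, p))
w_true = rng.normal(size=p)
Y = X @ w_true + 0.2 * rng.normal(size=N)

# Ridge regression closed form: w_hat = (X^T X + lambda I)^{-1} X^T Y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# For comparison, the plain least squares estimate (lambda = 0):
w_lse = np.linalg.solve(X.T @ X, X.T @ Y)
print(w_ridge)   # shrunk toward zero relative to w_lse
print(w_lse)
```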

4.2 Bayesian perspective

As before, let the noise be $\varepsilon$ with $\varepsilon \sim N(0, \sigma^2)$ and $y = f(w) + \varepsilon = w^T x + \varepsilon$, so $y \mid x; w \sim N(w^T x, \sigma^2)$. Now also assume the parameters follow a prior $w \sim N(0, \sigma_0^2)$. By Bayes' formula:

$$p(w \mid y) = \frac{p(y \mid w)\, p(w)}{p(y)} \;\propto\; p(y \mid w)\, p(w)$$

Since $y \mid x; w \sim N(w^T x, \sigma^2)$ and $w \sim N(0, \sigma_0^2)$, the probability density functions of $y \mid x; w$ and of $w$ are:

$$p(y \mid w) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y - w^T x)^2}{2\sigma^2} \right), \qquad p(w) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left( -\frac{\|w\|^2}{2\sigma_0^2} \right)$$

Then $w_{\mathrm{MAP}}$ is:

$$w_{\mathrm{MAP}} = \arg\max_{w} \log\big( p(Y \mid w)\, p(w) \big) = \arg\max_{w} \left( \sum_{i=1}^{N} -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right) - \frac{\|w\|^2}{2\sigma_0^2} = \arg\min_{w} \sum_{i=1}^{N} (y_i - w^T x_i)^2 + \frac{\sigma^2}{\sigma_0^2} \|w\|^2$$

Comparing with the objective function $J(w) = L(w) + \lambda p(w)$: the first term of the last expression is the squared loss $L(w)$ and the second term is the penalty $\lambda p(w)$ with $\lambda = \sigma^2 / \sigma_0^2$.

Look again at the conditions assumed in this section: the noise is $\varepsilon \sim N(0, \sigma^2)$ and the prior is $w \sim N(0, \sigma_0^2)$. In other words, [minimizing the squared loss with an L2 regularization term in the linear model] is equivalent to [maximum a posteriori estimation (MAP) of the linear model parameters with noise $\varepsilon \sim N(0, \sigma^2)$ and a Gaussian prior on $w$].
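To see the correspondence $\lambda = \sigma^2 / \sigma_0^2$ numerically, here is my own sketch (made-up data; the values of $\sigma$ and $\sigma_0$ are arbitrary) comparing the MAP estimate, found by minimizing the negative log-posterior, with the ridge closed form using that $\lambda$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, p = 150, 4
sigma, sigma0 = 0.5, 2.0           # noise std and prior std (arbitrary choices)
lam = sigma ** 2 / sigma0 ** 2     # lambda = sigma^2 / sigma_0^2

X = rng.normal(size=(N, p))
w_true = rng.normal(size=p) * sigma0
Y = X @ w_true + sigma * rng.normal(size=N)

def neg_log_posterior(w):
    # - log p(w | Y) up to constants:
    # sum_i (y_i - w^T x_i)^2 / (2 sigma^2) + ||w||^2 / (2 sigma_0^2)
    r = Y - X @ w
    return 0.5 * np.sum(r ** 2) / sigma ** 2 + 0.5 * np.sum(w ** 2) / sigma0 ** 2

w_map = minimize(neg_log_posterior, x0=np.zeros(p)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
print(w_map, w_ridge)   # the two coincide (up to optimizer tolerance)
```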


5 Summary

1. [Least squares estimation (LSE) of the linear model parameters] is equivalent to [maximum likelihood estimation (MLE) of the linear model parameters with noise $\varepsilon \sim N(0, \sigma^2)$].

2. [Minimizing the squared loss with an L2 regularization term in the linear model] is equivalent to [maximum a posteriori estimation (MAP) of the linear model parameters with noise $\varepsilon \sim N(0, \sigma^2)$ and a Gaussian prior $w \sim N(0, \sigma_0^2)$].


END

Origin blog.csdn.net/qq_40061206/article/details/112447541