Machine Learning & Deep Learning - Basic Elements of Linear Regression

Regression is used to represent the relationship between input and output.
Let's explain linear regression with a practical example: estimating the price of a house based on its area and age. To build this house-price prediction model, we need to collect a real data set that includes the sale price, area, and age of the houses.
In machine learning, this data set is called a training set, each row of data is called a sample or data point, and the target you are trying to predict is called a label or target. The independent variables (area and age) on which the prediction is based are called features.
Usually, we use n to denote the number of samples in a dataset. For the sample with index i, its input is expressed as:
$x^{(i)} = [x_1^{(i)}, x_2^{(i)}]^T$
and its corresponding label is:
$y^{(i)}$
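As a quick illustration of this notation, here is a minimal NumPy sketch with a made-up toy training set (the feature values and prices are purely illustrative): each row of X is one sample x^(i) with its area and age, and y holds the corresponding labels.

```python
import numpy as np

# Hypothetical toy training set; values are made up for illustration.
# Each row of X is one sample x^(i) = [area, age]; y[i] is its label (price).
X = np.array([[120.0,  5.0],
              [ 80.0, 20.0],
              [150.0,  2.0]])
y = np.array([300.0, 180.0, 420.0])

n = X.shape[0]     # number of samples
i = 1
print(X[i], y[i])  # features and label of the sample with index i
```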

Basic Elements of Linear Regression

Linear model

$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$
Here, w_area and w_age are the weights, which determine the influence of each feature on the predicted value, and b is the bias, which is the predicted value when all features are 0.
Strictly speaking, the formula above is an affine transformation of the input features: a linear transformation via the weighted sum of the features, combined with a translation via the bias term.
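As a minimal sketch of this affine transformation, the snippet below computes the predicted price of one house from hypothetical weights and bias (the numbers are assumptions for illustration, not learned values):

```python
# Hypothetical parameters; in practice they would be learned from data.
w_area, w_age, b = 2.5, -1.0, 10.0

area, age = 120.0, 5.0                   # features of one house
price = w_area * area + w_age * age + b  # weighted sum plus bias
print(price)                             # 2.5*120 - 1.0*5 + 10 = 305.0
```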
In machine learning we usually work with high-dimensional data sets, so it is more convenient to use linear-algebra notation when modeling. When the input contains d features, we express the prediction as:
$\hat{y} = w_1 x_1 + \dots + w_d x_d + b$
Putting all the features into a vector x and all the weights into a vector w, we can express the model concisely with a dot product:
$\hat{y} = w^T x + b$
The vector x corresponds to the features of a single data sample. The matrix X conveniently represents the n samples of the entire dataset: each row of X is a sample, and each column is a feature.
For the features X, the predicted values can be expressed by matrix-vector multiplication as:
$\hat{y} = Xw + b$
The addition here uses the broadcasting mechanism. Given X and y, the goal of linear regression is to find a weight vector w and bias b such that, when the features of a new sample are drawn from the same distribution as X, the error in predicting the new sample's label is as small as possible.
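A minimal NumPy sketch of this vectorized form, using the same illustrative data and parameters as above: X @ w computes one dot product per row, and adding the scalar b broadcasts it across all n predictions.

```python
import numpy as np

X = np.array([[120.0,  5.0],   # each row: one sample's features (area, age)
              [ 80.0, 20.0],
              [150.0,  2.0]])
w = np.array([2.5, -1.0])      # weight vector, one entry per feature
b = 10.0                       # scalar bias

y_hat = X @ w + b              # matrix-vector product; b is broadcast over all n entries
print(y_hat)                   # [305. 190. 383.]
```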
Even if we are certain that the underlying relationship between features and labels is linear, we still add a noise term to account for observation errors.
Therefore, before starting to find the best model parameters w and b, two things are needed:
(1) a measure of model quality
(2) a method that can update the model to improve the quality of its predictions

Loss function

The loss function quantifies the difference between the actual and predicted values of the target. The loss is usually a non-negative number: the smaller the value, the smaller the loss, and a perfect prediction has a loss of 0.
The most commonly used loss function in regression problems is the squared error function:
$l^{(i)}(w, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$
The constant 1/2 makes no essential difference, but this form is slightly simpler (the constant coefficient becomes 1 after differentiation).
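Concretely, differentiating the per-sample loss with respect to the prediction gives
$\frac{\partial l^{(i)}}{\partial \hat{y}^{(i)}} = \frac{1}{2} \cdot 2\,(\hat{y}^{(i)} - y^{(i)}) = \hat{y}^{(i)} - y^{(i)}$
so the factor of 2 produced by the square cancels the constant 1/2.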
Larger differences between estimates and observations result in larger losses due to the quadratic term in the squared error function. In order to measure the quality of the model on the entire data set, we need to calculate the average loss (equivalent to summation) on n samples of the training set:
$L(w, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(w, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(w^T x^{(i)} + b - y^{(i)}\right)^2$
When training the model, we want to find a set of parameters that minimizes the total loss over all training samples.
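Here is a minimal NumPy sketch of both loss computations, reusing the illustrative data and parameters from above (they are not fitted values, so the loss is nonzero):

```python
import numpy as np

X = np.array([[120.0, 5.0], [80.0, 20.0], [150.0, 2.0]])
y = np.array([300.0, 180.0, 420.0])
w = np.array([2.5, -1.0])
b = 10.0

y_hat = X @ w + b
per_sample_loss = 0.5 * (y_hat - y) ** 2  # l^(i)(w, b) for each sample
L = per_sample_loss.mean()                # average loss L(w, b) over n samples
print(per_sample_loss, L)
```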

Analytical solution

Linear regression happens to be a very simple optimization problem: its solution can be expressed by a simple formula. This type of solution is called an analytical solution.
First, the bias b is absorbed into the parameter vector w by appending a column of all ones to the matrix containing the input features. The prediction problem is then to minimize:
$\|y - Xw\|^2$
This loss surface has only one critical point, which corresponds to the minimum of the loss over the entire domain. Setting the derivative of the loss with respect to w to 0 yields the analytical solution:
$w^* = (X^T X)^{-1} X^T y$
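As a minimal sketch of this closed-form solution on the toy data above: a column of ones is appended to X so the bias is absorbed into w, and the normal equations are solved with np.linalg.solve rather than forming the explicit inverse (numerically preferable, but equivalent to the formula).

```python
import numpy as np

X = np.array([[120.0, 5.0], [80.0, 20.0], [150.0, 2.0]])
y = np.array([300.0, 180.0, 420.0])

# Append a column of ones so the bias becomes the last entry of w.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve (X^T X) w* = X^T y, equivalent to w* = (X^T X)^{-1} X^T y.
w_star = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
print(w_star)  # weights for area and age, followed by the bias
```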
However, analytical solutions impose strict requirements on the problem and therefore cannot be applied broadly in deep learning. Next, we will discuss stochastic gradient descent, which can be used to optimize almost all deep learning models.
