Machine learning linear regression analysis

Basic form of linear model

A linear model makes predictions for a sample through a linear combination of its attributes:

$f(x) = w_1x_1 + w_2x_2 + \dots + w_dx_d + b$

Written in vector form:

$f(x) = w^Tx + b$

Here $w$ is the weight vector (one weight per attribute), $b$ is the bias term, $x$ is the sample vector, and $f(x)$ is the predicted value.
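
As a quick illustration, a minimal NumPy sketch of this prediction; the weights, bias, and sample values below are made-up numbers, not taken from any data set:

```python
import numpy as np

# Illustrative weights, bias, and sample (d = 3 attributes); not real data.
w = np.array([0.5, -1.2, 3.0])
b = 0.7
x = np.array([1.0, 2.0, 0.5])

f_x = np.dot(w, x) + b    # f(x) = w^T x + b
print(f_x)                # 0.5*1.0 - 1.2*2.0 + 3.0*0.5 + 0.7 = 0.3
```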

Linear regression

Regression analysis is a predictive modeling technique that studies the relationship between independent variables and dependent variables.

Mathematical description:

Given a data set $\{(x_i, y_i),\ i = 1, \dots, n\}$ with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$,

the model assumes $y_i = f(x_i) + \epsilon_i$,

where $\epsilon_i$ denotes the prediction error for $y_i$.
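
For concreteness, a small sketch that generates synthetic data following this model, $y_i = f(x_i) + \epsilon_i$; the true weights, bias, and noise scale are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                           # n samples, each with p attributes
X = rng.normal(size=(n, p))             # sample vectors x_i in R^p
w_true = np.array([2.0, -1.0, 0.5])     # illustrative "true" weights
b_true = 1.5                            # illustrative "true" bias
eps = rng.normal(scale=0.1, size=n)     # prediction errors epsilon_i

y = X @ w_true + b_true + eps           # y_i = f(x_i) + epsilon_i
print(X.shape, y.shape)                 # (100, 3) (100,)
```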

Solving parameters w and b

Generally speaking, we want to minimize the mean squared error of the predictions; the parameters that achieve this minimum are the ones we seek.

Cost function:

$J(w,b) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - f(x_i))^2$

Linear regression models are trained using the least squares method.

Least squares criterion: the sum of squared prediction residuals over all training samples is minimized.

By minimizing the cost function, w and b are obtained:

$[w^*, b^*] = \arg\min_{w,b} J(w,b)$
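
The cost can be evaluated directly for any candidate $(w, b)$; a minimal sketch on synthetic data (all numbers are illustrative):

```python
import numpy as np

def cost(w, b, X, y):
    """Sum of squared residuals: J(w, b) = sum_i (y_i - f(x_i))^2."""
    residuals = y - (X @ w + b)
    return np.sum(residuals ** 2)

# Illustrative data: y = 2*x1 - x2 + 1.5 plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 1.5 + rng.normal(scale=0.1, size=100)

print(cost(np.array([2.0, -1.0]), 1.5, X, y))   # small: only the noise remains
print(cost(np.zeros(2), 0.0, X, y))             # much larger for a poor fit
```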

Common parameter solving methods

1. Analytical method

Take the partial derivatives of the cost function and set them to zero (though you may run into the case where the matrix is not invertible).

Suitable for small sample sizes.
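
A sketch of this closed-form solution, using the usual trick of absorbing the bias into the weights via a column of ones; `np.linalg.pinv` (the pseudo-inverse) is used here to sidestep the non-invertible case mentioned above:

```python
import numpy as np

def fit_linear_analytical(X, y):
    """Closed-form least squares: set the partial derivatives of J(w, b) to zero.
    The bias b is handled by appending a column of ones to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    # The pseudo-inverse also covers the case where Xb^T Xb is not invertible.
    params = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y
    return params[:-1], params[-1]                 # (w, b)

# Illustrative data: y = 2*x1 - x2 + 1.5 plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 1.5 + rng.normal(scale=0.1, size=100)

w, b = fit_linear_analytical(X, y)
print(w, b)   # close to [2.0, -1.0] and 1.5
```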

2. Numerical optimization method (gradient descent method, etc.)

The solution is found iteratively, using methods such as gradient descent.

Suitable for large sample sizes.
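
A minimal (batch) gradient descent sketch for the same cost function; the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def fit_linear_gd(X, y, lr=0.01, n_iters=2000):
    """Batch gradient descent for least squares.
    Uses the gradient of the mean squared error so the step size
    does not depend on the number of samples n."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_iters):
        residuals = y - (X @ w + b)
        grad_w = -2 * X.T @ residuals / n     # d(MSE)/dw
        grad_b = -2 * np.sum(residuals) / n   # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Illustrative data: y = 2*x1 - x2 + 1.5 plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 1.5 + rng.normal(scale=0.1, size=100)

print(fit_linear_gd(X, y))   # close to the analytical solution
```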

Batch and mini-batch algorithms

1. Batch gradient descent: uses all training samples to estimate the gradient at each step, which is computationally expensive.

2. Mini-batch gradient descent: uses a subset of the training samples to estimate the gradient at each step.

3. Stochastic gradient descent: draws a single training sample from the training set at each step to estimate the gradient.
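
A sketch of the mini-batch variant; setting `batch_size` to the full sample count recovers batch gradient descent, and `batch_size=1` recovers stochastic gradient descent (all hyperparameters here are illustrative):

```python
import numpy as np

def fit_linear_minibatch(X, y, batch_size=16, lr=0.01, n_epochs=200, seed=0):
    """Mini-batch gradient descent: each update uses only `batch_size` samples
    to estimate the gradient of the squared-error cost."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residuals = y[idx] - (X[idx] @ w + b)
            w -= lr * (-2 * X[idx].T @ residuals) / len(idx)
            b -= lr * (-2 * np.sum(residuals)) / len(idx)
    return w, b

# Illustrative data: y = 2*x1 - x2 + 1.5 plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 1.5 + rng.normal(scale=0.1, size=100)

print(fit_linear_minibatch(X, y))                  # mini-batch
print(fit_linear_minibatch(X, y, batch_size=1))    # stochastic gradient descent
print(fit_linear_minibatch(X, y, batch_size=100))  # full-batch gradient descent
```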

Regularization (parameter norm penalty)

A parameter norm penalty is added to the cost function $J$ to limit the learning capacity of the model. The regularized cost function is: $J'(w, b) = J(w, b) + \lambda\Omega(w)$

where $\Omega(w)$ is the penalty term and $\lambda$ controls its strength.

L1 regularization (lasso regression): introduces the one-norm (L1) penalty of the parameters into the cost function, $\Omega(w) = \|w\|_1$.

L2 regularization (ridge regression): introduces the two-norm (L2) penalty of the parameters into the cost function, $\Omega(w) = \|w\|_2^2$.
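
A sketch of the L2 (ridge) case, which keeps a closed-form solution by adding $\lambda I$ to $X^TX$; the L1 (lasso) case has no closed form and is usually solved iteratively (e.g. by coordinate descent). The $\lambda$ value below is purely illustrative:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Ridge regression: minimize J(w, b) + lam * ||w||_2^2.
    Centering X and y leaves the bias b unpenalized; the weights then have the
    closed form w = (Xc^T Xc + lam * I)^{-1} Xc^T yc."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

# Illustrative data: y = 2*x1 - x2 + 1.5 plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 1.5 + rng.normal(scale=0.1, size=100)

print(fit_ridge(X, y, lam=1.0))   # weights shrunk slightly toward zero
```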


Source: blog.csdn.net/weixin_43772166/article/details/109576924