Basic form of linear model
The model predicts a sample through a linear combination of its attributes:
f(x) = w_1 x_1 + w_2 x_2 + ... + w_d x_d + b
Written in vector form:
f(x) = w^T x + b
Here w is the weight vector (one weight per attribute), b is the bias term, x is the sample's attribute vector, and f(x) is the predicted value.
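As a quick illustration, here is a minimal NumPy sketch of the vector form above; the weight, bias, and sample values are made up for the example:

```python
import numpy as np

# Minimal sketch of f(x) = w^T x + b with illustrative values.
w = np.array([0.5, -1.2, 3.0])   # one weight per attribute (d = 3)
b = 0.7                          # bias term
x = np.array([1.0, 2.0, 0.5])    # one sample with d = 3 attributes

f_x = w @ x + b                  # linear combination of attributes plus bias
print(f_x)                       # predicted value
```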
Linear regression
Regression analysis is a predictive modeling technique that studies the relationship between independent variables and dependent variables.
Mathematical description:
Given a training set {(x_i, y_i), i = 1, ..., n}, with x_i ∈ R^p and y_i ∈ R,
where y_i = f(x_i) + ϵ_i
and ϵ_i denotes the prediction error for y_i.
Solving parameters w and b
In general, we want to minimize the mean squared error of the predictions; the parameters that achieve this minimum are the parameters we want.
Cost function:
J(w, b) = ∑_{i=1}^{n} ϵ_i^2 = ∑_{i=1}^{n} (y_i − f(x_i))^2
Linear regression models are trained using the least squares method.
Least squares criterion: minimize the sum of squared prediction residuals over all training samples.
By minimizing the cost function, w and b are obtained:
[w^*, b^*] = arg min_{w, b} J(w, b)
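To make the criterion concrete, here is a small sketch that evaluates J(w, b) on a toy training set; all values below are illustrative:

```python
import numpy as np

# Evaluate the least-squares cost J(w, b) on a toy training set.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])        # n = 3 samples, p = 2 attributes
y = np.array([4.0, 3.0, 5.0])

w = np.array([1.0, 0.5])
b = 0.2

residuals = y - (X @ w + b)       # epsilon_i = y_i - f(x_i)
J = np.sum(residuals ** 2)        # sum of squared prediction errors
print(J)
```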
Common parameter solving methods
1. Analytical method
Take the partial derivatives of the cost function, set them to 0, and solve (though the resulting matrix may not be invertible); a closed-form sketch follows after this list.
Suitable for small samples
2. Numerical optimization method (gradient descent method, etc.)
Iterative solution using methods such as gradient descent
Suitable for situations with a large number of samples
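Below is a sketch of the analytical solution via the normal equations, with the bias b absorbed by a constant-1 column; the data are illustrative, and a pseudo-inverse is used to sidestep the non-invertible case mentioned above:

```python
import numpy as np

# Closed-form least squares: w_aug = (X^T X)^+ X^T y.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])
y = np.array([4.0, 3.0, 5.0, 8.0])

X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # last column absorbs b

# pinv handles the case where X^T X is not invertible.
w_aug = np.linalg.pinv(X_aug.T @ X_aug) @ X_aug.T @ y
w, b = w_aug[:-1], w_aug[-1]
print(w, b)
```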
Batch and mini-batch algorithms
1. Batch gradient descent method: uses all training samples to estimate the gradient at each step, which is computationally expensive.
2. Mini-batch gradient descent method: uses a subset of the training samples to estimate the gradient at each step (as shown in the sketch after this list).
3. Stochastic gradient descent method: at each step, a single training sample is drawn from the training set to estimate the gradient.
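Here is a sketch of mini-batch gradient descent on synthetic data; the learning rate, batch size, and data-generating values are assumptions for the example:

```python
import numpy as np

# Mini-batch gradient descent for the least-squares cost.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 attributes
true_w, true_b = np.array([2.0, -1.0, 0.5]), 1.0
y = X @ true_w + true_b + 0.1 * rng.normal(size=100)

w, b = np.zeros(3), 0.0
lr, batch_size = 0.05, 16

for epoch in range(200):
    idx = rng.permutation(len(X))              # shuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        err = Xb @ w + b - yb                  # f(x_i) - y_i on the mini-batch
        # Gradients of the mean squared error with respect to w and b
        w -= lr * 2 * Xb.T @ err / len(batch)
        b -= lr * 2 * err.mean()

print(w, b)   # should approach true_w and true_b
```

Setting batch_size to len(X) recovers batch gradient descent, and batch_size = 1 recovers stochastic gradient descent.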
Regularization (parameter norm penalty)
A parameter norm penalty is added to the cost function J to limit the learning capacity of the model. The regularized cost function is: J'(w, b) = J(w, b) + λΩ(w)
Ω(w) is the penalty term and λ controls its strength.
L1 regularization (lasso regression): introduces the one-norm penalty of the parameters, Ω(w) = ||w||_1, into the cost function.
L2 regularization (ridge regression): introduces the (squared) two-norm penalty of the parameters, Ω(w) = ||w||_2^2, into the cost function; a closed-form ridge sketch follows below.
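A minimal sketch of ridge regression in closed form, w* = (X^T X + λI)^{-1} X^T y; the data and λ are illustrative, and the bias term is omitted for brevity (in practice the bias is usually not penalized). L1 (lasso) has no closed form and is typically solved iteratively, e.g. by coordinate descent.

```python
import numpy as np

# Ridge regression closed form: w* = (X^T X + lambda * I)^{-1} X^T y.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])
y = np.array([4.0, 3.0, 5.0, 8.0])
lam = 0.5                                       # regularization strength

XtX = X.T @ X
w_ridge = np.linalg.solve(XtX + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)
```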