[Eating the book together] "Machine Learning" Chapter 3 Linear Models

Chapter 3 Linear Models


3.1 Basic form

  Given an example described by $d$ attributes, $\mathbf{x} = (x_1; x_2; \ldots; x_d)$, where $x_i$ is the value of $\mathbf{x}$ on the $i$-th attribute, a linear model tries to learn a function that makes predictions through a linear combination of the attributes:
$$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
  The linear model is simple in form and easy to build, and it also has good interpretability.
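  As an illustration (my own example, not from the book), here is a minimal NumPy sketch of evaluating the linear model above; the weights, bias, and attribute values are made up:

```python
import numpy as np

# Hypothetical weights, bias, and one example with d = 3 attributes
w = np.array([0.2, 0.5, -0.1])   # w_1, ..., w_d
b = 1.0
x = np.array([3.0, 2.0, 6.0])    # x_1, ..., x_d

# f(x) = w^T x + b
f_x = w @ x + b
print(f_x)  # 0.6 + 1.0 - 0.6 + 1.0 = 2.0
```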

3.2 Linear regression

  Given a data set $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_m, y_m)\}$, where $\mathbf{x}_i = (x_{i1}; x_{i2}; \ldots; x_{id})$ and $y_i \in \mathbb{R}$, "linear regression" attempts to learn a linear model that predicts the real-valued output labels as accurately as possible. The mean squared error is the most commonly used performance measure in regression tasks, so we usually try to minimize it:

$$(w^*, b^*) = \mathop{\arg\min}\limits_{(w,b)} \sum_{i=1}^{m} \left(f(x_i) - y_i\right)^2$$
  The mean squared error corresponds to the commonly used Euclidean distance, and the method of solving the model by minimizing the mean squared error is called the least squares method. In linear regression, the least squares method tries to find a straight line that minimizes the sum of the Euclidean distances from all samples to the line. Taking the derivatives of the above formula with respect to $w$ and $b$ and setting the partial derivatives to zero yields the closed-form optimal solution:
$$w = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2}, \qquad b = \frac{1}{m}\sum_{i=1}^{m} (y_i - w x_i)$$

where $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$ is the mean of the attribute values.
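  A minimal NumPy sketch of these closed-form expressions on made-up one-dimensional data (the variable names follow the formulas above, not any code from the book):

```python
import numpy as np

# Made-up 1-D data: y is roughly 2*x + 1 with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
m = len(x)

# Closed-form least squares solution for w and b
x_bar = x.mean()
w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
b = np.mean(y - w * x)
print(w, b)  # roughly 1.95 and 1.15, close to the true 2 and 1
```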
  There are also multivariate linear regression (multiple attributes), log-linear regression (the output changes on an exponential scale), and so on. In multivariate linear regression, several solutions may exist, so the inductive bias of the learning algorithm decides which solution to output. A common practice is to introduce a regularization term. The commonly used $L_p$ norms are listed below.

  • The L0 norm refers to the number of non-zero elements in the vector. Its role is to improve the sparsity of model parameters, that is, to make some parameters zero, thereby reducing the number of features.

  • The L1 norm refers to the sum of the absolute values of the elements of the vector. It also promotes sparsity of the model parameters and can perform feature selection by driving some feature coefficients to zero. The L1 norm can also be used as a regularization term to prevent overfitting and improve the generalization ability of the model.

  • The L2 norm refers to the square root of the sum of the squares of the elements of the vector. It shrinks all parameters of the model, which can prevent overfitting and improve generalization. The L2 norm keeps more features, with coefficients close to zero but not exactly zero; used as a regularization term it can also improve the condition number of the matrix and speed up convergence.
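  A small sketch (my own illustration, not from the book) of computing these norms for a parameter vector and adding an L2 penalty to the squared-error loss; `lam` is a made-up regularization coefficient:

```python
import numpy as np

w = np.array([0.0, 0.5, -1.2, 0.0, 3.0])   # hypothetical model parameters

l0 = np.count_nonzero(w)        # L0 "norm": number of non-zero entries
l1 = np.sum(np.abs(w))          # L1 norm: sum of absolute values
l2 = np.sqrt(np.sum(w ** 2))    # L2 norm: square root of the sum of squares
print(l0, l1, l2)

# L2-regularized squared-error loss (ridge-style), with made-up data
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w2, b, lam = np.array([0.9, 1.8]), 0.1, 0.1
loss = np.sum((X @ w2 + b - y) ** 2) + lam * np.sum(w2 ** 2)
print(loss)
```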

3.3 Log odds regression

  Consider the binary classification task, whose output label is $y \in \{0, 1\}$. Since the predicted value produced by the linear regression model is a real value, it needs to be converted. The ideal converter is the unit step function: if the predicted value is greater than zero, judge it a positive example; if less than zero, a negative example; if equal to zero, either judgment is acceptable. However, since the unit step function is discontinuous, we hope to find a surrogate function that approximates it to some extent and is monotonic and differentiable, which leads to the log-odds function.

  The log-odds function is a sigmoid function that converts the $z$ value into a $y$ value. Its expression is $y = \frac{1}{1 + e^{-z}}$, which can also be written as $\ln \frac{y}{1-y} = \mathbf{w}^T\mathbf{x} + b$. Here $y$ is regarded as the possibility that the sample is a positive example and $1-y$ as the possibility that it is a negative example, so $\frac{y}{1-y}$ is called the odds and $\ln \frac{y}{1-y}$ is called the log odds. The corresponding linear regression model is called log-odds regression (logistic regression); although it is called regression, it is actually a classification method.

  For log-odds regression, the values of $\mathbf{w}$ and $b$ can be estimated by the maximum likelihood method.
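  A minimal sketch, under my own assumptions about the data and hyperparameters, of estimating $\mathbf{w}$ and $b$ by maximizing the likelihood (equivalently, minimizing the negative log-likelihood) with plain gradient descent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up, roughly separable 2-D data
X = np.array([[0.5, 1.0], [1.0, 2.0], [1.5, 1.8],
              [3.0, 4.0], [3.5, 3.2], [4.0, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1  # assumed learning rate

for _ in range(5000):
    p = sigmoid(X @ w + b)              # predicted probability of the positive class
    grad_w = X.T @ (p - y) / len(y)     # gradient of the negative log-likelihood
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)
print(sigmoid(X @ w + b).round(2))  # near 0 for the first three samples, near 1 for the last three
```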

3.4 Linear Discriminant Analysis

  Linear discriminant analysis (LDA) is a classic linear learning method, also known as "Fisher discriminant analysis". The main idea is: given a set of training samples, try to project the samples onto a straight line so that the projections of samples from the same class are as close as possible, while the projections of samples from different classes are as far apart as possible. Formally:

Given $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ with $y_i \in \{0, 1\}$, the projections of the two class centers onto the line are $\boldsymbol{\omega}^T\boldsymbol{\mu}_0$ and $\boldsymbol{\omega}^T\boldsymbol{\mu}_1$, and the covariances of the two classes of samples after projection are $\boldsymbol{\omega}^T\boldsymbol{\Sigma}_0\boldsymbol{\omega}$ and $\boldsymbol{\omega}^T\boldsymbol{\Sigma}_1\boldsymbol{\omega}$. To make the projections of same-class samples as close as possible, the covariance should be as small as possible; to make the projections of different-class samples as far apart as possible, the distance between the class centers $\left\| \boldsymbol{\omega}^T\boldsymbol{\mu}_0 - \boldsymbol{\omega}^T\boldsymbol{\mu}_1 \right\|$ should be as large as possible. Combining the two gives the following maximization objective, whose optimal solution can be obtained with the Lagrange multiplier method and singular value decomposition.
$$J = \frac{\left\| \boldsymbol{\omega}^T\boldsymbol{\mu}_0 - \boldsymbol{\omega}^T\boldsymbol{\mu}_1 \right\|_2^2}{\boldsymbol{\omega}^T\boldsymbol{\Sigma}_0\boldsymbol{\omega} + \boldsymbol{\omega}^T\boldsymbol{\Sigma}_1\boldsymbol{\omega}}$$
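  A small NumPy sketch, on made-up two-class data, of the closed-form direction $\boldsymbol{\omega} \propto \mathbf{S}_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$ that maximizes this objective, where $\mathbf{S}_w = \boldsymbol{\Sigma}_0 + \boldsymbol{\Sigma}_1$ is the within-class scatter:

```python
import numpy as np

# Made-up 2-D samples for two classes
X0 = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.5]])   # class 0
X1 = np.array([[4.0, 5.0], [4.5, 4.8], [5.0, 5.5]])   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter matrix S_w = Sigma_0 + Sigma_1 (unnormalized covariances)
S0 = (X0 - mu0).T @ (X0 - mu0)
S1 = (X1 - mu1).T @ (X1 - mu1)
Sw = S0 + S1

# Projection direction (up to scale); pinv is a bit more robust than inv
w = np.linalg.pinv(Sw) @ (mu0 - mu1)
w /= np.linalg.norm(w)
print(w)
print(X0 @ w, X1 @ w)  # projections of the two classes should be well separated
```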

3.5 Multi-Classification Learning

  Multi-class learning tasks are generally solved by splitting them into several binary classification tasks. The three most classic splitting strategies are one-vs-one (OvO), one-vs-rest (OvR), and many-vs-many (MvM). OvO pairs up the N classes, producing N(N-1)/2 binary classification tasks; OvR takes one class as the positive class and all the rest as negative each time, producing N binary classification tasks.
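  A quick sketch of these two strategies using scikit-learn's wrapper classes (my own example, assuming scikit-learn is installed); the number of trained binary classifiers matches the counts above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # N = 3 classes

ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovo.estimators_))  # N(N-1)/2 = 3 binary classifiers
print(len(ovr.estimators_))  # N = 3 binary classifiers
print(ovo.predict(X[:5]), ovr.predict(X[:5]))
```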

  Many-vs-many (MvM) uses several classes as the positive class and several other classes as the negative class each time. The most common MvM technique is the Error Correcting Output Code (ECOC). The coding has a certain tolerance and correction ability, and the longer the code, the stronger the error-correcting ability. It consists of the following two steps:

  • Coding: divide the N classes M times; each division takes some classes as positive and some as negative, forming a binary classification training set. In this way, M training sets are produced in total, from which M binary classifiers can be trained.
  • Decoding: the M classifiers each make a prediction on a test sample, and these prediction marks form a code. This predicted code is compared with each class's own code, and the class with the smallest distance is returned as the final prediction.
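  A minimal sketch of ECOC decoding with a made-up binary coding matrix and Hamming distance (the matrix and predictions are invented for illustration):

```python
import numpy as np

# Made-up ECOC coding matrix: N = 4 classes, M = 5 binary classifiers
# Each row is the code of one class (+1 = positive class, -1 = negative class)
codes = np.array([[+1, -1, +1, -1, +1],
                  [-1, +1, -1, +1, +1],
                  [+1, +1, -1, -1, -1],
                  [-1, -1, +1, +1, -1]])

# Hypothetical predictions of the M classifiers on one test sample
pred = np.array([+1, -1, +1, +1, +1])

# Hamming distance between the predicted code and each class code
hamming = np.sum(pred != codes, axis=1)
print(hamming)             # [1, 3, 4, 2]
print(np.argmin(hamming))  # the class with the smallest distance wins (class 0 here)
```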

3.6 Class Imbalance Problem

  Class imbalance refers to the situation where the numbers of training samples of different classes differ greatly. This situation is very common in real classification learning tasks. Generally, the following three methods can be used to deal with it:

  • Undersampling: Remove some negative examples so that the number of positive and negative examples is close.
  • Oversampling: Add some positive examples to make the number of positive and negative examples close.
  • Threshold moving (rescaling): learn from the original training set, but when making predictions with the trained classifier, embed $\frac{y'}{1-y'} = \frac{y}{1-y} \times \frac{m^-}{m^+}$ into the decision process, where $m^+$ and $m^-$ are the numbers of positive and negative training examples; a positive example is predicted when the rescaled odds $\frac{y'}{1-y'} > 1$ (see the sketch after this list).
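  A small sketch of threshold moving, with made-up class counts and a made-up predicted probability:

```python
# Made-up class counts in the training set and one classifier output
m_pos, m_neg = 100, 900          # m^+ positive examples, m^- negative examples
y = 0.30                         # predicted probability of the positive class

# Rescaled odds: y'/(1 - y') = y/(1 - y) * (m^- / m^+)
rescaled_odds = (y / (1 - y)) * (m_neg / m_pos)
prediction = 1 if rescaled_odds > 1 else 0
print(rescaled_odds, prediction)   # about 3.86, so predicted positive even though y < 0.5
```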

  Note that the time overhead of undersampling is usually much smaller than that of oversampling, because undersampling discards many negative examples, so the classifier's training set is much smaller than the initial one, while oversampling adds many positive examples, so its training set is larger than the initial one. Also note that oversampling cannot simply duplicate the initial positive samples, otherwise it will lead to serious overfitting.

3.7 Why does logistic regression use cross entropy instead of mean square error as a loss function

  You can refer to this article [Logistic Regression], which is very good and well worth reading!


Origin blog.csdn.net/qq_44528283/article/details/130091424