[Machine Learning] Watermelon Book Notes (2) - Linear Model

This article records notes on part of Chapter 3 of "Machine Learning" (the Watermelon Book): the Linear Model.

Basic form of linear model

The vector form of a linear model is:

$$f(\bm x) = \bm w^T \bm x + b$$

The linear model is simple in form and easy to build. The weight vector $\bm w$ directly expresses the importance of each feature dimension of the input sample $\bm x$ in the prediction, so the model has good interpretability (comprehensibility); this is an advantage of linear models.
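As a small illustration, here is a minimal Python sketch of evaluating $f(\bm x) = \bm w^T \bm x + b$ for one sample; the weight values and feature values are made-up numbers for illustration only.

```python
import numpy as np

# A minimal sketch of the linear model f(x) = w^T x + b.
# The weights and the sample below are illustrative, made-up numbers.
w = np.array([0.2, 0.5, 0.3])   # one weight per feature dimension
b = 0.1                          # bias term
x = np.array([0.7, 0.1, 0.4])   # a single input sample with d = 3 features

f_x = w @ x + b                  # prediction: weighted sum of the features plus bias
print(f_x)                       # each w_j reflects the importance of feature j
```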

In addition, linear models embody some important ideas of machine learning, and many more powerful nonlinear models are obtained by introducing hierarchical structures or high-dimensional mappings on top of linear models.

Linear regression

One of the classic tasks of linear models is regression. The so-called "linear regression" task tries to learn a linear model that predicts the real-valued output label of new samples as accurately as possible.

The general case is as follows: given a data set $D=\{(\bm x_1,y_1), (\bm x_2,y_2), \cdots, (\bm x_m,y_m)\}$, where $\bm x_i=(x_{i1}; x_{i2}; \cdots; x_{id})$ and $y_i \in \mathbb R$, the task of linear regression is to learn $f(\bm x) = \bm w^T \bm x + b$ such that $f(\bm x_i) \simeq y_i$. This is called "multivariate linear regression".

The task of learning is to determine the weight vector $\bm w$ and the bias $b$. For convenience of discussion, write $\hat{\bm w} = (\bm w; b)$ and $\bm y = (y_1; y_2; \cdots; y_m)$, and collect the samples into the $m \times (d+1)$ matrix

$$\bm X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} & 1 \\ x_{21} & x_{22} & \cdots & x_{2d} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} & 1 \end{pmatrix}$$

in which each row is one sample with a constant 1 appended to absorb the bias.
How do we determine $\hat{\bm w}$? First we need a metric of the difference between the model prediction $f(\bm x_i)$ and the true value $y_i$; a commonly used one is the mean squared error:

$$E_{\hat{\bm w}} = (\bm y - \bm X\hat{\bm w})^T(\bm y - \bm X\hat{\bm w})$$

Minimizing the mean squared error yields $\hat{\bm w}$, i.e.

$$\hat{\bm w}^* = \arg\min_{\hat{\bm w}} (\bm y - \bm X\hat{\bm w})^T(\bm y - \bm X\hat{\bm w})$$

from which the multivariate linear regression model is obtained.
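As a concrete illustration, the following Python sketch solves the least-squares problem above on synthetic data (the generating weights, bias, and noise level are assumptions, not from the book). It appends a column of ones to the data matrix so that $\hat{\bm w} = (\bm w; b)$, matching the formulation above.

```python
import numpy as np

# Multivariate linear regression by least squares on synthetic data.
rng = np.random.default_rng(0)
m, d = 100, 3
X_raw = rng.normal(size=(m, d))                  # m samples, d features
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.3  # assumed "ground truth"
y = X_raw @ true_w + true_b + rng.normal(scale=0.1, size=m)

# Append a column of ones so that w_hat = (w; b) absorbs the bias.
X = np.hstack([X_raw, np.ones((m, 1))])

# Minimize (y - X w_hat)^T (y - X w_hat); when X^T X has full rank this equals
# the closed form (X^T X)^{-1} X^T y, but lstsq is numerically safer.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated w:", w_hat[:-1], "estimated b:", w_hat[-1])
```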

Log odds regression

In addition to regression tasks, linear models can also be applied to classification tasks.
When a linear model is applied to a classification task, it is only necessary to find a monotonically differentiable function that links the true label $y$ of the classification task to the predicted value of the regression model.

Taking the binary classification task as an example, the output label is $y\in\{0,1\}$, while the predicted value $z = \bm w^T \bm x + b$ produced by the linear regression model is a real value. To connect the two, we need a monotone differentiable function; the commonly used one is the log-odds function (logistic function):

$$y = \frac{1}{1 + e^{-z}}$$

This function converts the value $z$ into a value close to 0 or 1. Substituting $z = \bm w^T \bm x + b$ gives

$$y = \frac{1}{1 + e^{-(\bm w^T \bm x + b)}} \quad \Rightarrow \quad \ln\frac{y}{1-y} = \bm w^T \bm x + b$$

It can be seen that such a model actually uses the prediction $\bm w^T \bm x + b$ of the linear regression model to approximate the log odds $\ln\frac{y}{1-y}$ of the true label $y$, so it is called the "log odds regression" (logistic regression) model. Although the name contains "regression", it is actually a classification learning method.
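A small Python sketch of the logistic function and the log-odds relation above (the sample $z$ values are arbitrary):

```python
import numpy as np

# The logistic (log-odds) function and its inverse relation.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real z into (0, 1)

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
y = sigmoid(z)
print(y)                               # values pushed towards 0 or 1 as |z| grows

# The log odds ln(y / (1 - y)) recovers z = w^T x + b, which is exactly
# what logistic regression asks the linear model to approximate.
print(np.log(y / (1.0 - y)))           # equals z up to floating-point error
```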

Advantages of the above approach:

  1. It models the class likelihood directly, without any prior assumption about the data distribution
  2. It predicts not only the "category" but also an approximate probability, which is very useful in tasks that need probabilities to assist decision-making
  3. The log-odds function is a convex function that is differentiable to any order and has good mathematical properties, so many existing numerical optimization algorithms can be applied directly to find the optimal solution

The parameters $\bm w$ and $b$ in the above model can be estimated by the maximum likelihood method.
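The book only notes that the convex likelihood objective can be optimized with standard numerical methods; as one possible illustration, the sketch below fits $(\bm w; b)$ by plain gradient descent on the negative log-likelihood of synthetic binary data (the data-generating parameters, learning rate, and iteration count are assumptions).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary-classification data (the generating w and b are assumptions).
rng = np.random.default_rng(1)
m, d = 200, 2
X_raw = rng.normal(size=(m, d))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = (sigmoid(X_raw @ true_w + true_b) > rng.uniform(size=m)).astype(float)

X = np.hstack([X_raw, np.ones((m, 1))])   # absorb b into beta = (w; b)
beta = np.zeros(d + 1)

# Gradient descent on the negative log-likelihood (a convex objective),
# whose gradient is X^T (sigmoid(X beta) - y) / m.
lr = 0.5
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ beta) - y) / m
    beta -= lr * grad

print("estimated w:", beta[:-1], "estimated b:", beta[-1])
```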

Linear Discriminant Analysis (LDA)

The basic idea of Linear Discriminant Analysis (LDA) is as follows:

Given a training sample set, try to project all samples onto a straight line, so that the projection points of samples of the same class are as close to each other as possible, while the projection points of samples of different classes are as far apart as possible. When the trained model is applied to a new sample, i.e. when classifying it, the sample is projected onto the same straight line, and its category is determined by the distance from its projection point to the projected center of each class: it is assigned to whichever class's center is closest.
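A minimal Python sketch of this two-class LDA procedure on synthetic data (the class means, spread, and the test point are assumptions for illustration). It computes the within-class scatter matrix and the standard closed-form projection direction $\bm w = \bm S_w^{-1}(\bm\mu_0 - \bm\mu_1)$, then classifies a new sample by its projected distance to each class center.

```python
import numpy as np

# Two-class LDA sketch on synthetic 2-D data.
rng = np.random.default_rng(2)
X0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))   # class 0 samples
X1 = rng.normal(loc=[2.0, 1.5], scale=0.5, size=(50, 2))   # class 1 samples

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
# Within-class scatter matrix S_w: sum of the two classes' scatter matrices.
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
# Projection direction that pulls same-class projections together and
# pushes the two class centers apart: w = S_w^{-1} (mu0 - mu1).
w = np.linalg.solve(Sw, mu0 - mu1)

# Classify a new sample by which projected class center its projection is closer to.
x_new = np.array([1.8, 1.2])
z, c0, c1 = x_new @ w, mu0 @ w, mu1 @ w
print("predicted class:", 0 if abs(z - c0) < abs(z - c1) else 1)
```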
