Watermelon Book Reading Notes (3) - Linear Models

Index of all notes: "Machine Learning" (Watermelon Book) - Reading Notes Summary

1. Basic form

A linear model tries to learn a prediction function that is a linear combination of the attributes. In vector form it is written as $f(x) = w^T x + b$.
Because $w$ directly expresses the importance of each attribute in the prediction, linear models have good interpretability.
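As a minimal illustration (my own sketch, not from the book), the prediction is just a dot product plus a bias; the weights and the example below are made up for demonstration.

```python
import numpy as np

# Hypothetical example: 3 attributes, weights w and bias b chosen arbitrarily
w = np.array([0.2, 0.5, -0.3])   # importance of each attribute
b = 1.0

def f(x):
    """Linear model prediction: f(x) = w^T x + b."""
    return w @ x + b

x = np.array([1.0, 2.0, 3.0])    # one example described by 3 attribute values
print(f(x))                      # 0.2*1 + 0.5*2 - 0.3*3 + 1 = 1.3
```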

2. Linear regression

A sample may be described by multiple attributes. In that case we try to learn $f(x_i) = w^T x_i + b$ such that $f(x_i) \approx y_i$; this is called "multivariate linear regression".
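A minimal sketch of fitting multivariate linear regression by least squares, assuming a made-up dataset `X`, `y`; the bias is absorbed into the parameter vector by appending a constant column.

```python
import numpy as np

# Hypothetical data: 5 examples with 2 attributes each, and their labels
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0], [5.0, 2.5]])
y = np.array([3.1, 3.4, 5.8, 9.0, 9.9])

# Append a constant column so the bias b is absorbed into the parameter vector
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Least-squares solution: minimizes ||X_aug @ theta - y||^2
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:-1], theta[-1]
print("w =", w, "b =", b)
print("predictions:", X @ w + b)
```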

Suppose we think the output label of an example varies on an exponential scale. Then the logarithm of the output can be used as the target that the linear model approximates, i.e. $\ln y = w^T x + b$. This is "log-linear regression"; it is actually trying to make $e^{w^T x + b}$ approximate $y$.

More generally, consider a monotonic differentiable function $g(\cdot)$ and let $y = g^{-1}(w^T x + b)$. The model obtained in this way is called a "generalized linear model", where $g(\cdot)$ is called the "link function". Obviously, log-linear regression is the special case of the generalized linear model with $g(\cdot) = \ln(\cdot)$.
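A small sketch of log-linear regression under these definitions (the data below are invented): fit a linear model to $\ln y$, then predict $e^{w^T x + b}$.

```python
import numpy as np

# Hypothetical data whose label grows roughly exponentially with x
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.7, 7.5, 19.0, 56.0, 148.0])

# Fit a linear model to ln(y) instead of y
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
theta, *_ = np.linalg.lstsq(X_aug, np.log(y), rcond=None)
w, b = theta[:-1], theta[-1]

# The model's prediction for y is exp(w^T x + b)
print("predicted y:", np.exp(X @ w + b))
```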

3. Logistic regression (log-odds regression)

The unit step function is not continuous, so it cannot be used directly as $g^{-1}(\cdot)$. We therefore look for a "surrogate function" that approximates the unit step function to some extent and is monotonic and differentiable. The logistic function is such a commonly used surrogate: $y = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(w^T x + b)}}$
which can be rewritten as $\ln \frac{y}{1-y} = w^T x + b$

We can then estimate $w$ and $b$ by the maximum likelihood method.
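A minimal sketch (with toy data invented here) of estimating $w$ and $b$ by maximum likelihood, implemented as gradient descent on the negative log-likelihood.

```python
import numpy as np

# Toy binary-classification data (made up for illustration)
X = np.array([[0.5, 1.0], [1.5, 2.0], [3.0, 3.5], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

# Absorb b into the parameter vector by appending a constant attribute
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
beta = np.zeros(X_aug.shape[1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on the negative log-likelihood of logistic regression
lr = 0.1
for _ in range(5000):
    p = sigmoid(X_aug @ beta)    # predicted P(y=1 | x)
    grad = X_aug.T @ (p - y)     # gradient of the negative log-likelihood
    beta -= lr * grad

w, b = beta[:-1], beta[-1]
print("w =", w, "b =", b)
print("P(y=1):", sigmoid(X_aug @ beta))
```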

4. Linear discriminant analysis

Linear Discriminant Analysis (LDA) is a classic linear learning method. It can be viewed as a kind of dimensionality reduction that makes the between-class scatter large and the within-class scatter small.

The idea of LDA is very simple: given a training set, try to project the examples onto a straight line so that the projection points of examples of the same class are as close as possible and the projection points of examples of different classes are as far apart as possible; when classifying a new sample, project it onto the same line and determine its class from the position of its projection point.

For a derivation, see: Whiteboard Derivation Series Notes (4) - Linear Classification
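A small sketch of the two-class LDA projection direction (toy data invented here): $w \propto S_w^{-1}(\mu_0 - \mu_1)$, where $S_w$ is the within-class scatter matrix.

```python
import numpy as np

# Toy two-class data (invented for illustration)
X0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class 0 examples
X1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])   # class 1 examples

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter matrix S_w = sum over classes of (x - mu)(x - mu)^T
Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)

# LDA projection direction: w proportional to S_w^{-1} (mu0 - mu1)
w = np.linalg.solve(Sw, mu0 - mu1)
print("projection direction w =", w)

# Project the examples onto the line; same-class points should cluster together
print("class 0 projections:", X0 @ w)
print("class 1 projections:", X1 @ w)
```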

5. Multi-class learning

The basic idea of multi-class learning is the "decomposition strategy": split the multi-class task into several binary classification tasks.

A commonly used MvM (many-vs-many) technique worth looking at is "Error Correcting Output Codes" (ECOC).
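As a sketch of the decomposition idea, here is one-vs-rest (OvR) rather than ECOC, with data and labels made up for illustration: one binary logistic classifier is trained per class, and at prediction time the most confident classifier wins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class data (made up): each class is decomposed into "this class vs. the rest"
X = np.array([[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [2.2, 1.9], [4.0, 0.1], [4.1, 0.2]])
y = np.array([0, 0, 1, 1, 2, 2])

# One binary logistic-regression classifier is trained per class;
# prediction picks the class whose classifier is most confident.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(clf.predict(np.array([[2.1, 2.0], [3.9, 0.0]])))
```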

6. Class imbalance

Class imbalance refers to the situation where the numbers of training examples of different classes in a classification task differ greatly. For example, with 998 negative examples and only 2 positive examples, a learner that simply predicts every new sample as negative already achieves 99.8% accuracy; however, such a learner is usually worthless because it cannot predict any positive example.

The decision rule of the classifier is: predict a positive example if $\frac{y}{1-y} > 1$, where $y$ is the classifier's predicted probability of the positive class.

When the numbers of positive and negative examples in the training set differ, let $m^+$ denote the number of positive examples and $m^-$ the number of negative examples. Then the observed odds are $\frac{m^+}{m^-}$. Because we usually assume the training set is an unbiased sample of the true population, the observed odds represent the true odds. Therefore, the classifier should predict a positive example whenever its predicted odds exceed the observed odds, i.e., predict positive if $\frac{y}{1-y} > \frac{m^+}{m^-}$.

Because the classifier still makes decisions with its original rule (compare the odds against 1), we adjust its predicted value by letting $\frac{y'}{1-y'} = \frac{y}{1-y} \times \frac{m^-}{m^+}$.
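A minimal sketch of this rescaling / threshold-moving rule, assuming `y` holds the probability of the positive class output by some classifier; the class counts and probabilities below are invented.

```python
import numpy as np

m_pos, m_neg = 2, 998                      # hypothetical class counts in the training set
y = np.array([0.30, 0.60, 0.001, 0.01])    # predicted P(positive) for some new samples

# Standard rule: positive if y/(1-y) > 1, i.e. y > 0.5
standard = (y / (1 - y)) > 1

# Rescaled rule: positive if y/(1-y) > m+/m-  (equivalently, multiply the odds
# by m-/m+ and keep comparing against 1)
rescaled = (y / (1 - y)) * (m_neg / m_pos) > 1

print("standard rule:", standard)
print("rescaled rule:", rescaled)
```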
This adjustment is the basic strategy known as "rescaling". In practice there are three common approaches (a small sampling sketch follows the list):

  1. Directly "undersample" the negative examples in the training set, i.e., remove some negative examples so that the numbers of positive and negative examples become close, and then learn;
  2. "Oversample" the positive examples in the training set, i.e., add more positive examples so that the numbers of positive and negative examples become close, and then learn;
  3. Learn directly from the original training set, but when the trained classifier is used for prediction, embed the formula above into its decision process; this is called "threshold-moving".
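A rough sketch of random undersampling and oversampling (working on indices only, with invented counts); in practice oversampling usually uses methods such as SMOTE rather than plain duplication, to reduce overfitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced label vector: 998 negatives (0) and 2 positives (1)
y = np.array([0] * 998 + [1] * 2)
neg_idx = np.flatnonzero(y == 0)
pos_idx = np.flatnonzero(y == 1)

# Undersampling: keep only as many negatives as there are positives
under_idx = np.concatenate([rng.choice(neg_idx, size=len(pos_idx), replace=False), pos_idx])

# Oversampling: repeat (sample with replacement) positives up to the negative count
over_idx = np.concatenate([neg_idx, rng.choice(pos_idx, size=len(neg_idx), replace=True)])

print("undersampled set size:", len(under_idx))   # 4 examples
print("oversampled set size:", len(over_idx))     # 1996 examples
```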

"Re-scaling" is also the basis of "cost-sensitive learning".

Next chapter: Watermelon Book Reading Notes (4) - Decision Trees
