[Machine Learning] Reading Notes on Zhou Zhihua's Machine Learning, Chapter 3: Linear Models

1. Basic form

$f(\boldsymbol{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$

2. Linear regression

The mean squared error has a very intuitive geometric meaning: it corresponds to the commonly used Euclidean distance. The method of solving a model by minimizing the mean squared error is called the "least squares method". In linear regression, the least squares method tries to find a straight line such that the sum of the Euclidean distances from all samples to the line is as small as possible.
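As a concrete illustration (my own sketch, not code from the book), the following NumPy snippet fits $f(\boldsymbol{x}) = \boldsymbol{w}^{\top}\boldsymbol{x} + b$ by least squares; the function and variable names are illustrative only.

```python
import numpy as np

def fit_least_squares(X, y):
    """Fit f(x) = w @ x + b by minimizing the mean squared error.

    X: (m, d) matrix of samples, y: (m,) vector of targets.
    """
    m = X.shape[0]
    X_aug = np.hstack([X, np.ones((m, 1))])            # extra column of 1s absorbs the bias b
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # solves min ||X_aug @ theta - y||^2
    return theta[:-1], theta[-1]                       # (w, b)

# Toy usage: recover a known linear relationship from noisy samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3 + 0.01 * rng.normal(size=100)
w, b = fit_least_squares(X, y)
print(w, b)   # approximately [1.5, -2.0, 0.5] and 0.3
```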

 

3. Log-odds regression (logistic regression)

 

 

 

If the model output y is regarded as the probability that sample x is a positive example, then 1 − y is the probability that it is a negative example. The ratio of the two, $\frac{y}{1-y}$, is called the "odds" and reflects the relative likelihood of x being a positive example. Taking the logarithm of the odds yields the "log odds" (also known as the logit), $\ln\frac{y}{1-y}$.

 

 

Log-odds regression approximates the log odds with a linear model, $\ln\frac{y}{1-y} = \boldsymbol{w}^{\top}\boldsymbol{x} + b$, which is why the corresponding model is called "log-odds regression" (logistic regression). Note that although its name contains "regression", it is actually a classification learning method. The method has many advantages: it models the class probability directly without requiring any prior assumption about the data distribution, which avoids the problems caused by an inaccurate distribution assumption; it does not merely predict a "class" but yields approximate probability predictions, which is useful for many tasks where decisions must be aided by probabilities; and the objective is a convex function that is differentiable to any order, with good mathematical properties, so many existing numerical optimization algorithms can be applied directly to find the optimal solution.

 

4. How is log-odds regression trained?

    We know that the output of the log-odds regression function is a probability. A better log-odds regression model is therefore one whose output is larger (closer to 1) when the input sample is a positive example, and smaller (closer to 0) when the input sample is a negative example. So we can build the following function:

$\ell(\theta) = \sum_{i=1}^{m} \big[\, y_i \ln h_\theta(x_i) + (1 - y_i) \ln\big(1 - h_\theta(x_i)\big) \,\big]$

In the formula above, $h_\theta(x_i)$ is the output of the log-odds regression model for sample $x_i$, and $y_i$ is the sample's true label: $y_i = 1$ when the sample is a positive example and $y_i = 0$ when it is a negative example. The function adds up, over all samples, (the logarithm of) the probability that each sample takes its true label. If we maximize $\ell(\theta)$, we find a set of parameters $\theta$ under which each sample is assigned as much probability as possible to its true label, i.e. the model output for positive examples is as close to 1 as possible and the output for negative examples is as close to 0 as possible. This gives us a good log-odds regression model.
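A minimal training sketch under the usual assumptions (not code from the book): $h_\theta(x)$ is taken to be the sigmoid of a linear function of the input, and $\ell(\theta)$ is maximized by plain gradient ascent; the learning rate and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X_aug, y):
    """l(theta): sum over samples of the log-probability of the true label."""
    h = sigmoid(X_aug @ theta)                     # h_theta(x_i) for every sample
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Maximize l(theta) by gradient ascent; the gradient is X_aug^T (y - h)."""
    m = X.shape[0]
    X_aug = np.hstack([X, np.ones((m, 1))])        # bias term folded into theta
    theta = np.zeros(X_aug.shape[1])
    for _ in range(n_iter):
        h = sigmoid(X_aug @ theta)
        theta += lr / m * (X_aug.T @ (y - h))      # step uphill on the log-likelihood
    return theta

# Toy usage: two Gaussian blobs labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
theta = fit_logistic(X, y)
X_aug = np.hstack([X, np.ones((100, 1))])
print(log_likelihood(theta, X_aug, y))   # log-likelihood of the training set under the fitted model
```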

 

5. Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) is a classic linear learning method.

The idea of LDA is very simple: given a training set, try to project the examples onto a straight line such that the projections of same-class samples are as close together as possible and the projections of different-class samples are as far apart as possible. When classifying a new sample, project it onto the same line and determine its category from the position of its projection. Figure 3.3 in the book gives a two-dimensional schematic.
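For the two-class case, the closed-form direction derived in the book is $\boldsymbol{w} \propto \mathbf{S}_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$, where $\mathbf{S}_w$ is the within-class scatter matrix and $\boldsymbol{\mu}_0, \boldsymbol{\mu}_1$ are the class means. Below is a minimal NumPy sketch of that computation; the function names are my own.

```python
import numpy as np

def lda_direction(X0, X1):
    """Two-class LDA: projection direction w proportional to S_w^{-1} (mu0 - mu1).

    X0, X1: (m0, d) and (m1, d) matrices holding the samples of the two classes.
    """
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix: summed scatter of both classes around their means.
    S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(S_w, mu0 - mu1)   # solve S_w w = mu0 - mu1 rather than inverting S_w
    return w / np.linalg.norm(w)

# Projecting a sample is just the dot product x @ w; a new sample can then be
# classified by which projected class mean its projection lies closer to.
```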

 

 

6. Multiclass learning

There are generally three kinds of splitting strategies: one-vs-one (OvO), one-vs-rest (OvR), and many-vs-many (MvM).

One-vs-one (OvO) strategy: suppose there are four categories A, B, C, and D, and we need to determine which of the four a sample belongs to. We can train six binary classifiers in advance: A/B, A/C, A/D, B/C, B/D, C/D. The sample to be classified is then fed to all six classifiers. Suppose the classification results are A, A, A, B, D, and C respectively: three of the six classifiers consider the sample to be of class A, while B, C, and D each receive one vote, so we take the sample to be of class A.

One-vs-rest (OvR) strategy: again suppose there are four categories A, B, C, and D. We can train four binary classifiers in advance: A/BCD, B/ACD, C/ABD, D/ABC, each of which outputs a confidence value. The sample to be classified is fed to all four classifiers. Suppose the four results are "the probability of belonging to A is 0.9", "the probability of belonging to B is 0.8", "the probability of belonging to C is 0.7", and "the probability of belonging to D is 0.6"; we then take the sample to belong to A.
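The two decision rules can be written down in a few lines. This is only a sketch: it assumes hypothetical trained binary classifiers, where an OvO classifier's predict(x) returns one of its two class labels and an OvR classifier's score(x) returns a confidence for its positive class.

```python
from collections import Counter

def predict_ovo(pairwise_clfs, x):
    """OvO: one classifier per class pair; the class with the most votes wins."""
    votes = Counter(clf.predict(x) for clf in pairwise_clfs)
    return votes.most_common(1)[0][0]

def predict_ovr(per_class_clfs, x):
    """OvR: one classifier per class (dict {label: classifier}); highest confidence wins."""
    return max(per_class_clfs, key=lambda label: per_class_clfs[label].score(x))
```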

Many-vs-many (MvM) strategy: each split takes several classes as the positive class and several other classes as the negative class. The positive and negative class construction must be specially designed and cannot be chosen arbitrarily. A commonly used technique is error-correcting output codes (ECOC), which works in two steps:

  • Coding: split the N categories M times; each split takes some categories as the positive class and the rest as the negative class, forming a binary training set. In this way M training sets are produced and M classifiers can be trained.
  • Decoding: the M classifiers each make a prediction on the test sample; these predicted labels form a code. Compare this code with each category's own code and return the category with the smallest distance as the final prediction.

 

 

Take the example from the original book as a detailed demonstration:
suppose there is a training data set whose samples fall into four categories C1, C2, C3, C4.
To make it concrete, imagine that watermelons are divided into first-grade, second-grade, third-grade, and fourth-grade melons, and we want to train a classification system to judge the grade of a watermelon.

We split the training dataset five times

  • For the first time, mark C2 as a positive example and the others as negative examples, and train a binary classifier f1
  • For the second time, mark C1 and C3 as positive examples and the others as negative examples, and train a binary classifier f2
  • For the third time, mark C3 and C4 as positive examples and the others as negative examples, and train a binary classifier f3
  • For the fourth time, mark C1, C2, and C4 as positive examples and the others as negative examples, and train a binary classifier f4
  • The fifth time, mark C1 and C3 as positive examples and the others as negative examples, and train a binary classifier f5

According to the process of these five divisions, each class gets an encoding (vector):

  • C1:(-1,1,-1,1,1)
  • C2:(1,-1,-1,1,-1)
  • C3:(-1,1,1,-1,1)
  • C4:(-1,-1,1,1,-1)

Now suppose a test sample arrives and the predictions of the five classifiers are
f1: negative, f2: negative, f3: positive, f4: negative, f5: positive,
so the code (vector) corresponding to the test sample is (-1, -1, 1, -1, 1).
We then compute the distance between this code and each of the four class codes, for example using the Euclidean distance; the distance to C3 turns out to be the smallest, so the test sample is judged to belong to class C3.
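The decoding step can be reproduced directly from the codes listed above; the small script below (illustrative, not from the book) computes the Euclidean distances and picks the closest class.

```python
import numpy as np

# Class codes produced by the five splits above.
class_codes = {
    "C1": np.array([-1,  1, -1,  1,  1]),
    "C2": np.array([ 1, -1, -1,  1, -1]),
    "C3": np.array([-1,  1,  1, -1,  1]),
    "C4": np.array([-1, -1,  1,  1, -1]),
}
# Test sample's code: f1..f5 predicted negative, negative, positive, negative, positive.
test_code = np.array([-1, -1, 1, -1, 1])

distances = {c: np.linalg.norm(test_code - code) for c, code in class_codes.items()}
print(distances)                           # C3 has the smallest distance
print(min(distances, key=distances.get))   # -> C3
```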
