Day 3 Chapter 3 Linear Models

  This chapter is another of the book's basic-theory chapters; I am still a little vague about some of the formulas near the end of it. It draws on basic linear algebra and probability theory, and covers several classic linear models applied to regression and classification (binary and multi-class) tasks.

3.1 Basic form

       Given an example x = (x1; x2; … ; xd) described by d attributes, where xi is the value of x on the i-th attribute, a linear model attempts to learn a function that makes its prediction through a linear combination of the attributes, namely:

f(x) = w1x1 + w2x2 + … + wdxd + b

Usually written in vector form as:

f(x) = wTx + b

where w = (w1; w2; … ; wd); once w and b are learned, the model is determined.

       Linear models are simple in form and easy to build, yet they embody some of the important basic ideas of machine learning. Many more powerful nonlinear models can be derived from linear models by introducing hierarchical structure or high-dimensional mappings. In addition, since w directly expresses the importance of each attribute in the prediction, linear models are highly interpretable. For example, in the watermelon problem one might learn f_ripe(x) = 0.2·x_color + 0.5·x_root + 0.3·x_knock + 1.
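As a small illustration of this basic form, here is a minimal NumPy sketch of f(x) = wT x + b using the ripeness weights from the watermelon example above; the 0–1 encoding of the attribute values is an assumption made purely for illustration.

```python
import numpy as np

# Minimal sketch of the basic linear model f(x) = w^T x + b.
# The weights mirror the ripeness example from the text
# (0.2 for color, 0.5 for root, 0.3 for knock sound, bias 1);
# the 0-1 attribute encoding below is made up for illustration.
w = np.array([0.2, 0.5, 0.3])   # one weight per attribute
b = 1.0                         # bias term

x = np.array([0.7, 0.9, 0.6])   # a hypothetical melon: (color, root, knock sound)

f_x = w @ x + b                 # linear combination of the attributes
print(f"predicted ripeness score: {f_x:.2f}")
```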

 

3.2 Linear regression

       Given a dataset D = {(x1, y1), (x2, y2), … , (xm, ym)}, where xi = (xi1; xi2; … ; xid) and yi ∈ R, "linear regression" attempts to learn a linear model that predicts the real-valued output labels as accurately as possible.

Considering first the simplest case, in which each sample has a single attribute, linear regression tries to learn

f(xi) = w·xi + b, such that f(xi) ≈ yi

We need to determine w and b. As introduced in Section 2.3, the mean squared error (2.2) is the most commonly used performance measure for regression tasks, so we can try to minimize the mean squared error, namely:

(w*, b*) = argmin(w,b) Σ i=1..m ( f(xi) − yi )² = argmin(w,b) Σ i=1..m ( yi − w·xi − b )²

where w* and b* denote the solutions for w and b.

       The mean squared error has a very intuitive geometric meaning: it corresponds to the commonly used Euclidean distance. The method of solving a model by minimizing the mean squared error is called the "least squares method". In linear regression, the least squares method tries to find a straight line such that the sum of the Euclidean distances from all samples to the line is as small as possible.
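To make the objective concrete, the following sketch computes the mean squared error of a hypothetical candidate line on three made-up points; least squares simply searches for the (w, b) that makes this number as small as possible.

```python
import numpy as np

# Made-up 1-D samples (x_i, y_i) and a candidate line y = w*x + b.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.2, 1.9, 3.2])
w, b = 1.0, 0.1

pred = w * x + b
mse = np.mean((pred - y) ** 2)   # the quantity that least squares minimizes
print(f"mean squared error: {mse:.4f}")   # 0.0200 for this candidate line
```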

       The process of solving for the w and b that minimize E(w,b) is called the least squares "parameter estimation" of the linear regression model. Taking the derivatives of E(w,b) with respect to w and b gives:

∂E(w,b)/∂w = 2( w·Σ i=1..m xi² − Σ i=1..m (yi − b)·xi )

∂E(w,b)/∂b = 2( m·b − Σ i=1..m (yi − w·xi) )

Setting the two equations above to zero yields the closed-form optimal solution for w and b:

w = Σ i=1..m yi·(xi − x̄) / ( Σ i=1..m xi² − (1/m)·( Σ i=1..m xi )² )

b = (1/m)·Σ i=1..m ( yi − w·xi )

where x̄ = (1/m)·Σ i=1..m xi is the mean of x.
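Below is a minimal sketch of the closed-form solution above, on made-up data; np.polyfit is used only as an independent check that the two formulas give the usual least-squares line.

```python
import numpy as np

# Closed-form least squares for the single-attribute case,
# following the two formulas above. Data are made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 2.9, 4.2, 4.8])
m = len(x)
x_bar = x.mean()

w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
b = np.mean(y - w * x)
print(f"closed form: w = {w:.3f}, b = {b:.3f}")

w_ref, b_ref = np.polyfit(x, y, deg=1)   # independent check
print(f"polyfit:     w = {w_ref:.3f}, b = {b_ref:.3f}")
```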

       More generally, in the dataset D the samples are described by d attributes; in this case we try to learn

f(xi) = wT xi + b, such that f(xi) ≈ yi

This is called "multivariate linear regression" (this part involves heavier matrix notation; the detailed linear-algebra derivation is given in the book).
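One standard route to the multivariate solution, and the one the book's derivation follows, is to absorb b into w by working with an augmented data matrix and solving the normal equation. The sketch below shows that route on synthetic data, assuming the augmented matrix has full column rank; the names X_hat and w_hat are just local variables for this illustration.

```python
import numpy as np

# Multivariate linear regression via the normal equation:
# append a constant 1 to each sample so b is absorbed into w,
# then solve (X_hat^T X_hat) w_hat = X_hat^T y.
# Assumes X_hat^T X_hat is invertible; data below are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # 50 samples, d = 3 attributes
true_w, true_b = np.array([0.2, 0.5, 0.3]), 1.0
y = X @ true_w + true_b + rng.normal(scale=0.05, size=50)

X_hat = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented data matrix [X, 1]
w_hat = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

print("learned w:", np.round(w_hat[:-1], 3), " learned b:", round(w_hat[-1], 3))
```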

       When we make the prediction of the linear model approach the real-valued label y, we obtain the linear regression model. For ease of notation, write it as:

y = wT x + b

If instead we want the output label to vary on an exponential scale, we can let the model approximate the logarithm of the label, that is:

ln y = wT x + b

This is "log-linear regression". Although it is formally still linear regression, it is actually trying to make e^(wT x + b) approach y, as the following figure makes clear:

(Figure: log-linear regression; the linear predictor wT x + b is mapped to y through the exponential function.)

More generally, consider a monotonic differentiable function g(·), and let:

y = g^(-1)( wT x + b )

The resulting model is called a "generalized linear model", where the function g(·) is called the "link function". Clearly, log-linear regression is the special case of the generalized linear model in which g(·) = ln(·).
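Here is a minimal sketch of log-linear regression viewed as a generalized linear model with g(·) = ln(·): fit an ordinary least-squares line to ln y, then map the prediction back through the exponential. The synthetic data (and the requirement that y > 0) are assumptions of this illustration, not part of the book's text.

```python
import numpy as np

# Log-linear regression: ln y = w*x + b, i.e. the generalized linear
# model with link function g(.) = ln(.). Synthetic positive targets.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 40)
y = np.exp(1.5 * x + 0.3) * rng.lognormal(sigma=0.05, size=x.size)

w, b = np.polyfit(x, np.log(y), deg=1)   # ordinary least squares on ln(y)
y_pred = np.exp(w * x + b)               # map back through the inverse link

print(f"recovered w = {w:.3f}, b = {b:.3f}")   # close to 1.5 and 0.3
```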

 

3.3 Logistic Regression (Log-odds Regression)

       Consider the binary classification task, where the output label is y ∈ {0, 1}, while the prediction produced by the linear regression model, z = wT x + b, is a real value. We therefore need to convert this real value into a 0/1 value; the ideal converter is the "unit-step function":

y = 0,   if z < 0
y = 0.5, if z = 0
y = 1,   if z > 0

That is, if the predicted value z is greater than zero the sample is judged a positive example, if it is less than zero a negative example, and if it is exactly zero it may be judged either way, as shown in the following figure:

(Figure: the unit-step function compared with the logistic function, which approximates it smoothly.)

However, the unit-step function is not continuous, so it cannot be used directly as g^(-1)(·) in a generalized linear model; we would like a surrogate that approximates it while being monotonic and differentiable. The logistic function is such a commonly used surrogate:

y = 1 / ( 1 + e^(−z) )

Substituting the logistic function for g^(-1)(·), we get:

y = 1 / ( 1 + e^(−(wT x + b)) )

Rearranging gives:

ln( y / (1 − y) ) = wT x + b

If y is regarded as the probability that sample x is a positive example, then 1 − y is the probability that it is a negative example, and the ratio of the two is:

y / (1 − y)

This ratio is called the "odds"; it reflects the relative likelihood that x is a positive example. Taking the logarithm of the odds gives the "log odds" (also called the logit):

ln( y / (1 − y) )
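For example, if y = 0.8 then the odds are 0.8 / 0.2 = 4 and the log odds are ln 4 ≈ 1.39; odds greater than 1 (log odds greater than 0) mean that x is more likely to be a positive example than a negative one.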

It can be seen that the model is in effect using the prediction of a linear regression model to approximate the log odds of the true label; the corresponding model is therefore called "logistic regression" (also "logit regression"). Note that although its name contains "regression", it is actually a classification learning method.
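Below is a minimal sketch of a logistic regression classifier trained by plain gradient descent on the negative log-likelihood. The book derives a maximum-likelihood solution (with Newton-style updates), so this is only a simplified stand-in, and the two-blob synthetic data are made up for illustration.

```python
import numpy as np

# Logistic regression on synthetic 2-D data, trained by gradient descent
# on the negative log-likelihood (a simplification of the book's
# maximum-likelihood treatment).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # logistic function of w^T x + b
    w -= lr * X.T @ (p - y) / len(y)         # gradient of the average negative log-likelihood
    b -= lr * np.mean(p - y)

acc = np.mean((p >= 0.5) == (y == 1))
print(f"training accuracy: {acc:.2f}")
```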

 

3.4 Linear Discriminant Analysis

       Linear Discriminant Analysis (LDA) is a classic linear learning method. It was first proposed by [Fisher, 1936] for the binary classification problem, so it is also called "Fisher discriminant analysis".

       The idea of LDA is very simple: given a training set, try to project the samples onto a straight line so that the projections of samples of the same class are as close together as possible, while the projections of samples of different classes are as far apart as possible. The following figure makes this clear at a glance:

(Figure: two classes of 2-D samples projected onto a line; same-class projections cluster together while the two classes separate.)

The subsequent derivation process is in the book.
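As a small illustration while the derivation stays in the book, here is a minimal two-class LDA sketch under the usual formulation w = S_w^(-1)(μ0 − μ1), where S_w is the within-class scatter matrix; the 2-D synthetic data are made up for this illustration.

```python
import numpy as np

# Two-class LDA: project onto w = S_w^{-1} (mu0 - mu1), where S_w is the
# within-class scatter matrix. Synthetic 2-D data for illustration.
rng = np.random.default_rng(3)
X0 = rng.normal(loc=[-1.0, 0.0], scale=0.8, size=(40, 2))   # class 0 samples
X1 = rng.normal(loc=[1.5, 1.0], scale=0.8, size=(40, 2))    # class 1 samples

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)  # within-class scatter
w = np.linalg.solve(S_w, mu0 - mu1)                          # projection direction

# Same-class projections should cluster; the two class means should separate.
print("class 0 projection mean:", round(float((X0 @ w).mean()), 3))
print("class 1 projection mean:", round(float((X1 @ w).mean()), 3))
```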

 

3.5 Multi-class learning

       In reality, multi-class classification tasks are frequently encountered. Some binary classification learning methods can be generalized directly to the multi-class case, but in most situations multi-class problems are solved with binary classifiers, based on a few basic strategies.

       Without loss of generality, consider N classes C1, C2, … , CN. The basic idea of multi-class learning is "decomposition", that is, splitting the multi-class task into several binary classification tasks and solving those. The three most classic splitting strategies are "One vs. One" (OvO for short), "One vs. Rest" (OvR for short), and "Many vs. Many" (MvM for short).
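As a sketch of one of these strategies, the code below implements One vs. Rest (OvR) with the small gradient-descent logistic learner from Section 3.3 as the binary classifier; the three-cluster synthetic dataset and the helper name fit_binary are assumptions of this illustration only.

```python
import numpy as np

# One vs. Rest (OvR): train N binary classifiers, each treating one class
# as positive and all others as negative, then predict the class whose
# classifier is most confident.
def fit_binary(X, y, lr=0.1, steps=2000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

rng = np.random.default_rng(4)
centers = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 2.5]])       # N = 3 classes
X = np.vstack([rng.normal(c, 0.7, size=(30, 2)) for c in centers])
y = np.repeat(np.arange(3), 30)

models = [fit_binary(X, (y == c).astype(float)) for c in range(3)]
scores = np.column_stack([X @ w + b for w, b in models])         # one score per class
pred = scores.argmax(axis=1)                                     # most confident classifier wins
print(f"training accuracy: {np.mean(pred == y):.2f}")
```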

 

3.6 Class Imbalance Problem

 

(read subsequent chapters, to be continued)
