Machine Learning: Linear Classification

Mathematical model

The goal of classification is to assign the input x to one of a set of discrete classes Ck. In the plane, two sets of data can be separated by a straight line; in general, the separating surface in a D-dimensional input space is a (D-1)-dimensional hyperplane. The function that defines it is called a linear discriminant function, and the surface itself is also called the decision surface.

The linearity here is genuine linearity: the model is a linear function of the input variable x, so the decision surface is a straight line, a plane, or a hyperplane.

      1. Two-class classification

                             y(x) = w^T x + w_0

Substituting a point x: if y(x) > 0, x is assigned to class C1; if y(x) < 0, it is assigned to class C2.

                   w^T x + w_0 = 0 on the decision surface, so the normal distance of the surface from the origin is -w_0 / ||w||

w determines the orientation of the decision surface (the vector w is perpendicular to the surface), and w_0 determines its position.
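
As a small illustration, the decision rule above can be written directly in code. The following is a minimal sketch; the weights w and w0 are hand-picked placeholders, not values learned from data:

```python
import numpy as np

# Minimal sketch of a two-class linear discriminant y(x) = w^T x + w0.
# The weights below are arbitrary placeholders, not learned from data.
w = np.array([2.0, -1.0])   # orientation of the decision surface (normal vector)
w0 = 0.5                    # bias term, fixes the position of the surface

def classify(x):
    """Assign x to C1 if y(x) > 0, otherwise to C2."""
    y = w @ x + w0
    return "C1" if y > 0 else "C2"

print(classify(np.array([1.0, 0.0])))   # y =  2.5 > 0 -> C1
print(classify(np.array([0.0, 3.0])))   # y = -2.5 < 0 -> C2
```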

     2. Multi-class classification

Multi-class classification is a generalization of the two-class case. If there are K classes, we introduce a set of K linear discriminant functions, one per class. The target t of a multi-class problem is usually written in 1-of-K form, e.g. t = [0 0 0 1 0 0].

                            y_k(x) = w_k^T x + w_{k0},   and x is assigned to class C_k if y_k(x) > y_j(x) for all j ≠ k

 

In other words, x is assigned to whichever class has the largest discriminant function value.

For convenience, we can absorb w_0 into the weight vector and write the discriminants in the following augmented form:

                                                         y_k(x) = w̃_k^T x̃,   where w̃_k = (w_{k0}, w_k^T)^T and x̃ = (1, x^T)^T
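
A short sketch of this K-class decision rule in the augmented notation; here the weight matrix (one column of (w_k0, w_k) per class) is filled with random placeholder values rather than learned ones:

```python
import numpy as np

# Sketch of the K-class rule with the bias absorbed: y_k(x) = w~_k^T x~.
# W_tilde holds one column per class; random values stand in for learned weights.
K, D = 3, 2
rng = np.random.default_rng(0)
W_tilde = rng.normal(size=(D + 1, K))     # row 0 plays the role of w_k0

def predict_class(x):
    """Return the index k whose discriminant y_k(x) is largest."""
    x_tilde = np.concatenate(([1.0], x))  # augmented input x~ = (1, x)
    y = W_tilde.T @ x_tilde               # all K discriminant values
    return int(np.argmax(y))

print(predict_class(np.array([0.5, -1.2])))
```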

Linear classification

         1. Least squares for classification

If we want a classification model that automatically classifies incoming data, we can treat this as a curve-fitting problem: given training data whose categories are known, we estimate an analytical solution for W using the least squares method. This process is called learning.

                              E_D(W̃) = (1/2) Tr{(X̃W̃ − T)^T (X̃W̃ − T)},   with solution W̃ = (X̃^T X̃)^{-1} X̃^T T

In the formula above, X is the matrix of input data, T is the matrix of target values, and XW is the model output. Setting the derivative with respect to W to zero yields the analytical solution.
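
A minimal sketch of this closed-form fit, assuming X stores one data point per row and T holds 1-of-K target rows; np.linalg.pinv is used in place of an explicit matrix inverse for numerical robustness:

```python
import numpy as np

def least_squares_fit(X, T):
    """W~ = (X~^T X~)^{-1} X~^T T, computed via the pseudo-inverse."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column of ones
    return np.linalg.pinv(X_tilde) @ T

def least_squares_predict(W_tilde, X):
    """Predicted class = argmax over the K model outputs X~ W~."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)

# Toy example: two Gaussian blobs with 1-of-K (here 1-of-2) targets.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (20, 2)), rng.normal([4, 4], 1.0, (20, 2))])
T = np.zeros((40, 2)); T[:20, 0] = 1; T[20:, 1] = 1
W = least_squares_fit(X, T)
print(least_squares_predict(W, X))
```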

Least squares is not a particularly good method for classification. A major problem is that it lacks robustness to outlying points, which can pull the decision boundary far away and give poor results. In fact, this failure should not surprise us: least squares corresponds to maximum likelihood under an assumed Gaussian conditional distribution, but the binary target vector t clearly does not follow a Gaussian distribution.

       2. Fisher's linear discriminant

The idea of Fisher's linear discriminant is to project the high-dimensional data x onto a one-dimensional line.

Data that are well separated in the high-dimensional space may become crowded together, or even overlap, when projected onto a line, making classification difficult. However, we can look for the projection direction that best separates the projected points: points of the same class are clustered as tightly as possible, while the projected means of different classes are as far apart as possible.

The core problem of Fisher's linear discriminant is therefore to find such a line. The approach is to construct the Fisher criterion function, whose numerator is the between-class scatter after projection and whose denominator is the within-class scatter after projection; the task then becomes one of maximizing this criterion function.

The specific process is as follows:

                            y = w^T x

This formula projects x onto a one-dimensional space; y is a single scalar value.

                        J(w) = (m_2 − m_1)^2 / (s_1^2 + s_2^2)

This is the Fisher criterion function for the two-class case: m_1 and m_2 are the projected class means, and s_1^2, s_2^2 are the within-class scatters (variances) of the projected data.

More generally, we can express it as follows:

                        J(w) = (w^T S_B w) / (w^T S_W w)

This is the result of substituting y = w^T x; since what we want to find is w, we need to write the Fisher criterion as an explicit function of w.

Here S_B is the between-class scatter matrix, S_B = (m_2 − m_1)(m_2 − m_1)^T, and S_W is the within-class scatter matrix, the sum of the scatter matrices of the individual classes (m_1, m_2 now denote the class mean vectors in the original space).

Maximizing this function then gives the direction of w:

                               w ∝ S_W^{-1} (m_2 − m_1)

We can see from this that Fisher's linear discriminant is a method for finding the optimal projection direction; it is not, by itself, a classifier. However, since the projected data are one-dimensional, the decision boundary reduces to a single point on the projection axis, i.e. a threshold y_0: when w^T x ≥ y_0 the sample is assigned to class C1, otherwise to class C2.
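
The following is a minimal two-class sketch of this procedure. The threshold y0 is chosen here, purely as a simple assumption, as the midpoint of the two projected class means; note that with w computed from S_W^{-1}(m2 − m1) the direction points from the C1 mean toward the C2 mean, so projections above the threshold are assigned to C2:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Direction w proportional to S_W^{-1} (m2 - m1) for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the scatter matrices of the two classes.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# Toy data for illustration only.
rng = np.random.default_rng(2)
X1 = rng.normal([0, 0], 1.0, (30, 2))   # class C1
X2 = rng.normal([3, 3], 1.0, (30, 2))   # class C2
w = fisher_direction(X1, X2)

# Threshold y0: midpoint of the projected class means (one simple choice).
y0 = 0.5 * (X1.mean(axis=0) @ w + X2.mean(axis=0) @ w)
x_new = np.array([2.5, 2.0])
# With this orientation of w, C2 projects above y0, so >= y0 means C2.
print("C2" if x_new @ w >= y0 else "C1")
```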

    3. Perceptron algorithm

The perceptron is a two-class algorithm. In this model, the input vector x is first transformed into a feature vector φ(x) by a fixed nonlinear transformation, and this feature vector is then used to construct a generalized linear model:

                                             y(x) = f( w^T φ(x) )

The nonlinear step function f used by the perceptron is defined as follows:

                                            f(a) = +1 if a ≥ 0,   f(a) = −1 if a < 0

We can see that this turns the classification problem into one of finding the weight vector w. With targets t_n ∈ {+1, −1}, a correctly classified pattern satisfies w^T φ(x_n) t_n > 0, while a misclassified pattern gives a value less than zero.

Therefore, summing over all the misclassified patterns gives the perceptron criterion function:

                                        E_P(w) = − Σ_{n∈M} w^T φ(x_n) t_n,   where M is the set of misclassified patterns

Minimizing E_P turns classification into an optimization problem. We cannot simply set the derivative with respect to w to zero and solve, because E_P is piecewise linear and the set of misclassified points itself changes with w; instead we use stochastic gradient descent, updating w over and over:

                                w^{(τ+1)} = w^{(τ)} − η ∇E_P(w) = w^{(τ)} + η φ(x_n) t_n

The learning rate η here is often simply set to 1, which gives the training process shown in the figure below.

(Figure: the perceptron decision boundary being updated step by step as misclassified points are processed.)
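
A minimal sketch of the perceptron training loop, with two simplifying assumptions: the learning rate is fixed to 1 and the feature map is simply φ(x) = (1, x):

```python
import numpy as np

def perceptron_train(X, t, max_epochs=100):
    """Stochastic perceptron updates: w <- w + phi(x_n) * t_n for misclassified points.

    X: (N, D) inputs; t: (N,) targets in {+1, -1}.
    Assumes phi(x) = (1, x) (identity features plus a bias) and eta = 1.
    """
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:    # misclassified (or exactly on the boundary)
                w += phi_n * t_n          # perceptron update with eta = 1
                errors += 1
        if errors == 0:                   # no misclassified points left: converged
            break
    return w

# Toy linearly separable data.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)), rng.normal([3, 3], 0.5, (20, 2))])
t = np.concatenate([-np.ones(20), np.ones(20)])
print(perceptron_train(X, t))
```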

The perceptron algorithm does not extend directly to the case of K > 2 classes, and for data that are not linearly separable it will never converge. Even for linearly separable data there may be many solutions, and the one found depends on the initial values of the parameters and on the order in which the data points are presented.


Origin blog.csdn.net/Eyesleft_being/article/details/80412973