Machine Learning SVM

1. Linear classifier:

    First, consider a very, very simple (linearly separable) classification problem: we want a straight line that separates the black points from the white points in the figure below. The line drawn in the figure is obviously one of the lines we are looking for (there can be infinitely many such lines).

[Figure: a linearly separable set of black and white points, with one possible separating line drawn]

Suppose we label the black points as -1 and the white points as +1, and let the line be f(x) = wx + b, where x and w are vectors; written out in full this is f(x) = w1x1 + w2x2 + … + wnxn + b. When the dimension of x is 2, f(x) represents a straight line in two-dimensional space; when the dimension of x is 3, f(x) represents a plane in three-dimensional space; and when the dimension of x is n > 3, it represents an (n-1)-dimensional hyperplane in n-dimensional space. This is fairly basic material; if it is unfamiliar, you may need to review some calculus and linear algebra.

As I just said, we label the black and white points -1 and +1 respectively, so when a new point x needs to be assigned to one of the two categories we can predict with sgn(f(x)), where sgn is the sign function: sgn(f(x)) = +1 when f(x) > 0, and sgn(f(x)) = -1 when f(x) < 0. But how do we obtain an optimal dividing line f(x)? The lines in the figure below show several possible choices of f(x):

[Figure: several possible separating lines f(x)]
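Written as code, this prediction rule is just a sign check on f(x). Below is a minimal sketch (the weights, bias, and query points are made-up values, not from the original article):

```python
# A minimal sketch of the prediction rule sgn(f(x)) with f(x) = w . x + b.
# The values of w, b and the query points are made up purely for illustration.
import numpy as np

def predict(w, b, x):
    return 1 if np.dot(w, x) + b > 0 else -1   # sgn(f(x))

w, b = np.array([1.0, -2.0]), 0.5
print(predict(w, b, np.array([3.0, 1.0])))   # f(x) = 3 - 2 + 0.5 = 1.5 > 0  ->  +1
print(predict(w, b, np.array([0.0, 2.0])))   # f(x) = 0 - 4 + 0.5 = -3.5 < 0 ->  -1
```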

    A very intuitive criterion is to make this line as far away as possible from the nearest points of the given sample. That sentence is a bit awkward to read, so here are a few pictures to illustrate it. The first way of dividing the points:

[Figure: the first way of dividing the points]

    The second way of dividing the points:

[Figure: the second way of dividing the points]

 Which of the two divisions is better? Intuitively, the larger the separating gap the better: the further apart the points of the two categories are pushed, the better. It is like judging whether a person is male or female: it is hard to get wrong, because the gap between those two categories is very large, which lets us classify more accurately. In SVM this is called the maximum margin, and it is one of the theoretical foundations of SVM. There are many reasons for choosing, as the dividing plane, the function that maximizes this gap: from a probabilistic point of view, it makes the confidence of the least confident point as large as possible (which sounds like a mouthful); from a practical point of view, this simply works very well.

The points circled in red and blue in the figure below are the so-called support vectors.

[Figure: the margin between the two classes, with the support vectors circled in red and blue]

The figure above illustrates the between-class gap mentioned earlier. The Classifier Boundary is f(x); the red and blue lines (the plus plane and the minus plane) are the planes on which the support vectors lie; and the gap between the red and blue lines is the between-class gap M that we want to maximize.

[Figure: the geometric definition of the gap M between the plus plane and the minus plane]

    The formula for M is given here directly (it is easily obtained from high-school analytic geometry, as the distance between the two parallel lines wx + b = 1 and wx + b = -1):

    M = 2 / ||w||

    In addition, the support vectors lie on the lines wx + b = 1 and wx + b = -1. Multiplying by the category y of the point (remember? y is either +1 or -1), the condition satisfied by a support vector can be written more simply as y(wx + b) = 1. Once the support vectors are determined, the separating function is determined; the two problems are equivalent. Another benefit of the support vectors is that the points lying behind them do not need to participate in the computation.
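As a quick numeric check (with made-up numbers, not the data from the figures), the margin 2 / ||w|| and the support-vector condition y(wx + b) = 1 can be verified directly:

```python
# A minimal sketch with made-up numbers: for a fixed w and b, the margin is
# M = 2 / ||w||, and the support vectors are exactly the points for which
# y_i * (w . x_i + b) equals 1.
import numpy as np

w, b = np.array([0.5, 0.5]), 0.0
X = np.array([[1.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

print("margin M = 2/||w|| =", 2.0 / np.linalg.norm(w))   # about 2.83
print("y_i (w.x_i + b)    =", y * (X @ w + b))           # the two support vectors give exactly 1
```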

    To close this subsection, here is the expression we want to optimize:

    max 2 / ||w||    (equivalently, min (1/2)||w||^2)

    ||w|| denotes the 2-norm of w, the same quantity that appears in the denominator of the expression for M obtained above, M = 2 / ||w||. Maximizing that expression is equivalent to minimizing ||w||, and since squaring is monotonic on non-negative values we may just as well minimize ||w||^2 with a coefficient of 1/2 in front; readers familiar with this will see at once that the square and the 1/2 are there purely to make differentiation convenient.

    This objective comes with constraints; written out in full it is (the primal problem):

    min (1/2)||w||^2
    s.t.  y_i (w · x_i + b) ≥ 1,   i = 1, 2, …, n

    Here s.t. stands for "subject to", meaning "under the constraints that follow"; this abbreviation appears all the time in SVM papers. The problem is in fact a quadratic programming (QP) problem with constraints, and it is a convex problem. Convexity means there is no merely local optimal solution: you can imagine a funnel; no matter where we drop a small ball into the funnel, it will eventually roll out at the bottom, that is, reach the global optimum. The constraints after s.t. can be regarded as a convex polyhedron, and all we have to do is find the optimal solution within this convex polyhedron. These topics are not expanded here, because expanding them would fill a book; take a look at Wikipedia if in doubt.
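Since the primal is just a small constrained convex problem, it can be handed to any off-the-shelf constrained solver. Below is a minimal sketch, not from the original article, that feeds the hard-margin primal to SciPy's general-purpose SLSQP optimizer; the toy data points are assumptions made for illustration, and in practice a dedicated QP solver (or SMO, mentioned later) would be used instead:

```python
# A minimal sketch: solve  min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1
# with a general-purpose constrained optimizer (not a dedicated QP solver).
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy data, assumed for illustration.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:-1]                      # the last entry of params is b
    return 0.5 * np.dot(w, w)            # (1/2)||w||^2, b is not penalized

constraints = [
    {"type": "ineq", "fun": lambda p, i=i: y[i] * (np.dot(p[:-1], X[i]) + p[-1]) - 1.0}
    for i in range(len(y))               # y_i (w . x_i + b) - 1 >= 0
]

x0 = np.array([1.0, 1.0, 0.0])           # a feasible starting guess for (w, b)
res = minimize(objective, x0, constraints=constraints, method="SLSQP")
w, b = res.x[:-1], res.x[-1]
print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))
```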

 

2. Transforming into the dual problem and solving it:

    This optimization problem can be solved with the method of Lagrange multipliers. Using the theory of the KKT conditions, the Lagrangian objective function of the problem is given here directly:

    L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [ y_i (w · x_i + b) - 1 ]

    Working through this formula requires some knowledge of Lagrangian duality (pluskid also has an article dedicated to this topic) and a certain amount of algebraic derivation; if you are not interested, you can skip straight ahead and use the conclusion shown in blue. This part of the derivation mainly follows pluskid's article.

    First, minimize L with respect to w and b: setting the partial derivatives of L with respect to w and b to zero gives two expressions in the variables of the original problem:

    ∂L/∂w = 0  ⇒  w = Σ_i α_i y_i x_i
    ∂L/∂b = 0  ⇒  Σ_i α_i y_i = 0

    Substituting these two equations back into L(w, b, α) gives the expression of the dual problem:

    L(w, b, α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)

    The new problem, together with its constraints, is (the dual problem):

    max_α  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
    s.t.   α_i ≥ 0,   Σ_i α_i y_i = 0

    This is the formula we ultimately need to optimize. At this point we have obtained the optimization problem for the linearly separable case.

    There are many ways to solve this problem, SMO being one of them. Personally I think that solving this kind of constrained convex optimization problem and deriving the convex optimization problem in the first place are two relatively independent things, so this article deliberately does not cover how to solve it; if I have time I may write a separate article about that :).
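To make the result concrete, here is a minimal sketch (my own illustration, not part of the original article) that lets scikit-learn solve the dual and then checks the relation w = Σ α_i y_i x_i obtained above; SVC is built on libsvm, which solves this dual with an SMO-type algorithm:

```python
# A minimal sketch: let a library solve the dual, then verify w = sum_i alpha_i y_i x_i
# using only the support vectors (scikit-learn's dual_coef_ stores y_i * alpha_i).
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # a very large C approximates the hard margin

w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("support vectors:\n", clf.support_vectors_)
print("w from the dual variables:", w_from_dual)
print("w reported by the library:", clf.coef_, " b:", clf.intercept_)
```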

 

3. The linearly inseparable case (soft margin):

    Next, let's talk about the linearly inseparable case, because the assumption of linear separability is too restrictive:

    The figure below shows a typical linearly inseparable data set: there is no way to divide it with a straight line into two regions such that each region contains points of only one color.

[Figure: a typical linearly inseparable data set]

     To obtain a classifier in this situation, there are two approaches. One is to separate the points completely with a curve; a curve is a nonlinear boundary, which is closely related to the kernel functions discussed later:

[Figure: a curve that completely separates the two classes]

     The other approach is to still use a straight line, but without insisting on separability, that is, to tolerate the misclassified points while adding a penalty function so that the misclassifications are as few and as reasonable as possible. In fact, a classification function that is perfect on the training data is often not what we want, because some of the training data is simply noise: mistakes may have been made when the class labels were added by hand. If we learn these wrong points during training, the model is bound to make mistakes when it encounters them again later (if a teacher explains some point incorrectly in a lecture and you still take it as true, you will inevitably get it wrong in the exam). This process of learning the "noise" in the data is called overfitting, and it is a cardinal sin in machine learning: we would rather learn a little less than learn more wrong knowledge. Back to the topic, how do we use a straight line to divide points that are not linearly separable?

     We can attach a small penalty to every misclassified point; the penalty for a misclassified point is its distance from where it should be:

[Figure: misclassified points and the penalty distances to their corresponding boundaries]

 

    In the figure above, the blue and red lines are the boundaries on which the support vectors lie, the green line is the decision function, and the purple segments are the distances from the misclassified points to their corresponding boundaries. We can therefore add a penalty function to the original objective, with the constraints below:

    min (1/2)||w||^2 + C Σ_{i=1..R} ε_i
    s.t.  y_i (w · x_i + b) ≥ 1 - ε_i,   ε_i ≥ 0,   i = 1, …, R

    The blue part of the formula is the penalty term added on top of the linearly separable problem. When x_i is on the correct side, ε_i = 0; R is the total number of points; and C is a user-specified coefficient indicating how heavily misclassified points are penalized. When C is large there will be fewer misclassified points, but the overfitting may be more serious; when C is small there may be many misclassified points, and the resulting model may not be accurate. Choosing C is therefore quite an art, but in most cases it is chosen from experience.
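As a quick illustration of the trade-off controlled by C, here is a minimal sketch (the data set and the values of C are assumptions of mine, not an experiment from the original article): with a small C many training points are misclassified, while a large C classifies more training points correctly at the risk of overfitting.

```python
# A minimal sketch of the effect of the penalty coefficient C on data that is not
# linearly separable (two overlapping blobs, assumed purely for illustration).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_errors = np.sum(clf.predict(X) != y)
    print(f"C = {C:>6}: {n_errors} misclassified training points, "
          f"{len(clf.support_vectors_)} support vectors")
```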

    The next step is the same as before: form the Lagrangian dual problem and obtain the expression of the dual of this problem:

    max_α  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
    s.t.   0 ≤ α_i ≤ C,   Σ_i α_i y_i = 0

    The blue part marks the difference from the dual of the linearly separable problem: in the dual obtained for the linearly inseparable case, the range of α changes from [0, +∞) to [0, C], and the added penalty variables ε do not add any complexity to the dual problem.
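This box constraint on α is easy to observe empirically. Below is a small assumed sketch (reusing the kind of noisy data from the previous example): in scikit-learn, dual_coef_ stores y_i · α_i, so its absolute values must never exceed C.

```python
# A minimal sketch: after fitting a soft-margin SVM, check 0 <= alpha_i <= C,
# i.e. |y_i * alpha_i| <= C for every support vector.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
assert np.all(np.abs(clf.dual_coef_) <= C + 1e-9)   # the alphas are bounded by C
print("largest |y_i * alpha_i|:", np.max(np.abs(clf.dual_coef_)))
```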

 

4. The kernel function:

    In the inseparable case just discussed, I mentioned that with a nonlinear method we can obtain a curve that perfectly divides the two classes, for example by using the kernel functions described next.

    We can map the data from the original linear space into a higher-dimensional space and, in that high-dimensional linear space, separate it with a hyperplane. Here is an example to illustrate how raising the dimension of the space helps us classify.

    The figure below shows a typical linearly inseparable situation:

[Figure: a typical linearly inseparable case, with two ellipse-like point sets]

    But when we map these two ellipse-like point sets into a higher-dimensional space, things change. Using the mapping function (given as an image in the original post) to send the points in the plane of the figure above into a three-dimensional space (z1, z2, z3), and then rotating the mapped coordinates, we obtain a linearly separable point set.

[Animation: the mapped 3-D points, rotated to show that a plane now separates the two classes]
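To make the dimension-raising idea concrete, here is a minimal sketch; the specific feature map is an assumption of mine (a standard choice for ring-shaped data), not necessarily the one used in the original figure:

```python
# A minimal sketch: 2-D points forming two concentric rings are not linearly
# separable, but after the (assumed) map phi(x1, x2) = (x1^2, x2^2, sqrt(2)*x1*x2)
# into 3-D, a plane separates them.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.0, random_state=0)

def phi(X):
    # Explicit map into (z1, z2, z3).
    return np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1]])

linear_2d = SVC(kernel="linear").fit(X, y)
linear_3d = SVC(kernel="linear").fit(phi(X), y)
print("training accuracy in the original 2-D space:", linear_2d.score(X, y))        # around chance level
print("training accuracy after mapping to 3-D:     ", linear_3d.score(phi(X), y))   # should be 1.0
```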

       To use another, more philosophical example: there are no two identical objects in the world. For any two objects, we can make them distinguishable in the end by adding dimensions. Take two books: in the two dimensions (color, content) they may be the same, so we can add the dimension of author; if that is not enough, we can also add page count, owner, place of purchase, notes in the margins, and so on. When the number of dimensions grows to infinity, it must be possible to separate any two objects.

    Recall the expression for the dual problem just obtained:

    max_α  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)
    s.t.   0 ≤ α_i ≤ C,   Σ_i α_i y_i = 0

    (in the original figure, the inner product (x_i · x_j) is the part highlighted in red)

    We can transform the red part by replacing the inner product with a kernel, k(x_i, x_j) = Φ(x_i) · Φ(x_j). What this does is map the linear space into a high-dimensional space. There are many choices of k(x, x_j); the following two are fairly typical:

    k(x1, x2) = ((x1 · x2) + R)^d

    k(x1, x2) = exp( -||x1 - x2||^2 / (2σ^2) )

    The first kernel above is called the polynomial kernel, and the second is called the Gaussian kernel; the Gaussian kernel even maps the original space into an infinite-dimensional space. In addition, kernel functions have some nice properties, for example they add little extra computation compared with the linear case, and so on; we will not go deeper here. In general, different kernel functions may give different results for the same problem, and one usually has to try them.
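To close, here is a minimal assumed sketch that writes these two kernels out directly and checks the kernel trick numerically for the degree-2 polynomial kernel; the parameter names R, d, and sigma follow the formulas above, and in scikit-learn these kernels correspond to kernel="poly" and kernel="rbf" (with gamma playing the role of 1/(2σ^2)):

```python
# A minimal sketch of the polynomial and Gaussian kernels, plus a numeric check
# that, for d = 2 and R = 0, the polynomial kernel equals the dot product of the
# explicit map phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2) -- the kernel trick computes
# the high-dimensional inner product without ever performing the mapping.
import numpy as np

def polynomial_kernel(x1, x2, R=1.0, d=2):
    return (np.dot(x1, x2) + R) ** d

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
phi = lambda v: np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])
print(polynomial_kernel(x, z, R=0.0, d=2), "==", np.dot(phi(x), phi(z)))   # both equal 1
print(gaussian_kernel(x, z, sigma=1.0))
```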

 

5. References

(1) Zhou Zhihua's "watermelon book" (Machine Learning). However, it omits many principles and intermediate steps, so it is not very easy to follow.

(2) Blog: a very thorough post on the three levels of understanding SVM, its derivation, and its principles: https://blog.csdn.net/macyang/article/details/38782399

 

This article is reprinted from http://www.cnblogs.com/LeftNotEasy/archive/2011/05/02/basic-of-svm.html
