Machine Learning: Support Vector Machine (SVM)

1. Background

There are many approaches to classification:

  • Decision tree
    • Attributes of instances are nominal data
    • The objective function is discrete
  • K-nearest neighbor
    • Instances are points in a (e.g. Euclidean) space
  • Support vector machine
    • Instances are points in a (e.g. Euclidean) space
    • Known as a maximum margin classifier
    • Originally proposed for classification and soon applied to regression and time series prediction
    • One of the most effective supervised learning methods
    • Has been used as a strong baseline for text processing approaches

2. Linear Support Vector Machine

2.1 Max margin linear classifier

2.1.1 Problem:

Given a set of training samples {(x_1, y_1), ..., (x_n, y_n)}, with labels y_i \in \{+1, -1\}.

Define a classification hyperplane: points above the hyperplane are classified as positive examples, and points below it are classified as negative examples.

2.1.2 Linear classifiers:

Linear hyperplane: w_1 x_1 + w_2 x_2 + ... + w_n x_n + b = 0

In the linearly separable case, there are infinitely many hyperplanes that can do the job.

Any of these lines would be fine... but which is the best one?

Functional margin and geometrical margin

Definition: functional margin. Once the separating hyperplane w \cdot x + b = 0 is fixed, |w \cdot x + b| can be used to represent how far a point x is from the hyperplane. The functional margin of a sample (x_i, y_i), denoted \hat{\gamma}_i, is:

\hat{\gamma}_i = y_i (w \cdot x_i + b)

When the classification is correct, the functional margin equals |w \cdot x_i + b|.

Definition: the functional margin from the hyperplane to the training set. It is the minimum of the functional margins between the hyperplane w \cdot x + b = 0 and all sample points (x_i, y_i) in the training set:

\hat{\gamma} = \min_i \hat{\gamma}_i

But there is a problem with the functional margin defined this way: if we scale w and b proportionally (for example, change them to 2w and 2b), the functional margin |f(x)| becomes twice the original value, even though the hyperplane itself has not changed. So the functional margin alone is not enough. In fact, we can add a constraint on the normal vector w, which leads to the concept of the geometric margin, the quantity that really measures the distance from a point to the hyperplane.

Definition: geometric margin

The geometric margin is the functional margin divided by ||w||, that is:

\tilde{\gamma} = \frac{\hat{\gamma}}{||w||} = \frac{|f(x)|}{||w||}

It represents the true distance from a point to the hyperplane.
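A tiny numerical illustration (the hyperplane and sample below are made up): compute the functional and geometric margin for one point, and check that scaling (w, b) changes the former but not the latter.

```python
import numpy as np

# Made-up hyperplane w·x + b = 0 and a single labeled sample.
w = np.array([3.0, 4.0])
b = -5.0
x_i = np.array([2.0, 1.0])
y_i = 1

def margins(w, b, x, y):
    f_x = np.dot(w, x) + b
    functional = y * f_x                         # \hat{gamma}_i = y_i (w·x_i + b)
    geometric = functional / np.linalg.norm(w)   # \tilde{gamma}_i = \hat{gamma}_i / ||w||
    return functional, geometric

print(margins(w, b, x_i, y_i))          # (5.0, 1.0)
print(margins(2 * w, 2 * b, x_i, y_i))  # (10.0, 1.0): functional margin doubles, geometric margin does not
```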

Definition: maximum margin linear classifier: the linear classifier with the maximum margin.

We introduce two additional hyperplanes w_1 x_1 + w_2 x_2 + ... + w_n x_n + b = \pm 1, parallel to the separating hyperplane w_1 x_1 + w_2 x_2 + ... + w_n x_n + b = 0.

The distance between the two new hyperplanes is called the margin, and the margin is \frac{2}{||w||_2}.
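As a quick check (a standard derivation not spelled out in the original): take any point x_+ on w \cdot x + b = +1 and any point x_- on w \cdot x + b = -1; the perpendicular distance between the two planes is the projection of x_+ - x_- onto the unit normal:

\text{margin} = \frac{w \cdot (x_+ - x_-)}{||w||_2} = \frac{(1 - b) - (-1 - b)}{||w||_2} = \frac{2}{||w||_2}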

Therefore, the problem is:

\max_{w,b} \frac{2}{||w||_2} \quad s.t. \quad y_i(w_1 x_i^{(1)} + w_2 x_i^{(2)} + ... + w_n x_i^{(n)} + b) \geq 1, \quad i = 1, ..., n

which means: maximize the margin, subject to the condition that all the points are classified correctly.

Or equivalently:

\min_{w,b} \frac{1}{2} ||w||_2^2 \quad s.t. \quad y_i(w_1 x_i^{(1)} + ... + w_n x_i^{(n)} + b) \geq 1, \quad i = 1, ..., n

Although it seems that the margin is decided only by w, b also affects the margin implicitly via its impact on w in the constraints.
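A minimal sketch of this optimization using scikit-learn's SVC on made-up toy data (a very large C approximates the hard-margin problem; the data and parameter values here are illustrative, not from the original):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data (two clusters).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard-margin linear SVM.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # learned normal vector
b = clf.intercept_[0]     # learned bias
margin = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin =", margin)
print("support vectors:\n", clf.support_vectors_)
```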

2.2 Dual problem formulation

     Optimization:

  • \min_{w,b} \frac{1}{2} ||w||_2^2 (maximize the geometric margin)
  • s.t. y_i(w^T x_i + b) \geq 1, i = 1, ..., n (subject to the condition that all points are classified correctly)

How to solve such optimization problems?

Define the Lagrangian: add a Lagrange multiplier \alpha_i \geq 0 to each constraint and fold the constraints into the objective function, so that the whole problem can be expressed with a single function:

L(w,b,\alpha) = \frac{1}{2} ||w||^2 - \sum_{i=1}^n \alpha_i \left( y_i(w^T x_i + b) - 1 \right)

The original (primal) way to solve this problem:

  • First fix w and b, and vary only \alpha to maximize L(w,b,\alpha). It is easy to see that:
    • When some constraint is not satisfied, there is some y_i(w^T x_i + b) - 1 < 0. In that case we only need to let the corresponding \alpha_i go to infinity to maximize L(w,b,\alpha); the maximum is \infty, and the next minimization step becomes impossible.
    • When all the constraints are satisfied, every y_i(w^T x_i + b) - 1 \geq 0. In that case we only need to set every \alpha_i = 0 to maximize L(w,b,\alpha); the maximum is \frac{1}{2} ||w||^2, which is exactly what we want to minimize in the next step.
  • Next, vary w and b to minimize L(w,b,\alpha). If the constraints of the previous step are all satisfied, this amounts to minimizing \frac{1}{2} ||w||^2.

Solving this directly means facing the two parameters w and b under inequality constraints, which is hard to handle. Instead, we can exchange the positions of the minimum and the maximum; the new problem after the exchange is the dual problem of the original one. When certain conditions are satisfied, the two have the same optimal value, so the original problem can be solved indirectly by solving the dual problem.
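In symbols (a standard way to write the swap just described, not written out in the original):

p^* = \min_{w,b} \max_{\alpha_i \geq 0} L(w,b,\alpha) \ \geq \ \max_{\alpha_i \geq 0} \min_{w,b} L(w,b,\alpha) = d^*

Weak duality always gives d^* \leq p^*; for this convex problem, under conditions such as the KKT conditions, strong duality holds (d^* = p^*), which is the "certain conditions" referred to above.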

Solving the dual problem:

  • First fix \alpha, and vary only w and b to minimize L(w,b,\alpha). This requires setting the partial derivatives of L with respect to w and b to zero, that is, ∂L/∂w = 0 and ∂L/∂b = 0:

        \frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^n \alpha_i y_i = 0

        Substitute the above results back into L to get:

        L(w,b,\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j

  • Then vary \alpha to maximize L(w,b,\alpha). Written equivalently as a minimization, the problem becomes:

\min_{\alpha} \left( \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^n \alpha_i \right) \quad s.t. \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad \alpha_i \geq 0 \ (i = 1, 2, ..., n)

In the end, the SVM only needs the support vectors, i.e. the few points most relevant to the classification (those with \alpha_i > 0), to determine the classifier.
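As an illustration only (real SVM libraries use specialized QP or SMO solvers instead), the dual problem above can be handed to a generic solver; the toy data below are made up:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up, linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Q_ij = y_i y_j x_i . x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T

def dual_objective(alpha):
    # (1/2) sum_ij alpha_i alpha_j y_i y_j x_i.x_j  -  sum_i alpha_i
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * n                               # alpha_i >= 0

res = minimize(dual_objective, x0=np.zeros(n), bounds=bounds, constraints=constraints)
alpha = res.x

# Recover w and b from the optimality conditions.
w = ((alpha * y)[:, None] * X).sum(axis=0)   # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                            # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)               # b from any support vector: y_i (w.x_i + b) = 1

print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", b)
```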

2.3 Linearly non-separable case

In the non-separable case there will be errors; we cannot demand that all training samples be classified correctly. Instead, we minimize ||w||_2 together with the training classification error.

2.3.1 Loss functions: 0/1 loss and hinge loss

Recall that a correct prediction satisfies y_i(w_1 x_i^{(1)} + ... + w_n x_i^{(n)} + b) \geq 1. Define z_i = y_i(w_1 x_i^{(1)} + ... + w_n x_i^{(n)} + b).

The 0/1 loss only penalizes z less than 0, while the hinge loss penalizes z less than 1: \ell_{hinge}(z) = \max(0, 1 - z).

There are also other loss functions that can replace the 0/1 loss.
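A tiny sketch comparing the two losses on a few made-up margin values z_i:

```python
import numpy as np

# Compare the 0/1 loss and the hinge loss on a few margin values z_i = y_i f(x_i).
z = np.array([-1.5, -0.2, 0.0, 0.4, 1.0, 2.0])

zero_one_loss = (z < 0).astype(float)   # penalizes only misclassified points (z < 0)
hinge_loss = np.maximum(0.0, 1.0 - z)   # also penalizes correct points inside the margin (0 <= z < 1)

for zi, l01, lh in zip(z, zero_one_loss, hinge_loss):
    print(f"z = {zi:5.2f}   0/1 loss = {l01:.1f}   hinge loss = {lh:.2f}")
```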

2.3.2 Introducing slack variables \epsilon_i \geq 0

The objective now maximizes the geometric margin while minimizing the sum of the slack variables (the full objective is written out after the list below). We allow some training samples to fall between the two classification planes (0 < \epsilon_i < 1), and even allow some training samples to be misclassified (\epsilon_i > 1):

  • \epsilon_i = 0: the sample falls on the classification boundary or is correctly classified
  • 0 < \epsilon_i < 1: the sample falls between the two classification boundaries but is still correctly classified, such as \epsilon_2 in the figure
  • \epsilon_i > 1: the sample is misclassified, such as \epsilon_{11} and \epsilon_7 in the figure
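Written out (a standard formulation; C is the trade-off parameter between margin width and total slack):

\min_{w,b,\epsilon} \ \frac{1}{2} ||w||_2^2 + C \sum_{i=1}^n \epsilon_i \quad s.t. \quad y_i(w^T x_i + b) \geq 1 - \epsilon_i, \quad \epsilon_i \geq 0, \quad i = 1, ..., n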

2.3.3 The dual problem in the linearly non-separable case

Here, the support vectors (those with \alpha_i > 0) include the points that fall on the classification boundary, those that fall between the two classification boundaries, and those that are misclassified.
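For reference, the resulting dual (a standard derivation from the soft-margin problem above, not spelled out here) differs from the separable case only in the upper bound on \alpha_i:

\min_{\alpha} \left( \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^n \alpha_i \right) \quad s.t. \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C \ (i = 1, 2, ..., n)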

 

3. Kernel Support Vector Machine

3.1 Implicit mapping of feature space: kernel function

In the linearly non-separable case, the support vector machine still performs its computation in the low-dimensional input space, but uses a kernel function to map the input space into a high-dimensional feature space, and then constructs the optimal separating hyperplane in that high-dimensional feature space, thereby separating nonlinear data that cannot be separated in the original plane. As shown in the figure, data that cannot be divided in the two-dimensional space become separable after being mapped into the three-dimensional space:

3.2 Kernel function: how to deal with non-linear data

Let's look at an example of a kernel function. The two classes of data shown in the following figure are distributed in the shape of two circles. Such data are inherently linearly non-separable. How do we separate the two classes in this case?

In fact, the data set in the figure above is generated from two circles with different radii plus a small amount of noise. Therefore an ideal boundary should be a "circle" rather than a line (hyperplane). If we use X_1 and X_2 to denote the two coordinates of this two-dimensional plane, the equation of such a circular boundary can be written in the general form:

a_1 X_1 + a_2 X_1^2 + a_3 X_2 + a_4 X_2^2 + a_5 X_1 X_2 + a_6 = 0

Note this form: if we construct a five-dimensional space whose five coordinates are Z_1 = X_1, Z_2 = X_1^2, Z_3 = X_2, Z_4 = X_2^2, Z_5 = X_1 X_2, then in the new coordinates the equation above can be written as:

\sum_{i=1}^5 a_i Z_i + a_6 = 0

In the new coordinates Z, this is exactly the equation of a hyperplane! That is to say, if we define a mapping \phi: \mathbb{R}^2 \to \mathbb{R}^5 that maps (X_1, X_2) to (Z_1, ..., Z_5) according to the rules above, the original data become linearly separable in the new space and can be processed with the linear classification algorithm derived earlier. This is the basic idea of the kernel method for handling nonlinear problems.

Before going further into the details of kernels, let's look at the intuitive picture of the example above after the mapping. Of course, neither you nor I can draw a 5-dimensional space, but because the data here were generated in a special way, the actual boundary equation only involves a few of these coordinates, so a mapping into a three-dimensional space is enough to visualize it. The figure below shows the result of mapping from 2D to 3D:
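A small sketch of this idea on synthetic circle data; the explicit three-coordinate mapping below is an illustrative stand-in for the construction described above, not code from the original post:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

# Explicit low-dimensional mapping inspired by the conic equation above:
# phi(X1, X2) = (X1^2, X2^2, X1*X2). In this new space the circular boundary becomes a plane.
Z = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]])

linear_2d = SVC(kernel="linear").fit(X, y)
linear_3d = SVC(kernel="linear").fit(Z, y)

print("accuracy in original 2D space:", linear_2d.score(X, y))    # well below 1.0
print("accuracy after explicit mapping:", linear_3d.score(Z, y))  # close to 1.0
```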

 

3.3 Kernel function and its construction

The function that calculates the inner product of two vectors in the implicitly mapped space is called the kernel function.

If we have a function k(x,y) that equals \langle \phi(x), \phi(y) \rangle, then we do not need to represent the feature mapping \phi explicitly. Such a k(x,y) is called a kernel function.
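A quick numerical sanity check of this definition for one classic textbook case (the homogeneous quadratic kernel; the explicit map phi below is a known construction, not something given in the original post):

```python
import numpy as np

# Check k(x, y) = <phi(x), phi(y)> for the quadratic kernel k(x, y) = (x . y)^2 in 2D,
# with the explicit map phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2).
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

def k(x, y):
    return np.dot(x, y) ** 2

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

print(k(x, y))                  # kernel value computed in the original 2D space
print(np.dot(phi(x), phi(y)))   # same value, computed via the explicit 3D mapping
```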

Among the common kernels, the Gaussian kernel maps the original space to an infinite-dimensional space. However, if its bandwidth parameter is chosen very large, the weights on the high-order features decay very fast, so numerically the mapping is approximately equivalent to a low-dimensional subspace; conversely, if the parameter is chosen very small, any data can be mapped to be linearly separable, which is not necessarily a good thing, since it may cause very serious overfitting. In general, though, the Gaussian kernel is quite flexible through this parameter and is one of the most widely used kernel functions. The example shown in the figure below maps low-dimensional, linearly non-separable data into a high-dimensional space through the Gaussian kernel function:
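A small experiment sketch (made-up data and parameter values) of this flexibility using scikit-learn's RBF SVM; note that scikit-learn's gamma is 1/(2\sigma^2), so a large gamma corresponds to a small Gaussian width (very flexible, prone to overfitting) and a small gamma to a large width (close to a very smooth, almost low-dimensional model):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy circle data: linearly non-separable, a natural test for the RBF kernel.
X, y = make_circles(n_samples=400, noise=0.15, factor=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    print(f"gamma = {gamma:6.2f}   train acc = {clf.score(X_train, y_train):.3f}"
          f"   test acc = {clf.score(X_test, y_test):.3f}")
```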

To sum up:

Supplement: a quiz question about SVM

Regarding the support vector machine (SVM), which of the following statements is wrong? (Answer: C)

  A. The L2 regularization term serves to maximize the classification margin, so that the classifier has stronger generalization ability
  B. The hinge loss function serves to minimize the empirical classification error
  C. The classification margin is 1/||w||, where ||w|| is the norm of the vector w
  D. When the parameter C is smaller, the classification margin is larger, there are more classification errors, and the model tends to underfit

(C is wrong because, as derived above, the margin is 2/||w||, not 1/||w||.)

When the number of features is larger than the number of samples, what classifier should we choose?

Answer: a linear classifier, because when the dimensionality is high the data are generally sparse in that space and are very likely to be linearly separable.

 

 

Reference materials:

[1] THU 2020 Spring Machine Learning Introduction Courseware

[2] https://blog.csdn.net/v_july_v/article/details/7624837
