1. Background
There are many approaches to classification:
- Decision tree
- Attributes of instances are nominal data
- The objective function is discrete
- K-nearest neighbor
- Instances are points in a (e.g. Euclidean) space
- Support vector machine
- Instances are points in a (e.g. Euclidean) space
- Known as a maximum margin classifier
- Originally proposed for classification and soon applied to regression and time series prediction.
- One of the most effective supervised learning methods.
- Has been used as a strong baseline for text processing approaches
2. Linear Support Vector Machine
2.1 Max margin linear classifier
2.1.1 Problem:
Given a set of training samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{+1, -1\}$,
define the classification hyperplane $w^\top x + b = 0$: points above the plane ($w^\top x + b > 0$) are classified as positive examples, and points below the plane ($w^\top x + b < 0$) are classified as negative examples.
2.1.2 Linear classifiers:
Linear hyperplane
Consider the linearly separable case: there are infinitely many hyperplanes that can do the job.
Any of these lines would be fine… but which is the best one?
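Before answering, note that every candidate hyperplane defines the same kind of decision rule. A minimal sketch in Python (the weights here are arbitrary, chosen only for illustration):

```python
import numpy as np

# A toy hyperplane w·x + b = 0 in 2D (illustrative weights, not learned).
w = np.array([1.0, -1.0])
b = 0.5

def classify(x):
    """Label a point by which side of the hyperplane it falls on."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([2.0, 0.0])))   # w·x + b = 2.5 > 0, so +1
print(classify(np.array([0.0, 2.0])))   # w·x + b = -1.5 < 0, so -1
```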
Functional margin and geometrical margin
Definition: functional margin of a point. Once the separating hyperplane $w^\top x + b = 0$ is fixed, how far a point lies from it can be measured. The functional margin of a sample $(x_i, y_i)$, denoted $\hat{\gamma}_i$, is:
$$\hat{\gamma}_i = y_i (w^\top x_i + b)$$
When the classification is correct, the functional margin is positive.
Definition: functional margin of the training set. The minimum of the functional margins over all sample points in the training set is the functional margin of the hyperplane with respect to the training set:
$$\hat{\gamma} = \min_{i=1,\dots,n} \hat{\gamma}_i$$
But there is a problem with the functional margin defined this way: if we scale $w$ and $b$ proportionally (e.g. change them to $2w$ and $2b$), the functional margin doubles, even though the hyperplane itself has not changed. So the functional margin alone is not enough. In fact, we can add a normalization constraint on the normal vector $w$, which leads to the concept of the geometric margin, which truly measures the distance from a point to the hyperplane.
Definition: geometric margin.
The geometric margin is the functional margin divided by $\|w\|$:
$$\gamma_i = \frac{\hat{\gamma}_i}{\|w\|} = \frac{y_i (w^\top x_i + b)}{\|w\|}$$
which represents the actual distance from the point to the hyperplane.
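The two margins can be checked numerically. The sketch below (with an arbitrarily chosen hyperplane and point) also shows that rescaling $(w, b)$ changes the functional margin but not the geometric one:

```python
import numpy as np

# Functional and geometric margins of a labelled point (x, y)
# with respect to the hyperplane w·x + b = 0.
def functional_margin(w, b, x, y):
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    return functional_margin(w, b, x, y) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0        # ||w|| = 5
x, y = np.array([3.0, 4.0]), 1           # a correctly classified point

print(functional_margin(w, b, x, y))     # 20.0
print(geometric_margin(w, b, x, y))      # 4.0

# Scaling (w, b) -> (2w, 2b) doubles the functional margin
# but leaves the geometric margin (the true distance) unchanged:
print(functional_margin(2*w, 2*b, x, y)) # 40.0
print(geometric_margin(2*w, 2*b, x, y))  # 4.0
```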
Definition: maximum margin linear classifier: the linear classifier with the maximum margin.
We introduce two additional hyperplanes parallel to the separating plane, $w^\top x + b = +1$ and $w^\top x + b = -1$, passing through the closest points of each class.
The distance between these two new hyperplanes is called the margin, and the margin is $\frac{2}{\|w\|}$.
Therefore, the problem is:
$$\max_{w, b} \; \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \; i = 1, \dots, n$$
which means: maximize the margin subject to the condition that all the points are classified correctly.
Or equivalently:
$$\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1, \; i = 1, \dots, n$$
Although it seems that the margin is decided only by $w$, $b$ also affects the margin implicitly via its impact on $w$ in the constraints.
2.2 Dual problem formulation
Optimization:
- $\min_{w, b} \frac{1}{2} \|w\|^2$ (maximize the geometric margin)
- s.t. $y_i (w^\top x_i + b) \ge 1, \; i = 1, \dots, n$ (the condition that all points are correctly classified)
How to solve such optimization problems?
Define the Lagrangian: add a Lagrange multiplier $\alpha_i \ge 0$ to each constraint and fold the constraints into the objective, so that the whole problem can be expressed with a single function:
$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$$
The original way to solve this problem is $\min_{w, b} \max_{\alpha_i \ge 0} L(w, b, \alpha)$:
- First fix $w$ and $b$ and vary only $\alpha$ to maximize $L$. It is easy to see:
- When some constraint is violated, i.e. $y_i (w^\top x_i + b) - 1 < 0$ for some $i$, letting the corresponding $\alpha_i \to \infty$ drives the maximum to $+\infty$, so such $(w, b)$ can never win the minimization of the next step.
- When all the constraints are satisfied, every term $\alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$ is non-negative, so the maximum is attained by setting each such term to $0$; the maximum value is then $\frac{1}{2} \|w\|^2$, which is exactly what we want to minimize in the next step.
- Next, vary $w$ and $b$ to minimize. If the constraints of the previous step are all satisfied, this amounts to minimizing $\frac{1}{2} \|w\|^2$.
Solving this directly means facing the two parameters $w$ and $b$ under inequality constraints, which is hard. Instead, we may swap the positions of the min and the max: the new problem after the swap, $\max_{\alpha} \min_{w, b} L(w, b, \alpha)$, is the dual problem of the original one. When certain conditions are met (e.g. Slater's condition, which holds here), the two are equal, so the original problem can be solved indirectly by solving the dual.
Solving the dual problem:
First fix $\alpha$, and vary only $w$ and $b$ to minimize $L$. This requires taking the partial derivatives with respect to $w$ and $b$ and setting $\partial L/\partial w$ and $\partial L/\partial b$ to zero:
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$$
Substituting these results back into $L$, we get:
$$L = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$$
- Afterwards vary $\alpha$ to maximize. The problem becomes:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$$
$$\text{s.t.} \quad \alpha_i \ge 0, \; i = 1, \dots, n, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$
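The dual is a quadratic program and can be handed to any off-the-shelf constrained optimizer. Below is a minimal sketch using `scipy.optimize.minimize` (SLSQP) on a hypothetical four-point toy set; the data, and the values recovered, are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two points per class on the line x2 = x1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Gram-style matrix Q_ij = y_i y_j x_i·x_j from the dual objective.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(a):
    # Negative of W(a) = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j x_i·x_j
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(y)), method="SLSQP",
               bounds=[(0, None)] * len(y),
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

# Recover w from the stationarity condition, and b from a support vector.
w = (alpha * y) @ X
sv = np.argmax(alpha)            # a point with alpha_i > 0
b = y[sv] - w @ X[sv]
print(w, b)                      # w ≈ (0.25, 0.25), b ≈ 0 for this toy set
```

Only the two closest points (the support vectors) end up with $\alpha_i > 0$; the other multipliers are (numerically) zero, matching the remark above.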
The SVM approach thus learns the classifier from the support vectors only, i.e. the few points (those with $\alpha_i > 0$) most relevant to the classification.
2.3 Linearly non-separable case
In the non-separable case there must be errors; after all, we cannot demand that all training samples be classified correctly. We minimize $\frac{1}{2}\|w\|^2$ as well as the training classification error.
2.3.1 Loss functions: 0/1 loss and hinge loss
Recall that a correct prediction means $y(w^\top x + b) > 0$; define $z = y(w^\top x + b)$:
- 0/1 loss: $\ell_{0/1}(z) = \mathbb{1}[z < 0]$
- Hinge loss: $\ell_{\text{hinge}}(z) = \max(0, 1 - z)$
The 0/1 loss only penalizes $z < 0$, while the hinge loss penalizes $z < 1$.
More loss functions (e.g. the logistic loss and the exponential loss) can also replace the 0/1 loss.
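A minimal side-by-side comparison of the two losses in Python (the sample values of $z$ are arbitrary):

```python
import numpy as np

# Both losses are functions of z = y * (w·x + b).
def zero_one_loss(z):
    return (z < 0).astype(float)     # penalizes only misclassification

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)  # also penalizes 0 <= z < 1 (inside margin)

z = np.array([-1.0, 0.5, 2.0])       # misclassified, inside margin, confident
print(zero_one_loss(z))              # losses 1, 0, 0
print(hinge_loss(z))                 # losses 2, 0.5, 0
```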
2.3.2 Introducing slack variables
We introduce slack variables $\xi_i \ge 0$ and relax the constraints to $y_i (w^\top x_i + b) \ge 1 - \xi_i$. The new objective
$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$$
maximizes the geometric margin while minimizing the sum of the slack variables. We now allow some training samples to fall between the two classification planes, and even allow some training samples to be misclassified:
- $\xi_i = 0$: the sample falls on the classification boundary or is correctly classified outside the margin.
- $0 < \xi_i \le 1$: the sample falls between the two classification boundaries, but is still correctly classified.
- $\xi_i > 1$: the sample is misclassified.
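One simple way to minimize the soft-margin objective numerically is subgradient descent on the equivalent hinge-loss form $\frac{1}{2}\|w\|^2 + C\sum_i \max(0, 1 - y_i(w^\top x_i + b))$. This is a sketch, not the dual solver described in these notes, and the toy data (with one deliberately mislabelled point) are invented for illustration:

```python
import numpy as np

def train_soft_margin_svm(X, y, C=1.0, lr=0.01, epochs=2000):
    """Subgradient descent on 0.5*||w||^2 + C*sum_i max(0, 1 - y_i(w·x_i + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                 # points with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Four clean points plus one mislabelled outlier (index 4): a slack
# variable with xi > 1 absorbs its error instead of breaking the fit.
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0], [-2.5, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
w, b = train_soft_margin_svm(X, y, C=1.0)
print(w, b)
```

The learned classifier keeps the four clean points on the correct sides and simply gives up on the outlier, exactly the $\xi_i > 1$ case above.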
2.3.3 The dual problem of the linearly non-separable case
The dual takes the same form as before, except that the multipliers are now box-constrained: $0 \le \alpha_i \le C$. Here, the support vectors (those with $\alpha_i > 0$) include points that fall on the boundary, points that fall between the two classification boundaries, and points that are misclassified.
3. Kernel Support Vector Machine
3.1 Implicit mapping of feature space: kernel function
In the linearly non-separable case, the support vector machine first completes the computation in the low-dimensional space, then maps the input space to a high-dimensional feature space through the kernel function, and finally constructs the optimal separating hyperplane in that high-dimensional feature space, thereby separating nonlinear data that are hard to separate in the original plane. As shown in the figure, a set of data that cannot be divided in two-dimensional space is mapped to three-dimensional space:
3.2 Kernel function: how to deal with non-linear data
Let's look at an example of a kernel function. The two classes of data shown in the following figure are distributed in the shape of two circles; such data are inherently linearly inseparable. How do we separate the two classes in this case?
In fact, the data set in the figure above is generated using two circles with different radii plus a small amount of noise. Therefore, an ideal boundary should be a "circle" instead of a line (hyperplane). If we use $X_1$ and $X_2$ to denote the two coordinates of this two-dimensional plane, we know that the equation of such a circle (a general conic) can be written in this form:
$$a_1 X_1 + a_2 X_1^2 + a_3 X_2 + a_4 X_2^2 + a_5 X_1 X_2 + a_6 = 0$$
Note the above form: if we construct a five-dimensional space whose coordinates take the values $Z_1 = X_1$, $Z_2 = X_1^2$, $Z_3 = X_2$, $Z_4 = X_2^2$, $Z_5 = X_1 X_2$, then the above equation in the new coordinates can be written as:
$$\sum_{i=1}^{5} a_i Z_i + a_6 = 0$$
With respect to the new coordinates $Z$, this is exactly the equation of a hyperplane! That is to say, if we make a mapping $\phi: \mathbb{R}^2 \to \mathbb{R}^5$ that maps $X$ to $Z$ according to the above rules, then the original data become linearly separable in the new space and can be processed with the linear classification algorithm we derived before. This is the basic idea of the kernel method for dealing with nonlinear problems.
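The circle example can be checked numerically. The sketch below keeps only the squared coordinates (a slice of the five-dimensional map, which suffices for circles centred at the origin) and shows that a linear threshold separates the mapped data; the radii, noise level, and threshold are arbitrary choices for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes on circles of radius 1 and 2 (plus mild noise): not
# linearly separable in the original 2D space.
def make_circle(radius, n=50):
    theta = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(0, 0.05, n)
    return np.column_stack([r * np.cos(theta), r * np.sin(theta)])

inner, outer = make_circle(1.0), make_circle(2.0)

# A slice of the 5D map from the text: keep the squared coordinates.
def phi(X):
    return np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2])

# In the mapped space, Z1 + Z2 = r^2, so the linear rule
# Z1 + Z2 < 2.25 (a hyperplane) separates the two circles.
z_inner, z_outer = phi(inner), phi(outer)
print((z_inner.sum(axis=1) < 2.25).all())   # True: inner class below the plane
print((z_outer.sum(axis=1) > 2.25).all())   # True: outer class above it
```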
Before describing the details of kernels further, let us look at the intuitive picture of the above example after the mapping. Of course, we cannot visualize a 5-dimensional space, but because the data here were generated in a special way, the actual separating hyperplane equation involves only three of the five coordinates, so mapping into a suitable three-dimensional space suffices. The figure below shows the result of mapping from 2D to 3D:
3.3 Kernel function and its construction
The function that calculates the inner product of two vectors in the implicitly mapped space is called the kernel function.
If we have a function $k(x, y)$ that equals $\langle \phi(x), \phi(y) \rangle$, then we never need to represent the features $\phi(x)$ explicitly; $k(x, y)$ is called the kernel function.
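A concrete instance: the quadratic kernel $k(x, y) = (x^\top y)^2$ corresponds to an explicit degree-2 feature map, which can be verified directly (the map $\phi$ below is one standard choice for 2D inputs):

```python
import numpy as np

# The quadratic kernel k(x, y) = (x·y)^2 computes, without ever forming it,
# the inner product in the feature space phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, y):
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

print(k(x, y))                   # (1*3 + 2*4)^2 = 121.0
print(np.dot(phi(x), phi(y)))    # the same value, via the explicit map
```

The kernel evaluates a 3-dimensional inner product at the cost of a 2-dimensional one; for higher-degree kernels or the Gaussian kernel, the implicit space is far larger than anything we could compute explicitly.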
Among these, the Gaussian kernel
$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$
maps the original space to an infinite-dimensional space. However, if the parameter $\sigma$ is chosen very large, the weights on the high-order features actually decay very fast, so numerically the map is approximately equivalent to a low-dimensional subspace; conversely, if $\sigma$ is chosen very small, any data can be mapped to be linearly separable. Of course, that is not necessarily a good thing, because it may bring a very serious overfitting problem. In general, though, the Gaussian kernel is quite flexible through its parameter, and it is one of the most widely used kernel functions. The example shown in the figure below maps low-dimensional linearly inseparable data to a high-dimensional space through the Gaussian kernel function:
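The effect of the width parameter $\sigma$ can be seen by evaluating the Gaussian kernel directly (the points and parameter values below are arbitrary):

```python
import numpy as np

# Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
def rbf(x, y, sigma):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

x = np.array([0.0, 0.0])
y = np.array([1.0, 1.0])

# Large sigma: every pair looks similar (kernel near 1), so the implicit
# map behaves almost like a low-dimensional, nearly linear one.
print(rbf(x, y, sigma=100.0))   # close to 1
# Small sigma: distinct points look nearly orthogonal in feature space,
# which makes anything separable but risks severe overfitting.
print(rbf(x, y, sigma=0.1))     # close to 0
```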
To sum up:
Supplement: questions about SVM
Regarding the support vector machine (SVM), which of the following statements is wrong? (Answer: C)
A. The L2 regularization term serves to maximize the classification margin, so that the classifier has stronger generalization ability
B. The hinge loss function serves to minimize the empirical classification error
C. The classification margin is $1/\|w\|$, where $\|w\|$ denotes the norm of the vector $w$
D. When the parameter C is smaller, the classification margin is larger, there are more classification errors, and the model tends to under-fit

When there are more features than data points, which classifier should we choose?
Answer: a linear classifier, because when the dimensionality is high, the data are generally sparse in that space and likely to be linearly separable.
Reference materials:
[1] THU 2020 Spring Machine Learning Introduction Courseware