Machine Learning (2) - Support Vector Machine (SVM)

1. What is the principle of SVM?

SVM is a binary classification model. Its basic form is a linear classifier that finds the separating hyperplane with the largest margin in the feature space (margin maximization is what distinguishes it from the perceptron).
It tries to find a hyperplane that splits the sample, separating the positive examples from the negative examples while making the margin between them as large as possible.
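In the linearly separable case, the learned separating hyperplane and the corresponding classification decision function can be written (in the standard formulation) as:

w \cdot x + b = 0, \qquad f(x) = \operatorname{sign}(w \cdot x + b)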

 
The basic idea of SVM can be summarized as follows: first transform the input space into a high-dimensional feature space through a nonlinear transformation, and then find the optimal separating surface, i.e., the maximum-margin classification surface, in this new space; the nonlinear transformation is realized by defining an appropriate inner-product kernel function. SVM was in fact proposed on the basis of the structural risk minimization principle of statistical learning theory, and it pursues two goals:

1) The two classes can be separated (minimum empirical risk)

2) The margin is maximized (minimum upper bound on the risk), i.e., the function with the minimum empirical risk is selected within the subset of functions that has the minimum guaranteed risk.

Three types of support vector machines are distinguished:
(1) When the training samples are linearly separable, a linear classifier is learned by hard-margin maximization, i.e., the linearly separable support vector machine;

(2) When the training data is approximately linearly separable, a slack variable is introduced, and a linear classifier, namely a linear support vector machine, is learned by maximizing the soft margin;

(3) When the training data is linearly inseparable, learn nonlinear support vector machines by using kernel tricks and soft margin maximization.

Note: one should be familiar with the mathematical derivation of the above SVMs: hard-margin maximization (geometric margin) -> the dual learning problem -> soft-margin maximization (introducing slack variables) -> the nonlinear support vector machine (kernel trick).

 

2. The main features of SVM

(1) Nonlinear mapping is the theoretical basis.
(2) Maximizing the classification margin is the core of the method.
(3) The support vectors are the result of the computation.
(4) It is a small-sample learning method.
(5) The final decision function is determined by only a small number of support vectors, which helps avoid the "curse of dimensionality".
(6) Because a small number of support vectors determine the final result, a large number of redundant samples can be removed; the algorithm is simple and robust (the robustness shows in three aspects).
(7) The learning problem can be expressed as a convex optimization problem, so the global minimum can be found.
(8) The model is controlled automatically by maximizing the margin, but the user must specify the kernel function type and introduce the slack variables.
(9) It is suitable for small samples and has excellent generalization ability (because the structural risk is minimized).
(10) The generalization error rate is low, classification is fast, and the results are easy to interpret.

Disadvantages: (1) it is hard to train on large-scale samples (the computation involves an m-by-m matrix, where m is the number of samples); (2) the classical formulation does not handle multi-class problems directly; (3) it is sensitive to missing data, to the choice of parameters, and to the choice of kernel function.

Why is SVM sensitive to missing data?

Missing data here means that some feature values of a sample are missing, so the feature vector is incomplete. SVM has no strategy for dealing with missing values (decision trees do). SVM expects the samples to be linearly separable in the feature space, so the quality of the feature space is very important to SVM's performance; missing feature values therefore degrade the training result.

 

3. Why does SVM use interval maximization?

When the training data are linearly separable, there exist infinitely many separating hyperplanes that correctly separate the two classes. The perceptron obtains a separating hyperplane by minimizing the misclassification loss, but in that case there are infinitely many solutions.

The linearly separable support vector machine obtains the optimal separating hyperplane by margin maximization, and in this case the solution is unique. Moreover, the classification produced by this separating hyperplane is the most robust and generalizes best to unseen instances.

One should then explain the geometric margin, the functional margin, and how w and b are obtained by minimizing 1/2 ||w||^2 subject to the functional-margin constraints, i.e., the learning algorithm of the linearly separable support vector machine and the origin of the maximum-margin method (see the formulation below).
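For reference, the standard hard-margin primal problem (assuming training points x_i with labels y_i in {-1, +1}, i = 1, ..., N) is:

\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2} \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N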

 

4. Why convert the original problem of solving SVM into its dual problem?

One reason is that the dual problem is often easier to solve. (When we search for the optimum in the presence of constraints, the constraints reduce the search space but make the problem harder to handle; to make it tractable, the objective function and the constraints are combined into a new function, the Lagrangian, and the optimum is then found through this function.)
Note: Lagrangian duality does not change the optimal solution, but it changes the complexity of the algorithm: the primal problem scales with the sample dimension, the dual problem with the number of samples. So for linear classification, where the sample dimension is usually smaller than the number of samples, the primal problem is solved (the liblinear default); for nonlinear classification, mapping to a higher dimension generally makes the dimension larger than the number of samples, so the dual problem is solved.

Second, the dual form naturally introduces the kernel function, which extends SVM to nonlinear classification problems.
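For reference, the standard dual problem obtained through the Lagrangian (soft-margin form with penalty parameter C; for the hard-margin case the upper bound C is simply dropped) is:

\max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\,(x_i \cdot x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C

The training data enter only through the inner products x_i · x_j, which is exactly what allows them to be replaced later by a kernel K(x_i, x_j).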

 

5. Interpret support vectors

Give the definition of a support vector for the linearly separable case and for the linearly inseparable (soft-margin) case (see Statistical Learning Methods).
(1) The equivalent characterizations of support vectors for the linearly separable SVM; (2) the equivalent characterizations of support vectors for the (soft-margin) linear SVM; (3) the differences and connections between the two definitions.

Why does SVM introduce kernel functions?

When the samples are linearly inseparable in the original space, the samples can be mapped from the original space to a higher-dimensional feature space, so that the samples are linearly separable in this feature space.

After the mapping, the dual problem keeps the same form, except that the inner product x_i · x_j is replaced by ϕ(x_i) · ϕ(x_j), i.e., by the kernel value K(x_i, x_j).

In learning and prediction, only the kernel function K(x, y) is defined, rather than the explicit mapping function ϕ. Because the dimension of the feature space may be very high or even infinite, it is difficult to compute ϕ(x) · ϕ(y) directly; conversely, it is easy to compute K(x, y) directly (i.e., to compute in the original low-dimensional space without explicitly writing out the mapped vectors).

Definition of kernel function: K(x,y)=<ϕ(x),ϕ(y)>, that is, the inner product in the feature space is equal to the result calculated by the kernel function K in the original sample space.

Any method other than SVM that represents computations as inner products of data points can be extended nonlinearly using kernel methods.
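A minimal numerical illustration of this point (a sketch, not from the original text; the degree-2 polynomial kernel and the toy vectors are assumptions chosen for simplicity): for K(x, y) = (x · y)^2 on 2-D inputs, the explicit feature map is ϕ(x) = (x1^2, sqrt(2)*x1*x2, x2^2), and computing K directly in the original space gives the same number as mapping first and taking the inner product in the feature space.

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

def phi(v):
    # explicit feature map of the degree-2 homogeneous polynomial kernel on R^2
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

k_explicit = phi(x) @ phi(y)   # inner product computed in the mapped feature space
k_trick = (x @ y) ** 2         # kernel computed directly in the original space

print(k_explicit, k_trick)     # both equal 121.0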

 

6. What is the specific formula of the SVM RBF kernel function? (The Gaussian kernel, also known as the Radial Basis Function (RBF) kernel; it can map the original features into an infinite-dimensional space.)
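The standard form (with bandwidth parameter σ, or equivalently gamma = 1/(2σ^2)) is:

K(x, z) = exp(-||x - z||^2 / (2σ^2)) = exp(-gamma * ||x - z||^2)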

 

The advantages of the RBF kernel
It is applicable to problems both large and small. Specifically: (1) the corresponding feature space is infinite-dimensional, and the linear kernel is a special case of it; (2) compared with the polynomial kernel, the RBF kernel has fewer parameters to determine; (3) for certain parameter settings it behaves similarly to the sigmoid kernel.

The Gaussian radial basis function is a kernel with strong locality, and its extrapolation ability weakens as the parameter σ increases.

This kernel maps the original space into an infinite-dimensional space. However, if σ is chosen large, the weights on the higher-order features decay very quickly, so the mapping is numerically close to a low-dimensional subspace; conversely, if σ is chosen very small, almost any data can be mapped so that it becomes linearly separable, which is not necessarily a good thing, because it may cause severe overfitting. In general, though, by tuning the parameter σ the Gaussian kernel is quite flexible, and it is one of the most widely used kernel functions.
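A small sketch of this trade-off (the synthetic dataset and the gamma values are illustrative assumptions; recall gamma plays the role of 1/(2σ^2)): a very small gamma behaves almost like a linear model, while a very large gamma fits the training set almost perfectly but generalizes worse.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for gamma in (0.01, 1.0, 100.0):   # small -> nearly linear; large -> very local, prone to overfitting
    clf = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(X_tr, y_tr)
    print(gamma, clf.score(X_tr, y_tr), clf.score(X_te, y_te))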

 

7. How does SVM handle multi-classification problems?

There are generally two approaches. One is the direct method: modify the objective function directly and combine the parameters of the multiple classification surfaces into a single optimization problem. It looks simple, but the amount of computation is very large.

The other is the indirect method: combine several binary classifiers. The typical schemes are one-vs-one and one-vs-rest.

One-vs-rest: train one classifier per class. Since SVM is a binary classifier, each binary classifier takes the target class as one class and all the remaining classes as the other class; for k classes this yields k classifiers. When a new sample arrives, it is tested with all k classifiers, and it is assigned to the class whose classifier gives the highest score. This method is not very effective and tends to have a relatively high bias.

The one-vs-one method trains a classifier for every pair of classes. With k classes this gives C(k, 2) = k(k-1)/2 classifiers. When a new sample arrives, it is tested with all of these classifiers; each decision casts a vote for one class, and the class with the most votes is taken as the class of the sample.
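A short sketch of both strategies with sklearn (the iris dataset is an illustrative assumption): SVC implements one-vs-one internally, and OneVsRestClassifier can wrap it to obtain one-vs-rest.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes

ovo = SVC(kernel='rbf', decision_function_shape='ovo').fit(X, y)   # C(3, 2) = 3 pairwise classifiers
ovr = OneVsRestClassifier(SVC(kernel='rbf')).fit(X, y)             # 3 one-vs-rest classifiers

print(ovo.decision_function(X[:1]).shape)   # (1, 3) pairwise decision values
print(ovr.decision_function(X[:1]).shape)   # (1, 3) one-vs-rest decision values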

 

8. The difference and connection between SVM and LR

Connections: (1) both are classification methods (binary classification); (2) both can add regularization terms.
Differences: (1) LR is a parametric model while SVM is a nonparametric model; (2) objective function: LR uses the logistic loss, SVM uses the hinge loss; (3) SVM depends only on the support vectors, while LR reduces the weight of points far from the decision boundary; (4) the LR model is simple and easy to understand, but its accuracy is lower and it may end up at a local optimum; SVM is harder to understand and optimize, but its accuracy is higher, it reaches the global optimum, and converting it to the dual problem simplifies the model and the computation; (5) what LR can do, SVM can also do (linearly separable problems), but what SVM can do, LR may not be able to do (linearly inseparable problems).

The relationship between kernel choice and the numbers of features and samples:
(1) the number of features is large, roughly comparable to the number of samples: use LR or a linear kernel; (2) the number of features is small and the number of samples is moderate: use the Gaussian kernel; (3) the number of features is small and the number of samples is very large: manually add features and then go back to case (1).

The kernel function is essentially a similarity measure on the input data: the input vectors give rise to a similarity matrix K (the Gram matrix / similarity matrix / kernel matrix), and K is symmetric positive semi-definite.

A necessary and sufficient condition for K(x, z) to be a positive definite kernel is that the Gram matrix corresponding to K(x, z) is a real positive semi-definite matrix.
Gram matrix: the matrix of pairwise inner products; for a data matrix X, both X^T X and X X^T are Gram matrices.
Positive semi-definite matrix: let A be a real symmetric matrix; A is positive semi-definite if x^T A x >= 0 for every real non-zero column vector x.
To check whether a given K is a positive definite kernel function, one must verify that the Gram matrix corresponding to K is positive semi-definite for every finite input set {x_i, ...}.
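A quick numerical sketch of that check (the RBF kernel, the random data, and the tolerance are illustrative assumptions): build the Gram matrix for a finite input set and verify that its eigenvalues are non-negative up to numerical error.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # a finite input set {x_i}

K = rbf_kernel(X, X, gamma=0.5)     # Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
eigvals = np.linalg.eigvalsh(K)     # eigenvalues of the symmetric matrix K

print(np.all(eigvals > -1e-10))     # True: K is positive semi-definite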

 

9. Which library does SVM use? What parameters can be adjusted for the SVM in Sklearn/libsvm?

It is implemented with sklearn, using the parameters of sklearn.svm.SVC. SVC itself is built on libsvm (PS: the quadratic-programming solver used in libsvm is SMO).

The training time of SVC grows at least quadratically with the number of training samples, so it is not suitable for more than about 10,000 samples.

For multi-class problems, SVC uses a one-vs-one voting mechanism, i.e., a classifier is built for every pair of classes, so training may take a relatively long time.

sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None)

Parameters:

- C: penalty parameter C of C-SVC. The default value is 1.0.

A larger C means a heavier penalty on the slack variables: the slack variables are pushed toward 0, misclassification is penalized more, and the model tends toward fully separating the training set. The accuracy on the training set is then high, but the generalization ability is weak. A small C reduces the penalty for misclassification, tolerates some errors by treating them as noise points, and gives stronger generalization ability.

- kernel: kernel function, default 'rbf'; can be 'linear', 'poly', 'rbf', 'sigmoid' or 'precomputed'.

  0 - linear: u'*v

  1 - polynomial: (gamma*u'*v + coef0)^degree

  2 - RBF: exp(-gamma*|u-v|^2)

  3 - sigmoid: tanh(gamma*u'*v + coef0)

The parameters associated with each kernel function are as follows:
1) For the linear kernel, there are no special parameters to set.
2) For the polynomial kernel, there are three parameters: -d sets the highest degree of the polynomial, i.e., the degree in the formula (default 3); -g sets gamma in the formula (default 1/num_features); -r sets coef0 in the formula (default 0).
3) For the RBF kernel, there is one parameter: -g sets gamma in the formula (default 1/num_features).
4) For the sigmoid kernel, there are two parameters: -g sets gamma in the formula (default 1/num_features); -r sets coef0 in the formula (default 0).

- degree: degree of the polynomial ('poly') kernel, default 3; ignored by the other kernels.

- gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The default is 'auto', which uses 1/n_features.

- coef0: constant term of the kernel function; only significant for 'poly' and 'sigmoid'.

- probability: whether to enable probability estimates. The default is False.

- shrinking: whether to use the shrinking heuristic. The default is True.

- tol: tolerance of the stopping criterion. The default is 1e-3.

- cache_size: size of the kernel cache in MB. The default is 200.

- class_weight: class weights, passed as a dictionary. Sets the parameter C of class i to weight*C (the C of C-SVC).

- verbose: whether to enable verbose output.

- max_iter: maximum number of iterations; -1 means no limit.

- decision_function_shape: 'ovo', 'ovr' or None. The default is None.

- random_state: seed used when shuffling the data; an int value.

The main parameters to tune are C, kernel, degree, gamma and coef0, for example with a small grid search as sketched below.
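A hedged sketch of such a tuning run (the iris dataset and the parameter grid are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    'C': [0.1, 1, 10, 100],           # penalty on the slack variables
    'gamma': [0.001, 0.01, 0.1, 1],   # RBF kernel width (larger -> more local)
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)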

 

10. SMO algorithm implements SVM

The basic idea is to decompose a large optimization problem into many small optimization problems. These small problems are usually easy to solve, and solving them sequentially gives exactly the same result as solving the whole problem at once.
Process: repeatedly pick a pair of Lagrange multipliers (alpha_i, alpha_j) that violate the KKT conditions, optimize the objective analytically with respect to these two variables while keeping all the others fixed (clipping to the box constraint 0 <= alpha <= C), update the bias b, and iterate until all multipliers satisfy the KKT conditions within a tolerance. A simplified sketch is given below.
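A simplified SMO sketch in Python (following the common "simplified SMO" teaching variant, which chooses the second multiplier at random rather than with the full working-set heuristic; the linear kernel and the random pairing are simplifying assumptions, so this is illustrative rather than a faithful libsvm implementation):

import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10):
    # X: (m, n) array of samples; y: (m,) array of labels in {-1, +1}.
    # Returns the Lagrange multipliers alphas and the bias b.
    m = X.shape[0]
    K = X @ X.T                      # linear-kernel Gram matrix
    alphas = np.zeros(m)
    b = 0.0
    passes = 0
    rng = np.random.default_rng(0)

    def f(i):                        # decision value for sample i
        return np.sum(alphas * y * K[:, i]) + b

    while passes < max_passes:
        changed = 0
        for i in range(m):
            E_i = f(i) - y[i]
            # work on alpha_i only if it violates the KKT conditions beyond tol
            if (y[i] * E_i < -tol and alphas[i] < C) or (y[i] * E_i > tol and alphas[i] > 0):
                j = rng.integers(m - 1)
                if j >= i:
                    j += 1           # pick a second index j != i at random
                E_j = f(j) - y[j]
                a_i_old, a_j_old = alphas[i], alphas[j]
                # box constraint [L, H] for alpha_j
                if y[i] != y[j]:
                    L, H = max(0.0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
                else:
                    L, H = max(0.0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]   # curvature along the constraint line
                if eta >= 0:
                    continue
                alphas[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alphas[j] - a_j_old) < 1e-5:
                    continue
                alphas[i] = a_i_old + y[i] * y[j] * (a_j_old - alphas[j])
                # update b so the KKT conditions hold for the two changed multipliers
                b1 = b - E_i - y[i] * (alphas[i] - a_i_old) * K[i, i] - y[j] * (alphas[j] - a_j_old) * K[i, j]
                b2 = b - E_j - y[i] * (alphas[i] - a_i_old) * K[i, j] - y[j] * (alphas[j] - a_j_old) * K[j, j]
                if 0 < alphas[i] < C:
                    b = b1
                elif 0 < alphas[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alphas, b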
