Frequently asked SVM interview questions

Reproduced from: blog.csdn.net/szlcw1

What is the principle of SVM?

SVM is a binary classification model. Its basic model is a linear classifier that finds the separating hyperplane with the maximum margin in feature space. (The maximum margin is what distinguishes it from the perceptron.)

(1) When the training samples are linearly separable, a linear classifier is learned by hard-margin maximization, i.e. the linearly separable support vector machine;

(2) When the training data are approximately linearly separable, slack variables are introduced and a linear classifier is learned by soft-margin maximization, i.e. the linear support vector machine;

(3) When the training data are not linearly separable, a nonlinear support vector machine is learned by using the kernel trick together with soft-margin maximization.

Note: you should be familiar with the mathematical derivation behind SVM: hard-margin maximization (geometric margin) -> the dual problem -> soft-margin maximization (introducing slack variables) -> the nonlinear support vector machine (kernel trick).

Why does SVM maximize the margin?

When the training data are linearly separable, there are infinitely many separating hyperplanes that can separate the two classes correctly.

The perceptron uses a misclassification-minimizing strategy to find a separating hyperplane, but in that case there are infinitely many solutions.

The linearly separable SVM instead seeks the separating hyperplane with the maximum margin, and that solution is unique. Moreover, this separating hyperplane produces the most robust classification results and the strongest generalization ability on unseen examples.

The discussion should then proceed to the functional margin, the geometric margin, and the relation between them, leading to minimizing (1/2)||w||^2 over w and b. That is the linearly separable support vector machine learning algorithm, i.e. the origin of the maximum margin method.
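In symbols, the standard hard-margin primal problem is:

```latex
\min_{w,\,b} \quad \frac{1}{2}\lVert w\rVert^{2}
\qquad \text{s.t.} \quad y_i\bigl(w\cdot x_i + b\bigr) \ge 1,\quad i=1,\dots,N
```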

Why transform the original (primal) SVM problem into its dual problem?

First, the dual problem is often easier to solve. (When we search for the optimum under constraints, the constraints narrow the search space but also make the problem more complex. To make the problem tractable, the approach is to fold the objective and all of the constraints into a single new function, the Lagrangian, and then find the optimum of that function.)

Second, the dual form makes kernel functions arise naturally, which extends SVM to nonlinear classification problems.
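Concretely, the Lagrangian mentioned above (standard form, hard-margin case) is:

```latex
L(w, b, \alpha) \;=\; \frac{1}{2}\lVert w\rVert^{2}
\;-\; \sum_{i=1}^{N}\alpha_i\bigl[\,y_i\bigl(w\cdot x_i + b\bigr) - 1\,\bigr],
\qquad \alpha_i \ge 0
```

Minimizing over w and b and substituting back yields a dual that depends on the data only through the inner products x_i · x_j, which is exactly where the kernel function enters.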

Why does SVM use kernel functions?

When the samples are not linearly separable in the original space, they can be mapped from the original space to a higher-dimensional feature space, so that they become linearly separable in that feature space.

After introducing the mapping φ, each inner product x_i · x_j in the dual problem becomes φ(x_i) · φ(x_j).
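In the standard soft-margin form, the dual with the kernel in place of the inner product reads:

```latex
\max_{\alpha} \quad \sum_{i=1}^{N}\alpha_i
\;-\; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\,y_i y_j\,K(x_i, x_j)
\qquad \text{s.t.} \quad \sum_{i=1}^{N}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le C
```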

In learning and prediction we only define the kernel function K(x, y), rather than explicitly defining the mapping function φ. The feature space may have a very high, even infinite, dimension, so computing φ(x) · φ(y) directly is difficult. In contrast, computing K(x, y) directly is easy (i.e., it is computed directly in the original low-dimensional space, without ever writing out the mapped vectors explicitly).

Definition of a kernel function: K(x, y) = <φ(x), φ(y)>, i.e., the inner product of two samples in the feature space equals the value computed by the kernel function K on those samples in the original input space.

Beyond SVM, any method whose computations can be expressed as inner products between data points can be extended to the nonlinear case with the kernel method.
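A minimal sketch of this idea (my own illustration, using the degree-2 polynomial kernel K(x, y) = (x · y)^2 and its explicit feature map):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D point x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, y):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Both routes give the same feature-space inner product (up to rounding):
print(np.dot(phi(x), phi(y)))  # explicit map through the 3-D feature space
print(poly_kernel(x, y))       # kernel trick, never leaves the 2-D space
```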

What is the specific formula of the SVM RBF kernel?

The Gaussian radial basis function kernel, K(x, y) = exp(-||x - y||^2 / (2σ^2)) (equivalently exp(-γ||x - y||^2) with γ = 1/(2σ^2)), is a strongly local kernel whose extrapolation ability weakens as the parameter σ increases.

This kernel maps the original space into an infinite-dimensional space. However, if σ is chosen very large, the weights on the higher-order features actually decay very quickly, so the result (to a numerical approximation) behaves like a low-dimensional subspace. Conversely, if σ is chosen very small, any data can be mapped so as to be linearly separable, which is not necessarily a good thing, since it can bring very serious overfitting. Overall, by tuning the parameter σ, the Gaussian kernel is highly flexible, and it is the most widely used kernel function.
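A short sketch of this trade-off (illustrative, not from the original post; note that sklearn's gamma plays the role of 1/(2σ^2), so a large gamma corresponds to a small σ):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Small sigma <=> large gamma: each point only "sees" close neighbours,
# so the training set is fit almost perfectly but generalization suffers.
for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:>6}: train={clf.score(X_tr, y_tr):.2f}, "
          f"test={clf.score(X_te, y_te):.2f}")
```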

Why is SVM sensitive to missing data?

Missing data here means that some feature values of a sample are missing, so the feature vector is incomplete. SVM has no strategy for handling missing values (decision trees do). SVM expects the samples to be linearly separable in feature space, so the quality of the feature space matters a great deal to SVM performance, and missing feature values will degrade the training result.
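A common workaround, shown here as an illustrative sketch with toy data, is to impute the missing values before training:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data with NaN marking missing feature values.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
y = np.array([0, 0, 1, 1])

# Impute with the column mean, scale, then fit the SVM.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      StandardScaler(),
                      SVC(kernel="rbf", gamma="scale"))
model.fit(X, y)
print(model.predict([[2.0, np.nan]]))
```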

Which library do you use for SVM? What parameters of the SVM in sklearn/libsvm can be tuned?

I use the sklearn implementation and set the parameters through sklearn.svm.SVC. This class is itself built on the libsvm implementation (PS: the algorithm libsvm uses to solve the quadratic programming problem is SMO).

The training time of SVC grows quadratically with the number of training samples, so it is not suitable for more than about 10,000 samples.

For multi-class classification, SVC uses a one-vs-one voting mechanism, which requires building a binary classifier for every pair of classes, so training takes a fairly long time.

sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None, random_state=None)

Parameters:

- C: the penalty parameter C of C-SVC, default 1.0.

The larger C is, the heavier the penalty on the slack variables: the slack variables are pushed toward 0, i.e. the penalty on misclassification increases, which tends toward classifying the whole training set correctly. Accuracy on the training set becomes very high, but generalization is weak. A smaller C reduces the penalty on misclassification, tolerates some errors by treating them as noise, and generalizes better.

- kernel: the kernel function, 'rbf' by default; can be 'linear', 'poly', 'rbf', 'sigmoid', or 'precomputed':

0 - linear: u'*v

1 - poly: (gamma*u'*v + coef0)^degree

2 - rbf: exp(-gamma*|u-v|^2)

3 - sigmoid: tanh(gamma*u'*v + coef0)

- degree: degree of the 'poly' polynomial kernel, default 3; ignored by all other kernels.

- gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. Default is 'auto', which uses 1/n_features.

- coef0: constant term of the kernel function; only significant for 'poly' and 'sigmoid'.

- probability: whether to enable probability estimates, default False.

- shrinking: whether to use the shrinking heuristic, default True.

- tol: tolerance of the stopping criterion, default 1e-3.

- cache_size: size of the kernel cache (in MB), default 200.

- class_weight: class weights, passed as a dictionary; sets the parameter C of each class to weight*C (the C of C-SVC).

- verbose: whether to enable verbose output.

- max_iter: maximum number of iterations; -1 means no limit.

- decision_function_shape: 'ovo', 'ovr' or None, default=None.

- random_state: seed used when shuffling the data, an int.

The main parameters to tune are: C, kernel, degree, gamma, coef0.
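An illustrative tuning sketch over those main parameters (the dataset and grid values are my own choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Search over the main knobs mentioned above: C, kernel and gamma.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", 0.1, 1.0],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```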

How does SVM handle multi-class classification problems?

There are two main approaches. One is the direct method: modify the objective function directly, merging the parameters of multiple classification surfaces into a single optimization problem. It looks simple, but the amount of computation is very large.

The other is the indirect method: combine several binary classifiers. The typical schemes are one-vs-rest and one-vs-one.

One-vs-rest: train one classifier per class. Since SVM is a binary classifier, each binary problem takes the target class as one class and all remaining classes as the other. In this way k classifiers are trained for k classes; when a new sample arrives, it is tested with all k classifiers, and it is assigned to the class whose classifier gives the highest score. This method does not work very well; its bias is relatively high.

The SVM one-vs-one method: train a classifier for every pair of classes. With k classes, C(k, 2) = k(k-1)/2 classifiers are trained in total. When a new sample arrives, it is tested with all C(k, 2) classifiers; each time the sample is judged to belong to a class, that class gets one vote, and the class with the most votes is taken as the sample's category (see the sketch below).
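In sklearn, SVC always trains one-vs-one internally; the decision_function_shape parameter only controls the shape of the decision values it returns. A small illustration (my own):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# A 4-class toy problem, so one-vs-one trains C(4, 2) = 6 pairwise classifiers.
X, y = make_classification(n_samples=200, n_features=20, n_informative=6,
                           n_classes=4, random_state=0)

ovo = SVC(decision_function_shape="ovo").fit(X, y)
ovr = SVC(decision_function_shape="ovr").fit(X, y)
print(ovo.decision_function(X[:1]).shape)  # (1, 6): one column per class pair
print(ovr.decision_function(X[:1]).shape)  # (1, 4): one column per class
```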

References:

Li Hang, Statistical Learning Methods (highly recommended)

Zhou Zhihua, Machine Learning

Parts compiled from online materials.

SVM kernels: http://blog.pluskid.org/?p=685&cpage=1

Original: https://blog.csdn.net/szlcw1/article/details/52259668
