Detailed explanation of SVM in sklearn

1. SVC function prototype

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

2. Parameter Description

  • C: the penalty parameter; the default is 1.0. The larger C is, the heavier the penalty on the slack variables, pushing them toward 0: misclassification is punished more, the model tries to classify the whole training set correctly, so training accuracy is high, but it overfits easily (high variance, weak generalization). The smaller C is, the lighter the penalty on misclassification: some points are tolerated as noise and generalization is stronger, but if C is too small the model underfits (high bias).
  • cache_size: size of the kernel cache (in MB); the default is 200.
  • class_weight: per-class weights, passed as a dictionary. For the listed classes, the penalty parameter becomes weight * C (where C is the C of C-SVC).
  • coef0: the constant term of the kernel function. Only significant for 'poly' and 'sigmoid'.
  • decision_function_shape: 'ovo', 'ovr' or None, default='ovr'.
  • degree: the degree of the 'poly' polynomial kernel; the default is 3. Ignored when another kernel is selected.
  • gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The default is 'auto', which means 1 / n_features.
  • kernel: the kernel function. The default is 'rbf'; the options are 'linear', 'poly', 'rbf', 'sigmoid' and 'precomputed':
      0 - linear: u'*v
      1 - polynomial: (gamma*u'*v + coef0)^degree
      2 - RBF: exp(-gamma*|u-v|^2)
      3 - sigmoid: tanh(gamma*u'*v + coef0)
  • max_iter: maximum number of iterations; -1 means no limit.
  • probability: whether to enable probability estimates. The default is False.
  • random_state: seed for shuffling the data; an int value.
  • shrinking: whether to use the shrinking heuristic; the default is True.
  • tol: tolerance for the stopping criterion; the default is 1e-3.
  • verbose: whether to enable verbose output.
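
As a quick illustration, here is a minimal sketch of constructing an SVC with several of the parameters above (the values are arbitrary examples, not recommendations):

from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Toy two-class data, just to have something to fit
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Example values only; the defaults are usually a reasonable starting point
clf = SVC(C=1.0, kernel='rbf', gamma='auto', tol=1e-3,
          probability=False, shrinking=True, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))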

Detailed discussion of each parameter

1. The parameter C

C trades off generalization against accuracy. In general it does not need to be modified; if generalization is weak, you can reduce the value of C. The larger C is, the better the fit on the training set, but the model may overfit; the smaller C is, the more tolerant the model is of errors, and the worse the results on the training set tend to be. C and gamma are the two most important SVM parameters to tune: their values directly determine the overall quality of the final model, so they are usually searched jointly (see the sketch below).
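
Since C and gamma are usually tuned together, here is a minimal grid-search sketch (the grid values and the iris dataset are just illustrative stand-ins):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Log-spaced grids are a common starting point; values are illustrative
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)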

2. The parameter cache_size

This parameter is self-explanatory: it sets the size (in MB) of the kernel cache the SVM may use during training; the default is 200.
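
If training on a large dataset is slow, raising the cache size may help; a minimal sketch (1000 MB is an arbitrary example value):

from sklearn.svm import SVC

# A larger kernel cache (in MB) can speed up training on big datasets
clf = SVC(cache_size=1000)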

3. The parameter class_weight

A dictionary or the string 'balanced'; the default is None.
It sets a separate penalty parameter C for each class. If it is not given, every class gets weight 1, i.e., the penalty is just the parameter C described earlier.
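
A minimal sketch of both forms (the 5:1 weighting is a hypothetical example):

from sklearn.svm import SVC

# Penalize mistakes on class 1 five times as hard as on class 0
clf = SVC(class_weight={0: 1, 1: 5})

# Or let sklearn weight each class inversely to its frequency
clf_balanced = SVC(class_weight='balanced')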

4. The parameter coef0

The default value is 0. It is the constant term of the poly and sigmoid kernels. For the poly kernel, when the values of <x, y> are close to one another they are hard to tell apart, and coef0 helps the polynomial kernel measure the difference between them; the default of 0 is generally fine. coef0 reflects the influence of the higher-order polynomial terms relative to the lower-order ones: if the model overfits, you can reduce coef0; if it underfits, you can try increasing coef0.

5. The parameter degree

Only used by the poly kernel. When the amount of data reaches 200,000 samples or more, a degree-3 polynomial converges extremely slowly, so it is not recommended in practical applications.

6. The parameter gamma

A float used as a parameter of the three kernels above ('rbf', 'poly', 'sigmoid'). It implicitly determines the distribution of the data after mapping to the new feature space. The larger gamma is, the fewer the support vectors; the smaller gamma is, the more the support vectors. The number of support vectors affects the speed of training and prediction.
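
One way to observe this relationship is to inspect the fitted model's n_support_ attribute; a minimal sketch (the gamma values and the iris dataset are just illustrative):

from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Count the support vectors for several gamma values
for gamma in [0.01, 0.1, 1, 10]:
    n_sv = SVC(kernel='rbf', gamma=gamma).fit(X, y).n_support_.sum()
    print(gamma, n_sv)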

7. The parameter kernel
It specifies the SVM kernel function. Different kernels can have a fairly large impact on the final classification result. 'precomputed' means the kernel matrix is computed in advance: the algorithm then does not compute the kernel matrix internally but directly uses the matrix you supply. The other kernels are discussed below.

  • Linear kernel: κ(x, xi) = x⋅xi

The linear kernel is mainly used for the linearly separable case. The feature space and the input space have the same dimensionality, there are few parameters, and it is fast. For linearly separable data the classification results are satisfactory, so we typically first try the linear kernel and, if the results are poor, switch to another kernel.

  • Polynomial kernel function

κ(x, xi) = ((x⋅xi) + 1)^d
The polynomial kernel can map a low-dimensional input space to a higher-dimensional feature space, but it has more parameters than the other kernels, and when the degree of the polynomial is high, the entries of the kernel matrix tend toward infinity or toward zero, so the computational complexity becomes too large to handle.

  • Gaussian (RBF) kernel function

κ(x, xi) = exp(−||x − xi||² / δ²)
The Gaussian radial basis function is a kernel with strong locality; it can map a sample into a higher-dimensional space. It is the most widely used kernel: it performs relatively well whether the sample is large or small, and it has fewer parameters than the polynomial kernel. So when you do not know which kernel to use, in most cases the Gaussian kernel is a good first choice.

  • sigmoid kernel function

κ(x, xi) = tanh(η⟨x, xi⟩ + θ). With the sigmoid kernel, the support vector machine implements a kind of multilayer neural network.
Therefore, when selecting a kernel function: if we have some prior knowledge of our data, use that prior to pick a kernel that matches the data distribution; if we do not, the usual approach is cross-validation, trying different kernels and keeping the one with the lowest error as the best. Multiple kernels can also be combined to form a hybrid kernel.
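
A minimal cross-validation sketch of this kernel-selection approach (iris is just a stand-in dataset):

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Compare mean cross-validated accuracy across the built-in kernels
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.mean())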

Finally:
Andrew Ng's course also gives a rule of thumb for choosing a kernel:
if the number of features is large and similar to the number of samples, choose LR or a linear-kernel SVM;
if the number of features is small and the number of samples is moderate, choose an SVM with a Gaussian kernel;
if the number of features is small and the number of samples is large, manually add some features first, reducing the problem to the first case.
In practice, you can test multiple kernels and choose the best-performing kernel function.

8. The parameter probability

The default is False. It decides whether to output the probability of each possible class. Note that to obtain these probabilities, the prediction call should be changed to clf.predict_proba.
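
A minimal sketch (iris is just a stand-in dataset; enabling probabilities adds an internal cross-validation step, so training is slower):

from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

clf = SVC(probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))  # per-class probability estimates
print(clf.predict(X[:3]))        # plain class labels still available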

9. The parameter shrinking

The default is True, i.e., the shrinking heuristic is used. If we can predict which variables correspond to support vectors, then it is enough to train on those samples, and the other samples can be ignored. This does not affect the training result, but it reduces the size of the problem and helps solve it faster, accelerating training.

10. The parameter tol

The error tolerance at which SVM training stops; a float parameter, defaulting to 1e-3. It is the precision standard for stopping training.

11. The parameter verbose

Whether to enable verbose output.
This setting takes advantage of a per-process runtime setting in libsvm; if enabled, it may not work correctly in a multithreaded context. It is generally set to False, and you can simply leave it at that.

You can also refer to this blog: https://xijunlee.github.io/2017/03/29/sklearn in SVM parameter adjustment instructions and lessons learned /

Origin blog.csdn.net/qq_42780025/article/details/92397765