Support Vector Machine (Essay)

1. Support Vector Machine (SVM)

Support vector machine (SVM) is a powerful and versatile machine learning model capable of linear and nonlinear classification, regression, and even outlier detection. SVMs are particularly well suited to classifying small and medium-sized complex datasets.
What do "support", "vector", and "machine" mean here? "Machine" simply means algorithm; the name survives for historical reasons. "Vector" refers to the samples: each instance is represented as a vector during classification. "Support" means to hold up, to determine. Classification is usually done by finding a separating hyperplane (a line in the two-dimensional case): one side of the hyperplane is the positive class, the other the negative class. The support vectors are the vectors that determine this separating hyperplane. A support vector machine can therefore be understood as a classification algorithm that separates classes with a hyperplane, and that hyperplane is determined by the support vectors.

The SVM model has two very important hyperparameters, C and gamma. C is the penalty coefficient, i.e. the tolerance for errors. The larger C is, the less error is tolerated and the easier it is to overfit; the smaller C is, the more error is tolerated and the easier it is to underfit.
Gamma is a parameter of the RBF kernel. It implicitly determines how the data are distributed after being mapped into the new feature space: the larger gamma is, the fewer the support vectors; the smaller gamma is, the more the support vectors. The number of support vectors affects training and prediction speed.
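A minimal sketch (assuming scikit-learn is installed) of how C and gamma are set on an RBF-kernel SVC, and how to inspect the support vectors a fitted model keeps. The two-moons toy dataset here is an illustration, not from the original text:

```python
# Hypothetical example: inspect how many support vectors an RBF SVC keeps
# for different gamma values on a small synthetic dataset.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

for gamma in (0.1, 1.0, 10.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)
    # clf.support_ holds the indices of the training instances
    # that the model kept as support vectors
    print(f"gamma={gamma}: {len(clf.support_)} support vectors")
```

Counting `clf.support_` for each setting is a quick way to see how these hyperparameters trade off model complexity against training/prediction speed.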

1.1 Linear SVM classification

SVM is very sensitive to feature scaling. Unlike a Logistic Regression classifier, an SVM classifier does not output a probability for each class. As Figure 5-2 shows, in the left plot the vertical scale is much larger than the horizontal scale, so the widest possible "street" is close to horizontal. After feature scaling (for example, with Scikit-Learn's StandardScaler), the decision boundary looks much better (right plot).
[Figure 5-2: decision boundaries before and after feature scaling]
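A hedged sketch of the scaling advice above: wrap StandardScaler and a linear SVM in a Pipeline so features are standardized before training. The tiny dataset with wildly different feature scales is invented for illustration:

```python
# Hypothetical example: scale features before fitting a linear SVM.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy data: the second feature's scale is ~1000x the first's.
X = [[1.0, 1000.0], [2.0, 2000.0], [3.0, -1000.0], [4.0, -2000.0]]
y = [0, 0, 1, 1]

# The pipeline standardizes each feature, then fits the linear SVM.
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
svm_clf.fit(X, y)
print(svm_clf.predict([[2.5, 1500.0]]))
```

Without the scaler, the SVM would tend to neglect the small-scale feature, as the left plot of Figure 5-2 illustrates.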

Soft margin classification:

If we strictly require all instances to be off the street and on the correct side, this is hard margin classification. Hard margin classification has two main problems: first, it only works when the data are linearly separable; second, it is very sensitive to outliers.
To avoid these problems, it is better to use a more flexible model. The goal is to find a good balance between keeping the street as wide as possible and limiting margin violations (that is, instances that end up on the street, or even on the wrong side). This is soft margin classification.
The hyperparameter C controls this balance: the smaller C is, the wider the street, but the more margin violations. If the model is overfitting, reducing C regularizes it.


1.2 Non-linear SVM classification

One way to handle nonlinear datasets is to add more features, such as polynomial features; in some cases this can make the dataset linearly separable. After adding an x^2 feature, for example, the dataset becomes separable.
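A sketch of the idea above (assuming scikit-learn): add polynomial features, then fit a linear SVM. The 1-D dataset below is the classic "separable after adding x²" illustration, not the book's exact data:

```python
# Hypothetical example: a 1-D dataset that is not linearly separable
# becomes separable after adding an x^2 feature.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X = np.array([[-3.0], [-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, 0, 0, 0, 1, 1])  # class depends on |x|, not x

poly_clf = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds the x^2 column
    StandardScaler(),
    LinearSVC(C=10.0, max_iter=10_000),
)
poly_clf.fit(X, y)
print(poly_clf.score(X, y))
```

In the new (x, x²) space a horizontal line at x² between 1 and 4 separates the classes, so the linear SVM can fit the data perfectly.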

Polynomial kernel

Adding polynomial features is simple to implement and works well with all kinds of machine learning algorithms. But if the polynomial degree is too low, it cannot handle very complex datasets, while a high degree creates a huge number of features and makes the model too slow.
The kernel trick produces the same result as adding many polynomial features, even very high-degree ones, without actually adding them. Because no features are really added, there is no combinatorial explosion in their number. The trick is implemented by the SVC class.
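The kernel trick in code, as a sketch (assuming scikit-learn): SVC with a polynomial kernel behaves as if high-degree polynomial features had been added, without materializing them. The moons dataset and the hyperparameter values are illustrative assumptions:

```python
# Hypothetical example: a degree-3 polynomial kernel via the kernel trick.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=42)

# degree controls the polynomial order; coef0 controls how much the model
# is influenced by high-degree terms versus low-degree terms.
poly_kernel_clf = SVC(kernel="poly", degree=3, coef0=1, C=5.0)
poly_kernel_clf.fit(X, y)
print(poly_kernel_clf.score(X, y))
```

No polynomial feature columns are ever created; the kernel computes the equivalent dot products directly.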

Adding similarity features

Another technique for tackling nonlinear problems is to add similarity features. These features are computed by a similarity function, which measures how much each instance resembles a particular landmark.
Gaussian RBF kernel
The purpose of a kernel function is to turn a completely inseparable problem into a separable, or approximately separable, one.
The (Gaussian) radial basis function kernel, or RBF kernel, is a commonly used kernel function; it is the most widely used kernel in SVM classification.
$$\phi_\gamma(x, \ell) = \exp\left(-\gamma \|x - \ell\|^2\right)$$
Start with two landmarks at x_1 = -2 and x_1 = 1. For the instance x_1 = -1, the distances to the two landmarks are 1 and 2, so its new features are:
$$x_2 = \exp(-0.3 \times 1^2) \approx 0.74, \qquad x_3 = \exp(-0.3 \times 2^2) \approx 0.30$$
After adding the new features, the instances become easy to separate.
The simplest way to choose landmarks is to create one at the location of every instance, but if the training set is very large, this produces an equally large number of features.
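The landmark computation above, reproduced in plain Python. The Gaussian RBF similarity is φ(x, ℓ) = exp(−γ(x − ℓ)²); γ = 0.3 is assumed here to match the worked numbers:

```python
# Compute Gaussian RBF similarity features for one instance and two landmarks.
from math import exp

def rbf_similarity(x, landmark, gamma=0.3):
    """Gaussian RBF similarity between a 1-D instance and a landmark."""
    return exp(-gamma * (x - landmark) ** 2)

x = -1.0
landmarks = (-2.0, 1.0)          # distances from x are 1 and 2
features = [rbf_similarity(x, l) for l in landmarks]
print([round(f, 2) for f in features])  # → [0.74, 0.3]
```

Each instance is thus re-represented by its similarity to every landmark, which is what makes the transformed dataset (approximately) linearly separable.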

Computational complexity

With so many kernel functions, how do you decide which one to use? A rule of thumb is to always try the linear kernel first (remember, LinearSVC is much faster than SVC(kernel="linear")), especially when the training set is very large or has many features. If the training set is not too large, also try the Gaussian RBF kernel; it works well in most cases. If you still have spare time and computing power, you can use cross-validation and grid search to experiment with other kernels, especially any that are specialized for your dataset's structure.


1.3 SVM regression

SVM not only supports linear and nonlinear classification, but also supports linear and nonlinear regression.
For regression the objective is reversed: instead of trying to fit the widest possible street between two classes while limiting margin violations, SVM regression tries to fit as many instances as possible on the street while limiting margin violations (that is, instances off the street).
Adding more training instances within the margin does not affect the model's predictions, so the model is called ε-insensitive.
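A sketch of ε-insensitive SVM regression (assuming scikit-learn): epsilon sets the width of the street, and instances inside it contribute no loss. The noisy linear data below is invented for illustration:

```python
# Hypothetical example: linear SVM regression with an epsilon-wide street.
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1)
y = (4 + 3 * X + rng.randn(100, 1)).ravel()  # noisy y = 4 + 3x

# epsilon controls the street width; wider streets tolerate more deviation
# without penalty.
svr = LinearSVR(epsilon=0.5, random_state=42)
svr.fit(X, y)
print(svr.predict([[1.0]]))  # roughly 4 + 3*1 = 7
```

For nonlinear regression, the SVR class accepts the same kernels as SVC (e.g. `SVR(kernel="poly", degree=2)`).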

$$\min\;\; \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i$$

$$s.t.\;\; y_i(w \cdot \phi(x_i) + b) \geq 1 - \xi_i \quad (i=1,2,\dots,m)$$

$$\xi_i \geq 0 \quad (i=1,2,\dots,m)$$
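The soft-margin primal objective above can be evaluated directly for a candidate (w, b). A plain-NumPy sketch, assuming the identity feature map φ(x) = x for illustration:

```python
# Evaluate the soft-margin primal objective for a given weight vector and bias.
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # Slack variables xi_i = max(0, 1 - y_i (w . x_i + b)), i.e. the hinge loss
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)
    return 0.5 * np.dot(w, w) + C * xi.sum()

X = np.array([[2.0], [-2.0], [0.5]])
y = np.array([1.0, -1.0, 1.0])  # labels in {-1, +1}
w = np.array([1.0])

# Only the third instance violates the margin (margin 0.5 < 1, so xi = 0.5):
# objective = 0.5 * 1 + 1.0 * 0.5 = 1.0
print(soft_margin_objective(w, 0.0, X, y, C=1.0))  # → 1.0
```

The C term makes the trade-off explicit: a larger C charges more for each unit of slack, narrowing the street.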

Questions

1. What is the basic idea of a support vector machine?

Answer: The basic idea of a support vector machine is to fit the widest possible "street" between the classes. In other words, its goal is to maximize the margin around the decision boundary separating the training instances of the two classes (for regression, the objective is reversed). When performing soft margin classification, the SVM compromises between perfect separation and the widest possible street (that is, it allows a few instances to end up on the street). One more key point: when training on nonlinear datasets, remember to use a kernel function.

2. What is a support vector?

Answer: After a support vector machine is trained, the instances located on the "street" (see the previous answer), including its border, are called support vectors (for classification they lie within the two margin lines; for regression they lie outside them). The decision boundary is entirely determined by the support vectors. Instances that are not support vectors (that is, instances off the street) have no influence at all: you can delete them, add more instances, or move them around, and as long as they stay off the street they will not affect the decision boundary. Computing predictions involves only the support vectors, not the whole training set.

3. Why is it important to scale the input value when using SVM?

Answer: A support vector machine fits the widest possible "street" between classes, so if the training set is not scaled, the SVM tends to neglect features with smaller values (see Figure 5-2).

4. If the training set has tens of millions of instances and hundreds of features, should you use the original SVM problem or the dual problem to train the model?

Answer: This question applies only to linear SVMs, since kernelized SVMs can only use the dual problem. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the complexity of the dual form is proportional to something between m² and m³. So with millions of instances you should definitely use the primal problem, because the dual would be far too slow.

5. Suppose you train an SVM classifier with an RBF kernel and it seems to underfit the training set. Should you increase or decrease γ (gamma)? What about C?

Answer: If a support vector machine trained with the RBF kernel underfits the training set, there may be too much regularization. You should increase gamma or C (or both) to reduce regularization. (See the explanation of these parameters at the top.)

Note

Since I am a beginner and do not yet understand the algorithm's parameters well, much of this post borrows from and merges excerpts, so the sources are not individually credited; I hope readers will understand. I did not include much code, mainly because it would make the post too long, and I did not write out much of the mathematical derivation (as a freshman, I cannot follow most of it myself), so I have concisely written up the basics. If you want to dig into the algorithm's internals, this article may help: https://blog.csdn.net/v_JULY_v/article/details/7624837
This post mainly draws on "Hands-On Machine Learning with Scikit-Learn and TensorFlow". If you want this book (there is also plenty of material on Python and machine learning), follow me and send a private message, and I will deliver it to you one by one.
I am a novice after all; if there are mistakes, I hope for guidance and support rather than flames, and I hope to find like-minded friends to write and learn with.


Source: blog.csdn.net/weixin_45755332/article/details/105423765