[Study Notes] Chapter 6 of the Watermelon Book (Machine Learning): Support Vector Machines, Code Implementation, and Parameter Tuning

Preface

Support vector machine, hereinafter referred to as SVM, is mathematically the most difficult algorithm among those I have learned so far. It is also one of the most widely used and popular algorithms, and it can be used for classification, regression, and anomaly detection.

1. What is a hard-margin SVM?

In two-dimensional space, a straight line separates the positive and negative examples; in a higher-dimensional space, we can imagine a hyperplane that performs the same classification task.

The hard-margin support vector machine has two goals: (1) separate the positive and negative examples perfectly, and (2) keep the hyperplane as far as possible from the nearest positive and negative examples (the points circled in red in the figure below, also called support vectors); in other words, find the maximum margin (the distance between the two dashed lines). Goal (1) is something every classifier strives for. Goal (2) can be seen in the figure below: two other slanted lines can also separate the positive and negative examples, but imagine a point lying very close to one of those lines; the learner would then be prone to errors. Therefore their generalization ability must be weaker than that of the line in the middle.

[Figure: the maximum-margin separating line, with the support vectors circled in red and the margin shown between two dashed lines]

2. How does SVM work?

Like logistic regression and multiple linear regression, SVM has its own objective function (the formula in the figure below). Solving this convex optimization problem is exactly what yields the maximum margin.
[Figure: the hard-margin SVM optimization problem]
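The figure is not reproduced here; for reference, the standard hard-margin primal problem (presumably what the figure shows) can be written as

$$
\min_{\boldsymbol{w},b}\ \frac{1}{2}\lVert\boldsymbol{w}\rVert^{2}
\quad \text{s.t.}\quad y_i(\boldsymbol{w}^{\top}\boldsymbol{x}_i + b) \ge 1,\quad i=1,\dots,m,
$$

since maximizing the margin $2/\lVert\boldsymbol{w}\rVert$ is the same as minimizing $\frac{1}{2}\lVert\boldsymbol{w}\rVert^{2}$.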

Because the mathematics is genuinely too difficult for me, let me just summarize briefly. This convex optimization problem is usually solved via the Lagrangian dual, SMO, or gradient descent. Two points are worth remembering: (1) when the number of features is much larger than the number of samples, it is recommended to solve the problem via the Lagrangian dual, because it is more efficient; (2) the difficulty of solving grows with the number of samples m.
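To make those two points a bit more concrete (this is the standard derivation, not from the original notes): the Lagrangian dual of the hard-margin problem is

$$
\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j \boldsymbol{x}_i^{\top}\boldsymbol{x}_j
\quad \text{s.t.}\quad \sum_{i=1}^{m}\alpha_i y_i = 0,\ \ \alpha_i \ge 0,
$$

which has one variable $\alpha_i$ per sample (so its cost grows with m) and touches the samples only through the inner products $\boldsymbol{x}_i^{\top}\boldsymbol{x}_j$, which is exactly what the kernel trick in the next section exploits. SMO solves this dual by updating two $\alpha$'s at a time.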

3. Kernel function

The kernel function is a difficult concept, so let me explain it in my own plain words. When we encounter a linearly inseparable data set, we can map the samples from the original space into a higher-dimensional feature space (for example, from two dimensions to three), so that the originally inseparable data set becomes separable. In the Watermelon Book's original words: if the original space is finite-dimensional, that is, the number of attributes is finite, then there must exist a higher-dimensional feature space in which the samples are separable. But a problem arises: if this feature space has very high, or even infinite, dimension, directly computing inner products in it is very difficult. This is where the kernel function works its magic: it lets us compute, in the original sample space, a result that equals the inner product in the high-dimensional feature space.
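In symbols (the standard statement of the kernel trick, with $\phi$ denoting the mapping into the feature space):

$$
\kappa(\boldsymbol{x}_i, \boldsymbol{x}_j) = \langle \phi(\boldsymbol{x}_i), \phi(\boldsymbol{x}_j) \rangle .
$$

For example, the Gaussian (RBF) kernel $\kappa(\boldsymbol{x}_i,\boldsymbol{x}_j)=\exp\!\left(-\frac{\lVert\boldsymbol{x}_i-\boldsymbol{x}_j\rVert^{2}}{2\sigma^{2}}\right)$ corresponds to an infinite-dimensional feature space, yet it is cheap to evaluate in the original space.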

There are four common kernel functions: linear, poly, rbf, and sigmoid. These names correspond exactly to the values of sklearn's kernel parameter, as shown in the code below.
Their characteristics:

| name | full name | characteristics |
| --- | --- | --- |
| linear | linear kernel | Generally, run the model with the linear kernel first; don't ask why, it is simply faster. |
| poly | polynomial kernel | The polynomial kernel requires the degree parameter. |
| rbf | Gaussian radial basis function kernel | Generally, the Gaussian kernel performs well on almost any data set. |
| sigmoid | sigmoid kernel | Generally, the sigmoid kernel does not perform well, no matter the data set. |

The choice of kernel function is particularly important. If the kernel function is unsuitable, the samples are mapped into an unsuitable feature space, and the results will be poor.
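As a small illustration (a minimal sketch, not from the original post; the pipeline, scaler, and 5-fold cross-validation are my own choices), the four kernel names can be compared directly on the breast cancer data used later:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Compare the four values accepted by SVC's `kernel` parameter.
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{kernel:>7s}: mean accuracy = {scores.mean():.3f}")
```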

4. Soft Margin Support Vector Machine

The logic of the soft-margin support vector machine is almost the same as that of the hard-margin one. On top of it, the soft margin allows the learner to misclassify some samples during training. The constant C in front of the slack variables acts much like the regularization parameter in logistic regression. In the breast cancer example below, we can also see that the soft-margin support vector machine has stronger generalization ability.
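For reference, the standard soft-margin problem with slack variables $\xi_i$ (a textbook form, consistent with the description above) is

$$
\min_{\boldsymbol{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\lVert\boldsymbol{w}\rVert^{2} + C\sum_{i=1}^{m}\xi_i
\quad \text{s.t.}\quad y_i(\boldsymbol{w}^{\top}\boldsymbol{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 .
$$

A large C penalizes violations heavily (behaving more like a hard margin), while a small C tolerates more misclassifications and tends to generalize better on noisy data. In sklearn's SVC this is exactly the C parameter.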

5. Support Vector Regression (SVR)

To mention it briefly: the optimization problems of SVR and SVM are very similar. SVR takes the hyperplane as its center and builds an interval band of width 2ε around it; we hope this band covers as many of the samples as possible, and samples that fall inside it incur no loss.
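For reference, the standard $\epsilon$-insensitive SVR problem (with slack variables $\xi_i$ and $\hat{\xi}_i$ for points above and below the band, and $f(\boldsymbol{x})=\boldsymbol{w}^{\top}\boldsymbol{x}+b$) is

$$
\min_{\boldsymbol{w},b,\boldsymbol{\xi},\hat{\boldsymbol{\xi}}}\ \frac{1}{2}\lVert\boldsymbol{w}\rVert^{2} + C\sum_{i=1}^{m}(\xi_i + \hat{\xi}_i)
\quad \text{s.t.}\quad f(\boldsymbol{x}_i) - y_i \le \epsilon + \xi_i,\ \ y_i - f(\boldsymbol{x}_i) \le \epsilon + \hat{\xi}_i,\ \ \xi_i, \hat{\xi}_i \ge 0 .
$$

In sklearn this corresponds to SVR, whose C and epsilon parameters play the roles of $C$ and $\epsilon$ above.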

6. Code implementation

Take the breast cancer data set from sklearn as an example.

from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
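The original code stops at the imports. Below is a minimal sketch of how it could continue (the train/test split ratio, random_state, and the two C values are my own illustrative choices, not from the original post): it compares a very large C, which approximates a hard margin, against a moderate C, to look for the generalization difference mentioned in section 4.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the breast cancer data and hold out a test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A very large C approximates a hard margin; a moderate C gives a soft margin.
for C in (1e4, 1.0):
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
    model.fit(X_train, y_train)
    print(f"C={C:g}: train accuracy = {model.score(X_train, y_train):.3f}, "
          f"test accuracy = {model.score(X_test, y_test):.3f}")
```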


Origin: blog.csdn.net/weixin_52589734/article/details/113444644