【Machine Learning】From SVM to SVR

Note: In recent work I have been running into the SVM model frequently, including cases where it is used for regression, i.e. SVR. It has also been almost two years since I first learned about this model, and it took me a long time to build up even a little intuition at the beginning. So I am writing a summary here, comparing how the same model is used for classification and regression, and commemorating the second anniversary of my encounter with SVM! This summary will not involve many formulas; the hope is simply to gain a more intuitive understanding of SVM through visualization.

 

0. Support vector machine (SVM)


The original SVM algorithm was invented in 1963 by Vladimir Vapnik and Alexey Chervonenkis. In 1992, Bernhard E. Boser, Isabelle M. Guyon, and Vladimir Vapnik proposed a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes. The current standard version (soft margin) was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995.

In the 1990s, due to the decline of artificial neural networks (ANNs), SVM was the star algorithm for a long time. It is considered to be a theoretically beautiful and very practical machine learning algorithm.

In terms of theory, the SVM algorithm involves many concepts: margin, support vector, kernel function, duality, convex optimization, etc. Some of these are difficult to understand, such as the kernel trick and the dual problem. In terms of applications, besides being used as a supervised classification and regression model, SVM can also be used for unsupervised clustering and anomaly detection. Compared with the currently more popular deep learning (suited to large-scale nonlinear problems), SVM is very good at solving complex nonlinear problems on small and medium-sized training sets, and can even perform very well when there are more features than training samples (where deep learning is prone to overfitting). However, as the sample size $m$ increases, the computational complexity of the SVM model grows on the order of $m^2$ to $m^3$.

In the examples below, the iris dataset mentioned in the previous blog post is used.

Figure 1: Iris data set

 

1. The predecessor of SVM: Perceptron


The perceptron can be regarded as a stripped-down linear SVM, and the following can be proved mathematically:

For two classes of linearly separable data, the perceptron finds a straight line (or hyperplane) that completely separates the two classes in a finite number of steps.

The closer the two classes are, the more steps are required. However, the perceptron is only guaranteed to return some solution; the solution is not unique, as shown in the following figure:

 

Figure 2: Three different linear classifiers trained by the perceptron

 

1.1 A specific description of the binary classification problem

Training samples $x \in \mathbb{R}^{n}$, labels $y \in \{-1, 1\}$; for a linear classifier:

  • Parameters: $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$
  • Decision boundary: $w \cdot x + b = 0$
  • When classifying a new point $x$, the predicted label is $\mathrm{sign}(w \cdot x + b)$

Referring to the description above, when a point $x$ with label $y = 1$ is classified correctly, the value $w \cdot x + b$ is positive; when the label is $-1$, the value is negative. Therefore $y (w \cdot x + b) > 0$ can be used to uniformly represent correct classification, and $y(w \cdot x + b) < 0$ to represent misclassification.
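To make this condition concrete, here is a tiny NumPy sketch; the weight vector, points, and labels are made up purely for illustration:

```python
import numpy as np

w, b = np.array([0.8, -0.4]), 0.1
x_pos, x_neg = np.array([2.0, 1.0]), np.array([-1.0, 2.0])   # one sample per class

for x, y in [(x_pos, 1), (x_neg, -1)]:
    score = np.dot(w, x) + b
    print(y * score > 0)   # True means the point is classified correctly
```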

 

1.2 Cost function

When the classification is correct, that is, $y (w \cdot x + b) > 0$, $loss = 0$;

On misclassification, i.e. $y (w \cdot x + b) \leq 0$, $loss = -y (w \cdot x + b)$.

 

1.3 The flow of the algorithm

The model is trained with stochastic gradient descent: one sample is used at a time, and the parameters are updated according to the gradient of the cost function:

step 1: Initialize $w = 0, b = 0$;

step 2: Loop over the training set, taking one sample $(x, y)$ at a time:

           if $y (w \cdot x + b) \leq 0$ (the sample is misclassified):

               $w \leftarrow w + yx$

               $b \leftarrow b + y$

Looking at this process, one sample point is taken at a time to update the model, and the parameters change only when that sample is misclassified. When all samples are classified correctly, training ends.
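Below is a minimal sketch of this training loop in Python; the feature choice (petal length and width), the setosa-vs-rest relabeling, and the cap on the number of passes are my own choices for illustration, not part of the original post:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, (2, 3)]               # petal length, petal width (illustrative choice)
y = np.where(iris.target == 0, 1, -1)  # setosa -> +1, the rest -> -1

w = np.zeros(X.shape[1])
b = 0.0
for epoch in range(100):               # cap the number of passes over the data
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified sample
            w += yi * xi                    # w <- w + y x
            b += yi                         # b <- b + y
            errors += 1
    if errors == 0:                    # every sample classified correctly: stop
        break

print("w =", w, "b =", b)
```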

 

2. SVM - Linearly Separable


When the two classes of samples are linearly separable, the perceptron is guaranteed to find a solution that correctly distinguishes them. However, the solution is not unique, and the quality of these decision boundaries is not the same. Intuitively, the larger the margin on both sides of the line, the better. So is there a way to find this optimal solution directly? This is exactly what the linear SVM does.

Intuitively, the more constraints there are, the more restricted the model, and therefore the fewer solutions exist. The solution of the perceptron is not unique, so adding stronger constraints to the perceptron's cost function should reduce the number of solutions. This is indeed the case.

2.1 Cost function of SVM

When the classification is correct with sufficient margin, i.e. $y (w \cdot x + b) \geq 1$: $loss = 0$;

When the margin is violated, i.e. $y (w \cdot x + b) < 1$: $loss = 1 - y (w \cdot x + b)$.

By comparison, you can see that $w \cdot x + b$ originally only needed to be greater than 0 or less than 0, but now it needs to be greater than 1 or less than -1. I don't have a very intuitive explanation for why 1 is chosen here, but one point is very important: what used to be a single straight line is now a wide band. Previously, two points with very small differences (for example, two points near the boundary where $w \cdot x + b$ equals $-0.001$ and $0.001$) could be assigned to two different classes; now the two classes must be separated by a distance of at least $\frac{2}{||w||}$, as shown in the following figure.

Figure 3: The maximum-margin hyperplane obtained by training an SVM on samples from two classes. The sample points on the margin are also called support vectors.
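For reference, the piecewise cost defined in Section 2.1 is the hinge loss; a small NumPy sketch of it (my own illustration, not from the original post):

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Mean hinge loss: 0 where y*(w.x+b) >= 1, else 1 - y*(w.x+b)."""
    margins = y * (X @ w + b)
    return np.maximum(0.0, 1.0 - margins).mean()
```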

2.2 Decision Boundaries and Intervals

Figure 3 is from the wiki. For consistency, the decision boundary is still defined as $w \cdot x + b = 0$, and the boundaries on both sides (the two dashed lines) are $w \cdot x + b = 1$ and $w \cdot x + b = -1$; the two equations differ only in the sign of the right-hand constant and are otherwise the same. Here $w, b$ are the parameters to be optimized during model training. The following information can be read off the diagram above:

  • The distance between the two dotted lines is $\frac{2}{||w||}$;
  • The direction of the parameter $w$ to be optimized is the normal vector direction of the decision boundary ($w$ is perpendicular to the decision boundary);
  • In Figure 3, three points lie on the dashed boundaries; these three points are the support vectors.

Here is the procedure for calculating the distance between the two dashed lines:

After expanding the vector form of the decision boundary $w \cdot x + b = 0$, we obtain $w_1 x_1 + w_2 x_2 + b = 0$.

Converting to slope-intercept form gives $x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$, so the slope is $-\frac{w_1}{w_2}$ and the intercept is $-\frac{b}{w_2}$.

A direction vector of the line is $(1, -\frac{w_1}{w_2})$ (set $x_1 = 1$ and drop the intercept to read off $x_2$).

The normal vector of the line is $w = (w_1, w_2)$.

Therefore, for the line $w \cdot x + b = 1$, the slope-intercept form is $x_2 = -\frac{w_1}{w_2} x_1 + \frac{1 - b}{w_2}$, which is the decision boundary translated upward along the $x_2$ axis by $\frac{1}{w_2}$. The distance between this line and $w \cdot x + b = 0$ along the normal direction is $\gamma = \sqrt{\frac{1}{w_1^2 + w_2^2}} = \frac{1}{||w||}$; see Figure 5.

Figure 5: The width of the margin $\gamma$
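The formula can also be checked numerically; the particular $w$ and $b$ below are arbitrary values chosen only for this check:

```python
import numpy as np

w = np.array([3.0, 4.0])
b = -2.0
w_norm = np.linalg.norm(w)                    # ||w|| = 5

x0 = np.array([0.0, -b / w[1]])               # a point on w.x + b = 0
x1 = x0 + (w / w_norm) * (1.0 / w_norm)       # move 1/||w|| along the unit normal

print(np.dot(w, x0) + b)                      # 0.0 -> x0 is on the decision boundary
print(np.dot(w, x1) + b)                      # 1.0 -> x1 is on the dashed line
print(np.linalg.norm(x1 - x0), 1 / w_norm)    # both 0.2 = 1/||w|| = gamma
```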

2.3 Optimization goals

In SVM, the goal of optimization is to maximize the margin width $\gamma$. Since $\gamma = \frac{1}{||w||}$, where $||w||$ is the norm of the parameter $w$ to be optimized, the optimization objective is equivalent to minimizing $||w||$, which can be expressed as:

For $(x^{(1)}, y^{(1)}), \ ..., \ (x^{(m)}, y^{(m)}) \in \mathbb{R}^d \times \{-1, 1\}$: $\min_{w \in \mathbb{R}^d, b \in \mathbb{R}}||w||^2$

s.t. $y^{(i)}(w \cdot x^{(i)} + b) \geq 1$ for all $i = 1, 2, ..., m$

 

The following is a comparison of the perceptron and SVM when separating the setosa class from the non-setosa samples in the iris dataset:

Figure 6: Perceptron Linear Classifier

 

Figure 7: Classification effect of linear SVM

Comparing Figure 6 and Figure 7, it can be seen that the margin around the decision boundary found by SVM is larger, so when classifying new, unseen samples, points that fall near the boundary are more likely to be classified correctly.
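A sketch of how the two classifiers behind Figures 6 and 7 might be trained with scikit-learn; the feature choice (petal length and width) and the very large C used to approximate a hard margin are my assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, (2, 3)]              # petal length, petal width
y = (iris.target == 0).astype(int)    # setosa vs. non-setosa

per_clf = Perceptron(random_state=42).fit(X, y)
svm_clf = SVC(kernel="linear", C=1e9).fit(X, y)   # very large C ~ (near) hard margin

# Both models define a decision boundary w.x + b = 0
print("Perceptron:", per_clf.coef_, per_clf.intercept_)
print("Linear SVM:", svm_clf.coef_, svm_clf.intercept_)
```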

 

3. SVM - Linearly inseparable


 

As can be seen in Figure 1, the setosa class is linearly separable from the other two classes, but the virginica class overlaps with the adjacent versicolor class, that is, they are linearly inseparable. SVM can still be used for classification in this case. The idea is to add slack variables $\xi_i$ to the cost function:

For $(x^{(1)}, y^{(1)}), \ ..., \ (x^{(m)}, y^{(m)}) \in \mathbb{R}^d \times \{-1, 1\}$: $\min_{w \in \mathbb{R}^d, b \in \mathbb{R}}||w||^2 + C\sum_{i=1}^{m}{\xi_i}$

s.t. $y^{(i)}(w \cdot x^{(i)} + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for all $i = 1, 2, ..., m$

After adding the slack variables to the optimization objective above, the model is allowed to violate the boundaries on both sides to a certain degree (controlled by $C$ in the formula above) and to tolerate a certain amount of misclassification, so that two classes of data that were originally linearly inseparable can still be separated.

The following is the classification effect of virginica and non-virginica when $C=1000$:

Figure 8: SVM classification effect after adding slack variables

As one of the hyperparameters of the SVM model, $C$ needs to be searched over a wide range, narrowing step by step until a suitable value is found. The larger the value of $C$, the higher the cost of margin violations and the less they are tolerated, approaching a hard margin; the smaller the value of $C$, the lower the cost of violations and the more they are tolerated, i.e., a soft margin. Even in the linearly separable case, setting $C$ very small may lead to misclassification; in the linearly inseparable case, setting $C$ too large can make training very hard to converge.
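A sketch of the soft-margin setup behind Figure 8; $C = 1000$ follows the text, while the other C values and the feature scaling are additions meant only to show how the hyperparameter changes the behavior:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, (2, 3)]              # petal length, petal width
y = (iris.target == 2).astype(int)    # virginica vs. non-virginica

for C in (0.01, 1, 1000):             # small C: softer margin; large C: closer to hard margin
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    clf.fit(X, y)
    print(f"C={C}: training accuracy = {clf.score(X, y):.3f}")
```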

 

4. SVR - Regression analysis using SVM


 

The Support Vector Regression (SVR) model uses SVM to fit curves and perform regression analysis. Classification and regression are the two most important classes of tasks in supervised machine learning. Unlike classification, where the output is one of a finite number of discrete values (such as $\{-1, 1\}$ above), the output of a regression model is continuous over some range. Instead of considering the different iris types, petal length (the independent variable $x$) is used here to predict petal width (the dependent variable $y$).

In the figure below, 80% of all 150 samples were randomly taken as the training set:

Figure 9: Training samples for training the SVR model

Here is the regression line trained using linear SVR:

Figure 10: Regression line trained by SVR model

Just as SVM uses a band for classification, SVR also uses a band to fit the data. The width of this band is set by the user and is controlled by the parameter $\epsilon$:

Figure 11: Schematic diagram of the regression effect of the SVR model, where the dots with red circles represent support vectors
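A sketch of how such a model might be fit with scikit-learn's SVR; the 80/20 split matches the text, while the $\epsilon$ and $C$ values are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

iris = load_iris()
X = iris.data[:, [2]]                 # petal length as the single feature
y = iris.data[:, 3]                   # petal width as the target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)

svr = SVR(kernel="linear", epsilon=0.2, C=1.0)   # epsilon controls the band width
svr.fit(X_train, y_train)

# Points on or outside the epsilon-band are the support vectors
print("number of support vectors:", len(svr.support_))
print("R^2 on the test set:", svr.score(X_test, y_test))
```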

In the SVM model, the points on the margin boundaries and the points that violate the margin between the two boundaries are the support vectors and play a role in subsequent predictions; in the SVR model, the points on the boundaries and the points outside the two boundaries are the support vectors and play a role in prediction. According to the dual form, the final model is a linear combination of all training samples, where the weights of the points that are not support vectors are 0. The cost function of the SVR model is shown below:

Figure 12: Cost function for soft margin SVR

As can be seen from Figure 12, the errors of the points inside the margin are all 0, and only the points outside the margin contribute to the error. So the task of SVR is to find a band of width controlled by $\epsilon$ that covers as many of the sample points as possible.
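A small sketch of this $\epsilon$-insensitive error; the arrays and the $\epsilon$ value are made up for illustration:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.2):
    # errors inside the band are 0; outside, they grow linearly
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 1.5, 2.0])
y_pred = np.array([1.1, 1.9, 1.2])
print(epsilon_insensitive_loss(y_true, y_pred))   # [0.  0.2 0.6]
```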

 

 

Reference


https://zh.wikipedia.org/wiki/%E6%94%AF%E6%8C%81%E5%90%91%E9%87%8F%E6%9C%BA

https://zhuanlan.zhihu.com/p/26263309, Various Forms of Line Equations

https://github.com/ageron/handson-ml/blob/master/05_support_vector_machines.ipynb

http://www.svms.org/regression/SmSc98.pdf

http://www.robots.ox.ac.uk/~az/lectures/ml/

UCSanDiegoX: DSE220x Machine Learning Fundamentals

 
