Machine Learning 04: Support Vector Machines

This article is from the Sync Blog.

PS: I don't know how to make Jianshu display mathematical formulas or typeset content nicely, so if the formatting below looks messy, please jump to the link above. In the future I will no longer post screenshots of mathematical formulas; inline formula screenshots make the layout messy anyway, and reading the original blog gives a better experience.

The previous article used the KNN algorithm to solve the classification problem of machine learning. This article introduces another so-called optimal algorithm for solving classification problems, called SVM (Support Vector Machine).

This article will not jump into SVM code immediately. Instead, it starts from theory to understand the principles, and finally uses code to practice and verify our understanding.

Fundamentals

The goal of SVM is to find a hyperplane that maximizes the gap between the hyperplane and the nearest sample points, cutting the sample points into different subspaces and thus achieving classification.

 
(Figure: comparison of cutting methods)

Look at the three ways of cutting the same sample space shown in the figure above. The last one is the cutting method that meets the SVM requirement.

A few points to keep in mind:

  1. In one-dimensional space, this hyperplane is a point;
  2. In two-dimensional space, this hyperplane is a line;
  3. In three-dimensional space, this hyperplane is a face;
  4. In higher-dimensional spaces, well, I can't picture it...

Relying only on points, lines, and planes, we can handle only linearly separable data. When the sample data cannot be cut linearly, is SVM no longer applicable? No. For this kind of scene, we can use transformation functions to turn the straight lines or planes into curves or curved surfaces for cutting.
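As a tiny illustration of that idea (a sketch with made-up numbers, not from the original article): one-dimensional points that no single threshold can separate become linearly separable after mapping each $x$ to the pair $(x, x^2)$:

import numpy as np

# Hypothetical 1D data that no single threshold can split:
# the '+' points sit on the outside, the '-' points in the middle.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([1, -1, -1, 1])

# Lift each sample to the 2D feature vector (x, x^2).
# In the lifted space the horizontal line x2 = 2.5 separates the classes:
# both '+' points lie above it, both '-' points lie below it.
phi = np.column_stack([x, x ** 2])
print(phi)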

In this article we derive the basic principle of SVM from the simple scene that can be cut linearly. The cases that cannot be handled by linear cuts will be discussed later.

Mathematical derivation

In order to find a hyperplane that meets the SVM requirement, we first need to describe it in mathematical language. Please see the image below:

 
(Figure: mathematical representation of the hyperplane)

In this space there are four sample points divided into two categories: the red ones represent the "+" class, and the yellow ones represent the "-" class. The solid line L in the figure is the hyperplane found in this space according to the SVM definition. The sample points A, B, C, D, etc. are all represented by vectors in the space. In particular, the points A, B, and C that fall on the dotted lines act as boundary points of the gap and contribute to the calculation of the hyperplane, so they are called support vectors. This should be the origin of the name Support Vector Machine.

Now we define a vector $\vec{w}$ that is perpendicular to the hyperplane L, i.e. the normal vector of L. Using the normal vector $\vec{w}$, the hyperplane L can be described as:

$$\vec{w}^{T}(\vec{x} - \vec{x_0}) = 0$$

Here $\vec{x_0}$ is a point where $\vec{w}$ intersects the hyperplane. Let $b = -\vec{w}^{T}\vec{x_0}$; then L can be expressed as:

$$Formula 1: \vec{w}^{T}\vec{x} + b = 0$$

For a new data point $\vec{u}$: if the point falls to the upper right of the hyperplane L, it is judged to be of the "+" class; if it falls to the lower left of L, it is judged to be of the "-" class. That is, the decision rule is:

$$
Rule: \begin{cases}
\vec{w}^{T}\vec{u} + b > 0, & \mbox{if }\vec{u}\mbox{ is class '+'} \\
\vec{w}^{T}\vec{u} + b < 0, & \mbox{if }\vec{u}\mbox{ is class '-'}
\end{cases}
$$
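As a minimal sketch (the values of $\vec{w}$ and $b$ below are made-up placeholders, not the result of any training), the decision rule translates directly into code:

import numpy as np

def classify(u, w, b):
    # SVM decision rule: the sign of w^T u + b decides the class.
    score = np.dot(w, u) + b
    return '+' if score > 0 else '-'

# Hypothetical hyperplane parameters, purely for illustration.
w = np.array([1.0, 1.0])
b = -3.0
print(classify(np.array([4.0, 2.0]), w, b))   # upper right of L, prints '+'
print(classify(np.array([0.5, 1.0]), w, b))   # lower left of L, prints '-'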

Our goal is to determine the hyperplane L. But since the length of $\vec{w}$ is not fixed, $b$ is not fixed either: infinitely many combinations of $\vec{w}$ and $b$ describe the same L. So next we use the sample data to constrain the value ranges of $\vec{w}$ and $b$ and provide a basis for solving them.

To start with, all sample points $\vec{x_i}$ satisfy:

$$
\begin{array}{lcl}
\vec{w}^{T}\vec{x_i} + b > 0,& \mbox{if }\vec{x_i}\mbox{ is class '+'} \\
\vec{w}^{T}\vec{x_i} + b < 0,& \mbox{if }\vec{x_i}\mbox{ is class '-'}
\end{array}
$$

To add constraints that limit the value range of $\vec{w}$, we translate the hyperplane toward the upper right and the lower left respectively, until it reaches the dashed lines. The expressions for the upper-right and lower-left dashed lines are $\vec{w}^{T}\vec{x_i} + b = 1$ and $\vec{w}^{T}\vec{x_i} + b = -1$. There must be a combination of $\vec{w}$ and $b$ that both describes the hyperplane and places these two expressions exactly on the dashed lines. Then all sample points $\vec{x_i}$ satisfy the following inequalities:

$$
Formula 2: \begin{cases}
\vec{w}^{T}\vec{x_i} + b \ge 1,& \mbox{if }\vec{x_i}\mbox{ is class '+'} \\
\vec{w}^{T}\vec{x_i} + b \leq -1,& \mbox{if }\vec{x_i}\mbox{ is class '-'}
\end{cases}
$$

This group of inequalities uses the known class of each sample point to constrain the value range of $\vec{w}$. Choosing "1" and "-1" is a mathematical trick that makes the subsequent calculation more convenient; it does not affect generality, because $\vec{w}$ and $b$ can always be rescaled together.

Next, introduce the categorical variable $y_i$:
$$
Formula 3: \begin{cases}
y_i = 1, & \mbox{if }\vec{x_i}\mbox{ is class '+'} \\
y_i = -1, & \mbox{if }\vec{x_i}\mbox{ is class '-'}
\end{cases}
$$

Multiplying the corresponding lines of Formula 2 and Formula 3, and moving the 1 on the right-hand side to the left, a unified inequality is obtained:

$$Formula 4: y_i(\vec{w}^{T}\vec{x_i} + b) - 1 \ge 0$$

In this inequality, the equality holds only for the sample points (e.g. A, B, C) that fall on the edges of the SVM splitting gap (the dotted lines in the figure), i.e. the support vectors; for all other points the left-hand side is strictly positive.
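A quick sanity check of Formula 4 on made-up numbers (both the points and the hyperplane are assumptions for illustration, not the figure's data): every sample yields a non-negative value, and the points lying on the dashed lines yield exactly zero:

import numpy as np

w, b = np.array([1.0, 1.0]), -3.0            # hypothetical hyperplane
X = np.array([[3.0, 1.0], [4.0, 2.0],        # '+' samples
              [1.0, 1.0], [0.0, 1.0]])       # '-' samples
y = np.array([1, 1, -1, -1])

margins = y * (X @ w + b) - 1
print(margins)    # [0. 2. 0. 1.]; the two zeros are the support vectors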

At this point, recall the SVM principle: find the hyperplane that maximizes the gap as the decision boundary for classification. We need a function to compute the width of the splitting gap. Assume samples $\vec{x_+}$ and $\vec{x_-}$ are taken on the two edges of the gap (the two dotted lines), with corresponding class values $y_+ = 1$ and $y_- = -1$. Since these points lie on the dashed lines, Formula 4 holds for them with equality, i.e. $\vec{w}^{T}\vec{x_\pm} = \frac{1}{y_\pm} - b$. Then:
$$
\begin{align}
width &= \frac{\vec{w}^{T}}{|\vec{w}|}(\vec{x_+} - \vec{x_-}) \\
&= \frac{1}{|\vec{w}|}(\vec{w}^{T}\vec{x_+} - \vec{w}^{T}\vec{x_-}) \\
&= \frac{1}{|\vec{w}|}\left(\left(\frac{1}{y_+} - b\right) - \left(\frac{1}{y_-} - b\right)\right) \\
&= \frac{1}{|\vec{w}|}\left((1 - b) - (-1 - b)\right) \\
&= \frac{2}{|\vec{w}|}
\end{align}
$$
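As a quick numeric check with the made-up $\vec{w} = (1, 1)^{T}$ from the sketches above:

$$
width = \frac{2}{|\vec{w}|} = \frac{2}{\sqrt{1^2 + 1^2}} = \sqrt{2} \approx 1.414
$$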

To maximize the width of the gap, we can derive:

$$
Formula 5: \begin{align}
& Maximize(width) \\
&\Leftrightarrow Maximize(\frac{2}{|\vec{w}|}) \\
&\Leftrightarrow Minimize(|\vec{w}|)\\
&\Leftrightarrow Minimize(\frac{1}{2}|\vec{w}|^2)\\
&\Leftrightarrow Minimize(\frac{1}{2}\vec{w}^{T}\vec{w})
\end{align}
$$

With this basis, we use the Lagrange multiplier method to solve this extreme-value problem under inequality constraints, constructing the following function:

$$
L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\vec{w}^{T}\vec{w} - \sum_{i}^{n}{\alpha_i[y_i(\vec{w}^{T}\vec{x_i} + b) - 1]}, \quad \alpha_i \ge 0, \; i = 1,2...n
$$

The goal here is to find the $\vec{w}$, $b$, and $\vec{\alpha}$ at which $L(\vec{w}, b, \vec{\alpha})$ is minimized, so we first take partial derivatives with respect to $\vec{w}$ and $b$:

$$
\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i}^{n}{\alpha_i y_i \vec{x_i}},
\frac{\partial L}{\partial b} = - \sum_{i}^{n}{\alpha_i y_i}
$$

Setting the partial derivatives equal to 0, we have:

$$
\vec{w} = \sum_{i}^{n}{\alpha_i y_i \vec{x_i}}, \sum_{i}^{n}{\alpha_i y_i} = 0
$$
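Once the $\alpha_i$ are known (from whatever solver is used), $\vec{w}$ follows directly from the first equation, and $b$ can be recovered from any support vector, for which $y_i(\vec{w}^{T}\vec{x_i} + b) = 1$ holds exactly. A minimal sketch, assuming alphas, X and y are already available (the helper name recover_w_b is made up for illustration):

import numpy as np

def recover_w_b(alphas, X, y, tol=1e-6):
    # w = sum_i alpha_i * y_i * x_i
    w = (alphas * y) @ X
    # Any sample with alpha_i > 0 is a support vector; there y_i(w.x_i + b) = 1,
    # and since y_i is +1 or -1, b = y_i - w.x_i.
    sv = np.where(alphas > tol)[0][0]
    b = y[sv] - np.dot(w, X[sv])
    return w, b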

Here, instead of taking the partial derivative with respect to $\vec{\alpha}$, we substitute the two equations above back into the Lagrangian function L, obtaining a function of $\vec{\alpha}$ alone:

$$
L(\vec{\alpha}) = \sum_{i}^{n}{\alpha_i} - \frac{1}{2}\sum_{i}^{n}{\sum_{j}^{n}{\alpha_i \alpha_j y_i y_j \vec{x_i}^{T} \vec{x_j}}}, \quad \sum_{i}^{n}{\alpha_i y_i} = 0, \; \alpha_i,\alpha_j \ge 0, \; i,j = 1,2...n
$$

According to the idea of the dual problem, the above minimization problem can be transformed into finding the $\vec{\alpha}$ that maximizes $L(\vec{\alpha})$ (refer to the KKT conditions).

Adding a negative sign to both sides of the formula above converts the problem once more: find the $\vec{\alpha}$ that minimizes $-L(\vec{\alpha})$. At this point we have:

$$
F(\vec{\alpha}) = -L(\vec{\alpha}) = \frac{1}{2}\vec{\alpha}^{T}\begin{bmatrix}
y_{1}y_{1}\vec{x_1}^{T}\vec{x_1} & y_{1}y_{2}\vec{x_1}^{T}\vec{x_2} & \cdots & y_{1}y_{n}\vec{x_1}^{T}\vec{x_n} \\
\vdots & \vdots & \ddots & \vdots \\
y_{n}y_{1}\vec{x_n}^{T}\vec{x_1} & y_{n}y_{2}\vec{x_n}^{T}\vec{x_2} & \cdots & y_{n}y_{n}\vec{x_n}^{T}\vec{x_n} \\
\end{bmatrix}\vec{\alpha}
+ \begin{bmatrix} -1 & -1 & \cdots & -1 \end{bmatrix} \vec{\alpha},
\quad \vec{y}^{T}\vec{\alpha} = 0, \quad \vec{\alpha} \ge 0
$$

Aha, this has the standard form of a quadratic program, so it can be solved with quadratic programming. We need to solve for the $\vec{\alpha}$ that minimizes $F(\vec{\alpha})$, and then obtain $\vec{w}$ and $b$ from $\vec{\alpha}$.

Here is a code example that uses scipy to solve a quadratic program:

import numpy as np
from scipy import optimize

# Objective of the form: F = (1/2)*x.T*H*x + c*x + c0
# Constraints: Ax <= b
# Assume the parameters are known as follows:
H = np.array([[2., 0.], [0., 8.]])
c = np.array([0, -32])
c0 = 64
A = np.array([[1., 1.], [-1., 2.], [-1., 0.], [0., -1.], [0., 1.]])
b = np.array([7., 4., 0., 0., 4.])

# Set the initial guess
x0 = np.random.randn(2)

def loss(x, sign=1.):
    return sign * (0.5 * np.dot(x.T, np.dot(H, x)) + np.dot(c, x) + c0)

def jac(x, sign=1.):
    return sign * (np.dot(x.T, H) + c)

cons = {'type': 'ineq',
        'fun': lambda x: b - np.dot(A, x),
        'jac': lambda x: -1 * A}
opt = {'disp': False}

res_cons = optimize.minimize(loss, x0, jac=jac, constraints=cons,
                             method='SLSQP', options=opt)
print(res_cons)
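For the SVM dual itself, the same optimize.minimize pattern can be reused: build H with entries $y_i y_j \vec{x_i}^{T}\vec{x_j}$, minimize $F(\vec{\alpha})$ subject to $\vec{y}^{T}\vec{\alpha} = 0$ and $\vec{\alpha} \ge 0$, then recover $\vec{w}$ and $b$ as described above. The toy data below are made-up values; this is a sketch under those assumptions, not the code from the Github repository:

import numpy as np
from scipy import optimize

# Toy 2D data, linearly separable (made-up values).
X = np.array([[3., 1.], [4., 2.], [4., 4.],
              [1., 1.], [0., 1.], [1., 0.]])
y = np.array([1., 1., 1., -1., -1., -1.])
n = len(y)

# H[i, j] = y_i * y_j * x_i . x_j
H = (y[:, None] * y[None, :]) * (X @ X.T)

def F(a):
    # F(alpha) = 1/2 * a^T H a - sum(a)
    return 0.5 * a @ H @ a - a.sum()

def F_jac(a):
    return H @ a - np.ones(n)

cons = ({'type': 'eq',   'fun': lambda a: y @ a},   # sum_i alpha_i * y_i = 0
        {'type': 'ineq', 'fun': lambda a: a})       # alpha_i >= 0

res = optimize.minimize(F, np.zeros(n), jac=F_jac,
                        constraints=cons, method='SLSQP')
alphas = res.x

w = (alphas * y) @ X                 # w = sum_i alpha_i * y_i * x_i
sv = int(np.argmax(alphas))          # index of a support vector
b = y[sv] - np.dot(w, X[sv])
print(w, b)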

See the source code on Github.
