Machine Learning: Support Vector Machines (hand-derived formulas edition)

Foreword

  The support vector machine (Support Vector Machine, SVM) originated in statistical learning theory. It is a binary classification model, and it is the most talked-about algorithm in machine learning. Yes, "the most", not merely one of the most.

1. Margins and support vectors

  The core idea of the support vector machine classification method is to find a hyperplane in the feature space to serve as the decision boundary that separates the samples into positive and negative classes, while keeping the model's generalization error on unseen data as small as possible.

  Hyperplane: in geometry, a hyperplane is a subspace whose dimension is one less than that of the space containing it. If the data space is three-dimensional, its hyperplanes are two-dimensional planes; if the data space is two-dimensional, its hyperplanes are one-dimensional lines.
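  As a quick illustration in code (a minimal sketch with made-up numbers, not tied to anything in the original post), a hyperplane is just a normal vector and a bias, and the side a point falls on is the sign of $\bm w^T\bm x+b$:

```python
import numpy as np

# A hypothetical 2-D hyperplane (a line): w^T x + b = 0
w = np.array([1.0, -2.0])   # normal vector, determines the orientation
b = 0.5                     # bias term, determines the offset from the origin

points = np.array([[3.0, 1.0],    # w^T x + b = 3 - 2 + 0.5 =  1.5 -> +1
                   [0.0, 2.0]])   # w^T x + b = 0 - 4 + 0.5 = -3.5 -> -1

scores = points @ w + b
labels = np.sign(scores)          # which side of the hyperplane each point is on
print(scores, labels)             # [ 1.5 -3.5] [ 1. -1.]
```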

[Figure: a two-class data set in the plane that a straight line separates with zero error]

  For example, for the data set above we can easily draw a line that divides the points into two classes with zero error. For a given data set there may be many such zero-error hyperplanes, for example:

[Figure: several different lines, each separating the same data with zero error]
  But such a model cannot be guaranteed to generalize well, that is, we cannot guarantee that the chosen hyperplane will also perform well on unseen data. So we introduce a term, the margin: translate the hyperplane we found towards both sides until it stops at the sample points closest to it, forming two new hyperplanes. The distance between these two new hyperplanes is called the "margin", and our hyperplane sits in the middle of this margin, i.e. its distance to each of the two translated hyperplanes is equal. The few sample points closest to the hyperplane are called support vectors.

[Figures: two candidate separating hyperplanes, one with a small margin and one with a wide margin]
  Comparing the two figures above: intuitively, both hyperplanes separate the samples into two classes, but once some noise is added it becomes obvious that the blue hyperplane tolerates local perturbations best, because it is "wide enough". If that is hard to picture, look at the following example:
[Figure: the same data with new samples added; hyperplane B1 still separates them correctly while B2 misclassifies some]
  Clearly, after introducing some new data samples, the hyperplane $B_1$ still has zero error and gives the most robust classification, while $B_2$ misclassifies some points because its margin is small. Therefore, when looking for a hyperplane, we want the margin to be as large as possible.
  That is the support vector machine: a classifier that separates the data by finding the hyperplane with the largest margin.
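  For readers who prefer to see this in code, here is a small sketch (assuming scikit-learn is available; the toy data are made up) that fits a linear SVM and reads back the hyperplane, the support vectors, and the margin width $2/||\bm w||$ derived in the next section:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy data set (made up for illustration)
X = np.array([[1, 1], [2, 2], [2, 0],      # negative class
              [4, 4], [5, 5], [5, 3]])     # positive class
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin (maximum-margin) classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]          # normal vector of the learned hyperplane
b = clf.intercept_[0]     # bias term
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```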
  The support vector machine model can be divided, from simple to complex, into the following three types:
  • Linearly separable support vector machine
  • Linear support vector machine
  • Nonlinear support vector machine
  When the training data are linearly separable, we learn a linear classifier by hard margin maximization; this is the linearly separable support vector machine, also called the hard margin support vector machine. When the training data are approximately linearly separable, we learn a linear classifier by soft margin maximization; this is the linear support vector machine, also called the soft margin support vector machine. When the training data are linearly inseparable, we learn a nonlinear support vector machine by combining the kernel trick with soft margin maximization.
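  As a rough sketch of how the three variants map onto scikit-learn (an assumption about tooling on my part, not something derived here), the difference mostly comes down to the penalty parameter C and the choice of kernel:

```python
from sklearn.svm import SVC

# Hard margin (linearly separable case): a very large C leaves essentially
# no tolerance for margin violations.
hard_margin = SVC(kernel="linear", C=1e6)

# Soft margin (approximately linearly separable case): a moderate C trades
# margin width against a few violations.
soft_margin = SVC(kernel="linear", C=1.0)

# Nonlinear SVM (linearly inseparable case): the kernel trick plus a soft
# margin, here with an RBF kernel.
nonlinear = SVC(kernel="rbf", C=1.0, gamma="scale")
```

  Each model is then fitted with `.fit(X, y)` in the usual way.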

  Simplicity is the foundation of complexity, and it is also a special case of complexity.

2. Mathematical description

[Figure: the separating hyperplane and its margin in the sample space]

  Given the training set $D=\{(\bm x_1,y_1),(\bm x_2,y_2),\dots,(\bm x_n,y_n)\},\ y_i \in \{-1,+1\}$ in the sample space above, any hyperplane can be expressed as $$\bm w^T\bm x+b=0$$ where $\bm w=(w_1,w_2,\dots,w_d)^T$ is the normal vector, which determines the orientation of the hyperplane, and $b$ is the displacement (bias) term, which determines the distance between the hyperplane and the origin. Clearly, the hyperplane is completely determined by the normal vector $\bm w$ and the displacement $b$.
  For convenience of derivation and calculation, we adopt the following convention: points above the hyperplane are labeled positive and points below it are labeled negative. That is, for $(\bm x_i,y_i)\in D$, if $y_i=+1$ then $\bm w^T\bm x_i+b>0$, and if $y_i=-1$ then $\bm w^T\bm x_i+b<0$. By rescaling $\bm w$ and $b$, this can be written as $$\begin{cases} \bm w^T\bm x_i+b\geq+1, & y_i=+1\\ \bm w^T\bm x_i+b\leq-1, & y_i=-1 \end{cases}$$ Here, $+1$ and $-1$ denote the relative distances from the two dashed lines parallel to the hyperplane (the margin boundaries) to the hyperplane itself.
  Then the distance from any point $\bm x$ in the sample space to the hyperplane can be written as $$r=\frac {|\bm w^T\bm x+b|} {||\bm w||}$$ From this, the sum of the distances from the support vectors of the two classes to the hyperplane, i.e. the margin, can be expressed as $$\gamma=\frac {2} {||\bm w||}$$ Our goal is to find the hyperplane with the largest margin, that is, the parameters $\bm w$ and $b$ that satisfy the constraints below and maximize $\gamma$: $$\underset {\bm w,b} {\max}\ \frac {2} {||\bm w||} \quad \text{subject to } y_i(\bm w^T\bm x_i+b)\geq1,\ i=1,2,\dots,n$$ Clearly, to maximize the margin $\gamma$ we only need to minimize $||\bm w||$, so the problem becomes $$\underset {\bm w,b} {\min}\ \frac {1} {2}||\bm w||^2 \quad \text{subject to } y_i(\bm w^T\bm x_i+b)\geq1,\ i=1,2,\dots,n$$

  In effect we simply take the reciprocal. Why add the square? As mentioned earlier when discussing the $L_2$ norm, squaring eliminates the square-root operation and simplifies the calculation.
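  To make the optimization problem concrete, here is a minimal numerical sketch (assuming SciPy is available; the toy data are made up) that hands exactly this problem, minimize $\frac{1}{2}||\bm w||^2$ subject to $y_i(\bm w^T\bm x_i+b)\geq1$, to a general-purpose constrained solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (made up for illustration)
X = np.array([[1.0, 1.0], [2.0, 0.0], [4.0, 4.0], [5.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(params):
    w = params[:-1]
    return 0.5 * np.dot(w, w)                  # (1/2)||w||^2

def margin_constraints(params):
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1.0               # must be >= 0 elementwise

result = minimize(
    objective,
    x0=np.zeros(X.shape[1] + 1),               # [w_1, w_2, b], all starting at 0
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": margin_constraints}],
)

w, b = result.x[:-1], result.x[-1]
print("w =", w, "b =", b, "margin =", 2 / np.linalg.norm(w))
```

  In practice one would use a dedicated quadratic-programming solver or the Lagrangian machinery developed next; the point here is only that the primal problem above is small and well defined enough to solve directly.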

3. Solving for the parameters

  To solve an optimization problem with constraints, the common approach is to introduce a Lagrange multiplier $\lambda$ and construct a Lagrangian function. This is really the conditional (constrained) extremum problem for multivariate functions; you can review conditional extrema and the standard Lagrange multiplier method in the second volume of any advanced calculus textbook. Let's briefly go over it below.

3.1 Lagrange multipliers

  To find the possible extreme points of the function $z=f(x,y)$ subject to the additional condition $\varphi(x,y)=0$, first form the Lagrangian function $$L(x,y)=f(x,y)+\lambda \varphi(x,y)$$ where $\lambda$ is a parameter. Take its first-order partial derivatives with respect to $x$, $y$ and $\lambda$ and set them to zero, which gives the system of equations $$\begin{cases} f_x(x,y)+\lambda \varphi_x(x,y)=0\\ f_y(x,y)+\lambda \varphi_y(x,y)=0\\ \varphi(x,y)=0 \end{cases}$$ Solving this system for $x$, $y$ and $\lambda$, the resulting $(x,y)$ is a possible extreme point of $f(x,y)$ under the additional condition $\varphi(x,y)=0$.
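  This procedure is mechanical enough to hand to a computer algebra system. Here is a minimal sketch (assuming SymPy, with a made-up objective and constraint) that mirrors the three equations exactly:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)

f = x * y                 # a hypothetical objective f(x, y)
phi = x + y - 1           # a hypothetical constraint phi(x, y) = 0

L = f + lam * phi                              # Lagrangian L = f + lambda*phi
eqs = [sp.diff(L, v) for v in (x, y, lam)]     # f_x + lam*phi_x, f_y + lam*phi_y, phi
print(sp.solve(eqs, [x, y, lam], dict=True))   # expect x = y = 1/2, lambda = -1/2
```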
  If the function has more than two independent variables or more than one additional condition, for example the function $u=f(x,y,z,t)$ under the conditions $$\varphi(x,y,z,t)=0, \qquad \psi(x,y,z,t)=0,$$ first form the Lagrangian function $$L(x,y,z,t)=f(x,y,z,t)+\lambda \varphi(x,y,z,t)+\mu \psi(x,y,z,t)$$ where $\lambda,\mu$ are parameters. Take its first-order partial derivatives with respect to $x$, $y$, $z$, $t$, $\lambda$ and $\mu$, set them to zero, and solve the resulting system of equations to obtain $(x,y,z,t)$.

  Now let's look at a small problem:
  find the maximum and minimum values of the function $u=x^2+y^2+z^2$ subject to the constraints $z=x^2+y^2$ and $x+y+z=4$.
[Figure: handwritten worked solution to this problem]
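  As a check on this problem, here is a small sketch (assuming SymPy is available) that builds the Lagrangian with two multipliers exactly as above, sets all partial derivatives to zero, and solves the system; it should find a minimum value of 6 at (1, 1, 2) and a maximum value of 72 at (-2, -2, 8).

```python
import sympy as sp

x, y, z, lam, mu = sp.symbols("x y z lambda mu", real=True)

u = x**2 + y**2 + z**2
phi = x**2 + y**2 - z          # constraint: z = x^2 + y^2
psi = x + y + z - 4            # constraint: x + y + z = 4

L = u + lam * phi + mu * psi
eqs = [sp.diff(L, v) for v in (x, y, z, lam, mu)]

for sol in sp.solve(eqs, [x, y, z, lam, mu], dict=True):
    if all(val.is_real for val in sol.values()):            # keep real candidates only
        print((sol[x], sol[y], sol[z]), "u =", u.subs(sol))  # expect u = 6 and u = 72
```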

3.2 Lagrangian dual function

  Convex optimization problem: our objective function is quadratic and its constraints are linear in the parameters; such a problem is a convex optimization problem (more precisely, a convex quadratic program).

  First construct the Lagrangian function of the support vector machine, i.e. its loss function: $$L(\bm w,b,\bm \alpha)=\frac {1} {2}||\bm w||^2+\sum_{i=1}^n\alpha_i\Big(1-y_i\big(\bm w^T\bm x_i+b\big)\Big),\quad \alpha_i \geq0$$ where $\bm\alpha=(\alpha_1,\alpha_2,\dots,\alpha_n)^T$.
  As you can see, the Lagrangian has two parts: the first part is the same as our original loss function, and the second part expresses our constraints. We want the constructed function not only to represent the original loss and the constraints, but also to express our intention of minimizing the loss to solve for $\bm w$ and $b$. So we first treat $\bm\alpha$ as the variable and maximize $L(\bm w,b,\bm\alpha)$, and then treat $\bm w$ and $b$ as the variables and minimize the result. Our goal can therefore be written as: $$\underset {\bm w,b} {\min}\ \underset {\alpha_i \geq0} {\max}\ L(\bm w,b,\bm\alpha)$$
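  To see why the inner maximization over $\bm\alpha$ enforces the constraints, here is a tiny numerical sketch (pure NumPy, with made-up data and parameter values; the cap on $\alpha$ stands in for $+\infty$): when every constraint is satisfied, the best the inner maximum can do is add nothing, so $L$ collapses back to $\frac{1}{2}||\bm w||^2$; when a constraint is violated, the corresponding $\alpha_i$ can be pushed arbitrarily high and $L$ blows up.

```python
import numpy as np

X = np.array([[1.0, 1.0], [4.0, 4.0]])     # made-up samples
y = np.array([-1.0, 1.0])

def inner_max(w, b, alpha_cap=1e6):
    """max over alpha_i in [0, alpha_cap] of L(w, b, alpha); alpha_cap stands in for +inf."""
    slack = 1.0 - y * (X @ w + b)                  # positive entries = violated constraints
    alpha = np.where(slack > 0, alpha_cap, 0.0)    # maximizer: huge alpha on violations, else 0
    return 0.5 * np.dot(w, w) + np.sum(alpha * slack)

w_feasible = np.array([0.4, 0.4])          # y_i(w^T x_i + b) >= 1 for both samples
print(inner_max(w_feasible, b=-2.0))       # -> 0.5*||w||^2 = 0.16

w_infeasible = np.array([0.1, 0.1])        # margin constraints violated
print(inner_max(w_infeasible, b=-0.5))     # -> huge value: the max "punishes" violations
```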
