Support vector machines (1)

At its core, the SVM searches the feature space for a hyperplane that separates the two classes of training samples while maximizing the margin, that is, maximizing the smallest distance from any sample point to the hyperplane. The perceptron is the foundation of the SVM, but the perceptron has no margin-maximization requirement; it only has to find some hyperplane that separates linearly separable data. In addition, because the kernel trick can be used in the SVM, the SVM is in essence a nonlinear classifier. From simple to complex, SVMs can be divided into three types.

1. Linearly separable SVM, also known as the hard-margin SVM

2. Linear SVM, also known as the soft-margin SVM

3. Nonlinear SVM

These three models are progressively more general; each earlier one is a special case of the one that follows. A quick sketch of all three is given below.
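
As a concrete illustration (a minimal scikit-learn sketch on made-up toy data; the data and parameter choices are assumptions, not part of the derivation that follows), a hard margin can be approximated with a linear kernel and a very large penalty C, the soft margin uses a moderate C, and the nonlinear SVM swaps in a kernel:

```python
# Minimal sketch of the three SVM flavors with scikit-learn.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, size=(20, 2))   # positive-class points
X_neg = rng.normal(loc=-2.0, size=(20, 2))   # negative-class points
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)

# 1. Hard-margin SVM: approximated by a linear kernel with a huge C,
#    so margin violations are (almost) forbidden.
hard = SVC(kernel="linear", C=1e6).fit(X, y)

# 2. Soft-margin SVM: a moderate C allows some points to violate the margin.
soft = SVC(kernel="linear", C=1.0).fit(X, y)

# 3. Nonlinear SVM: the kernel trick (here an RBF kernel) replaces the
#    inner product and yields a nonlinear decision boundary.
nonlinear = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

for name, model in [("hard", hard), ("soft", soft), ("nonlinear", nonlinear)]:
    print(name, model.score(X, y))
```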

 

The linearly separable SVM is a support vector machine designed for a linearly separable data set. "Linearly separable" means that there exists a hyperplane that separates the training samples completely. Before formalizing the model, we need to define two kinds of margin: the functional margin and the geometric margin.

1. The functional margin is defined as follows: for any sample point $(x_{i}, y_{i}), i = 1,2,3,...,N$,

$\hat{\gamma}_{i}=y_{i}*(w*x_{i}+b)$

Clearly, if the sample point belongs to the negative class ($y = -1$), then $w*x+b$ should be less than zero; if it belongs to the positive class, it should be greater than zero. Defined this way, the functional margin is positive for every correctly classified point.

However, the functional margin has a problem: its magnitude is not fixed. If we scale $w$ and $b$ by the same factor, the functional margin $\hat{\gamma}$ scales by that factor as well, yet the rescaled $w, b$ represent exactly the same hyperplane as before. To overcome this problem, the geometric margin is introduced.

2. The geometric margin

$\gamma_{i} = y_{i}*(\frac{w}{|w|}*x_{i}+\frac{b}{|w|})$

With this definition there is no longer any scaling ambiguity, because the coefficient of $x_{i}$, namely $w/|w|$, has its norm fixed to 1; the geometric margin is simply the signed Euclidean distance from the point to the hyperplane.
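
A quick numerical check of the two definitions (a NumPy sketch; the weight vector, bias, and sample point are made-up values): rescaling $w$ and $b$ changes the functional margin but leaves the geometric margin untouched.

```python
# Functional vs. geometric margin for a single sample point.
import numpy as np

w = np.array([2.0, 1.0])    # example weight vector (assumed values)
b = -1.0                    # example bias
x_i = np.array([1.0, 3.0])  # example sample point
y_i = 1                     # its label

def functional_margin(w, b, x, y):
    return y * (np.dot(w, x) + b)

def geometric_margin(w, b, x, y):
    return y * (np.dot(w, x) + b) / np.linalg.norm(w)

print(functional_margin(w, b, x_i, y_i))   # 4.0
print(geometric_margin(w, b, x_i, y_i))    # 4 / sqrt(5) ~ 1.789

# Rescaling (w, b) by 10 describes the same hyperplane:
# the functional margin grows tenfold, the geometric margin does not.
print(functional_margin(10 * w, 10 * b, x_i, y_i))   # 40.0
print(geometric_margin(10 * w, 10 * b, x_i, y_i))    # still ~ 1.789
```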

3. The formal definition of the linearly separable SVM - the primal form

Following the description above, we are interested in the minimum geometric margin, i.e. the smallest geometric margin over all sample points, which we denote by

$\gamma$

The corresponding functional margin is denoted by

$\hat{\gamma}$

The linearly separable SVM can then be formalized as

$argmax(\gamma)$

$s.t. y_{i}*(\frac{w}{|w|}*x_{i}+\frac{b}{|w|}) \ge \gamma,i=1,2,3,...,N$

These constraints look unwieldy because each of them contains the norm of $w$. Using the relation $\gamma = \hat{\gamma}/|w|$, the problem can be reduced to the following:

$argmax \frac{\hat{\gamma}}{|w|}$

$s.t. y_{i}*(w*x_{i}+b) \ge \hat{\gamma},i=1,2,3,...,N$

The functional margin $\hat{\gamma}$ still appears in the objective. But we know its magnitude is arbitrary: rescaling $w$ and $b$ rescales $\hat{\gamma}$ without changing the hyperplane, so we may simply fix $\hat{\gamma} = 1$, which amounts to multiplying $w$ and $b$ by a particular scale factor.

Also, because $argmax(\frac{1}{|w|})$ is equivalent to $argmin(\frac{1}{2}|w|^{2})$, the model above can be transformed into

$argmin \frac{1}{2}|w|^{2}$

$s.t. y_{i}*(w*x_{i}+b) \ge 1,i=1,2,3,...,N$

This is the primal form of the linearly separable SVM.
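
To make the primal concrete, here is a sketch that solves $argmin \frac{1}{2}|w|^{2}$ subject to $y_{i}*(w*x_{i}+b) \ge 1$ on a tiny hand-made dataset using a generic constrained solver (scipy's SLSQP); the three data points are illustrative assumptions:

```python
# Solve the hard-margin primal QP on a toy 2-D dataset with SLSQP.
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (illustrative values).
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

def objective(theta):
    w = theta[:2]
    return 0.5 * np.dot(w, w)             # (1/2)|w|^2

def margin_constraints(theta):
    w, b = theta[:2], theta[2]
    return y * (X @ w + b) - 1.0           # y_i*(w*x_i + b) - 1 >= 0

res = minimize(objective,
               x0=np.zeros(3),             # theta = (w1, w2, b)
               constraints=[{"type": "ineq", "fun": margin_constraints}],
               method="SLSQP")

w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)                  # expected roughly w = [0.5, 0.5], b = -2
```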

4. The dual representation of the linearly separable SVM

By Lagrange duality, we can construct the dual problem of the primal problem. First, write down the Lagrangian:

$L(w,b,\alpha)=\frac{1}{2}|w|^{2}-\sum_{i=1}^{N}\alpha_{i}*(y_{i}*(w*x_{i}+b)-1)$——(1)

By Lagrange duality, the primal problem corresponds to the min-max problem $argmin_{w,b}argmax_{\alpha}L(w,b,\alpha)$ of the Lagrangian, and its dual is the corresponding max-min problem, namely

$argmax_{\alpha}argmin_{w,b}L(w,b,\alpha)$

So the dual problem is

$argmax_{\alpha}argmin_{w,b}L(w,b,\alpha)$

$s.t. \alpha_{i} \ge 0,i=1,2,3,...,N$

(1) The inner minimization problem

Taking the partial derivatives of equation (1) with respect to $w$ and $b$, we obtain

$\nabla_{w}=w-\sum_{i=1}^{N}(\alpha_{i}*y_{i}*x_{i})$————(2)

$\nabla_{b}=-\sum_{i=1}^{N}(\alpha_{i}*y_{i})$————(3)

Setting (2) and (3) above to zero, we get

$w=\sum_{i=1}^{N}(\alpha_{i}*y_{i}*x_{i})$————(4)

$\sum_{i=1}^{N}(\alpha_{i}*y_{i})=0$————(5)

Substituting (4) and (5) back into equation (1), we obtain

$min_{w,b}L(w,b,\alpha)=-\frac{1}{2}|w|^{2}+\sum_{i=1}^{N}\alpha_{i}$

where $w$ on the right-hand side is given by (4).
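
In more detail, the substitution goes as follows (using only (4) and (5)):

$L(w,b,\alpha)=\frac{1}{2}|w|^{2}-\sum_{i=1}^{N}\alpha_{i}y_{i}(w*x_{i})-b\sum_{i=1}^{N}\alpha_{i}y_{i}+\sum_{i=1}^{N}\alpha_{i}=\frac{1}{2}|w|^{2}-w*w-0+\sum_{i=1}^{N}\alpha_{i}=-\frac{1}{2}|w|^{2}+\sum_{i=1}^{N}\alpha_{i}$

since by (4) $\sum_{i=1}^{N}\alpha_{i}y_{i}x_{i}=w$, so $\sum_{i}\alpha_{i}y_{i}(w*x_{i})=w*w=|w|^{2}$, and by (5) $\sum_{i=1}^{N}\alpha_{i}y_{i}=0$. Note also that $|w|^{2}=\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}*x_{j})$, which expresses the objective purely in terms of $\alpha$.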

(2) The outer maximization problem

Since the inner minimization has already been solved, the dual problem reduces to solving the outer maximization:

$argmax_{\alpha}(-\frac{1}{2}|w|^{2}+\sum_{i=1}^{N}\alpha_{i})$

$s.t. \alpha_{i} \ge 0,i=1,2,3,...,N$

Equivalently, flipping the sign (and expanding $|w|^{2}$ using (4)), this can be rewritten as a minimization problem:

$argmin_{\alpha}(\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}*x_{j})-\sum_{i=1}^{N}\alpha_{i})$

$s.t. \alpha_{i} \ge 0,i=1,2,3,...,N$

$\sum_{i=1}^{N}(\alpha_{i}*y_{i})=0$

This is the definition of the dual problem.
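
The dual is again a small quadratic program and can be solved with the same generic solver; the sketch below minimizes the negative of the dual objective under the two constraints, on the same illustrative toy points as before:

```python
# Solve the dual QP over alpha on toy data with SLSQP.
import numpy as np
from scipy.optimize import minimize

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
N = len(y)
K = X @ X.T                                # Gram matrix of inner products x_i * x_j

def neg_dual(alpha):
    # Negative dual objective: (1/2) sum_ij a_i a_j y_i y_j K_ij - sum_i a_i
    return 0.5 * (alpha * y) @ K @ (alpha * y) - alpha.sum()

constraints = [{"type": "eq", "fun": lambda a: np.dot(a, y)}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * N                                      # alpha_i >= 0

res = minimize(neg_dual, x0=np.zeros(N), bounds=bounds,
               constraints=constraints, method="SLSQP")
alpha = res.x
print("alpha =", alpha)    # non-zero entries mark the support vectors
```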

For the primal problem, reading off the separating hyperplane is simple, because $w$ and $b$ are themselves the optimization variables. But how do we recover them here, where the optimization variable is $\alpha$?

The KKT conditions are needed: $w$ is given by equation (4) above, and to obtain $b$ we pick any sample point whose $\alpha_{j}$ is non-zero (a support vector), for which $y_{j}*(w*x_{j}+b)=1$ and hence $b=y_{j}-w*x_{j}$.
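
Continuing the toy example, the sketch below recovers $w$ via equation (4) and $b$ from a support vector. To keep the snippet self-contained, the $\alpha$ values are hard-coded; they are the ones the dual solve above yields for these three points.

```python
# Recover the hyperplane (w, b) from the dual solution via (4) and the KKT conditions.
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
# Dual solution for this toy data (matches the dual solve sketched above).
alpha = np.array([0.25, 0.0, 0.25])

# Equation (4): w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# KKT: for any support vector j (alpha_j > 0), y_j * (w * x_j + b) = 1,
# hence b = y_j - w * x_j.
j = int(np.argmax(alpha > 0))
b = y[j] - np.dot(w, X[j])

print("w =", w)   # [0.5 0.5]
print("b =", b)   # -2.0
print("decision values:", X @ w + b)   # signs give the predicted classes
```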

 

