Support Vector Machine (SVM): Principles

SVM Introduction

A support vector machine (SVM) is a binary classification model. Its basic form is a linear classifier with the maximum margin defined in feature space; the maximum margin is what distinguishes it from the perceptron. With the kernel trick, the SVM becomes, in effect, a nonlinear classifier. The learning strategy of the SVM is margin maximization, which can be formalized as solving a convex quadratic programming problem; it is also equivalent to minimizing a regularized hinge loss function. The learning algorithm of the SVM is therefore an optimization algorithm for convex quadratic programming.
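
To make the margin-maximization view concrete, here is a minimal, hedged sketch using scikit-learn (the library call and the toy data are assumptions for illustration, not part of the original text):

import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable toy data.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

# A very large C approximates the hard-margin SVM derived below.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # w and b of the maximum-margin hyperplane
print(clf.support_vectors_)        # the support vectors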

 

SVM algorithm principle

The basic idea of the SVM is to find the separating hyperplane that both correctly divides the training data set and has the maximum geometric margin. For a linearly separable data set, $w \cdot x + b = 0$ denotes a separating hyperplane; there are infinitely many such hyperplanes (any of them would do for a perceptron), but the separating hyperplane with the maximum geometric margin is unique.

Before the derivation, we give some definitions. Suppose a training data set on the feature space is given:

$$T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

where $x_i \in \mathbb{R}^n$, $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, N$; $x_i$ is the $i$-th feature vector and $y_i$ is its class label: $y_i = +1$ when the example is positive and $y_i = -1$ when it is negative. We further assume that the training data set is linearly separable.

Geometric margin: for a given data set $T$ and hyperplane $w \cdot x + b = 0$, the geometric margin of a sample point $(x_i, y_i)$ with respect to the hyperplane is defined as

$$\gamma_i = y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right)$$

The minimum geometric margin of the hyperplane over all sample points is

$$\gamma = \min_{i=1,2,\ldots,N} \gamma_i$$

In fact, this is precisely the distance from the hyperplane to the so-called support vectors.
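
The geometric margin is easy to evaluate numerically. The sketch below uses a made-up hyperplane and a few made-up points purely for illustration (NumPy assumed installed):

import numpy as np

# A hypothetical hyperplane w·x + b = 0 and a few labeled points.
w = np.array([0.5, 0.5])
b = -2.0
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

# Geometric margin of each sample: gamma_i = y_i * (w·x_i + b) / ||w||
gamma_i = y * (X @ w + b) / np.linalg.norm(w)
gamma = gamma_i.min()   # the margin of the hyperplane is the smallest per-sample margin
print(gamma_i, gamma)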

Based on the above definitions, finding the maximum-margin separating hyperplane of the SVM model can be expressed as the following constrained optimization problem:

$$\max_{w, b} \;\; \gamma$$

$$\text{s.t.} \;\; y_i \left( \frac{w}{\|w\|} \cdot x_i + \frac{b}{\|w\|} \right) \ge \gamma, \quad i = 1, 2, \ldots, N$$

Dividing both sides of the constraint by $\gamma$ gives

$$y_i \left( \frac{w}{\|w\| \gamma} \cdot x_i + \frac{b}{\|w\| \gamma} \right) \ge 1$$

Because $\|w\|$ and $\gamma$ are both scalars, for brevity of notation we redefine

$$w = \frac{w}{\|w\| \gamma}$$

$$b = \frac{b}{\|w\| \gamma}$$

which yields

$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N$$

Also, maximizing $\gamma$ is equivalent to maximizing $\frac{1}{\|w\|}$, which in turn is equivalent to minimizing $\frac{1}{2}\|w\|^2$ (the factor $\frac{1}{2}$ only makes the later differentiation cleaner and does not affect the result). So finding the maximum-margin separating hyperplane of the SVM model can be expressed as the following constrained optimization problem:

$$\min_{w, b} \;\; \frac{1}{2} \|w\|^2$$

$$\text{s.t.} \;\; y_i (w \cdot x_i + b) \ge 1, \quad i = 1, 2, \ldots, N$$

This is a convex quadratic programming problem with inequality constraints; we can obtain its dual problem by using Lagrange multipliers.

First, we convert the original constrained objective function into the unconstrained Lagrangian objective function:

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right)$$

where $\alpha_i$ is a Lagrange multiplier with $\alpha_i \ge 0$. Now let

$$\theta(w) = \max_{\alpha_i \ge 0} L(w, b, \alpha)$$

When a sample point does not satisfy the constraint, i.e. it lies outside the feasible region:

$$y_i (w \cdot x_i + b) < 1$$

In this case, we can let $\alpha_i$ go to infinity, so that $\theta(w)$ is also infinite.

When the point satisfies the constraint, i.e. it lies in the feasible region:

$$y_i (w \cdot x_i + b) \ge 1$$

In this case, $\theta(w)$ is the original objective function itself. Combining the two cases gives our new objective function:

$$\theta(w) = \begin{cases} \dfrac{1}{2} \|w\|^2, & y_i (w \cdot x_i + b) \ge 1 \\ +\infty, & \text{otherwise} \end{cases}$$

So the original constrained problem is equivalent to

$$\min_{w, b} \theta(w) = \min_{w, b} \max_{\alpha_i \ge 0} L(w, b, \alpha) = p^*$$

In this new objective function we first take a maximum over $\alpha$ and then a minimum over $w$ and $b$. Written this way, we would have to solve for the parameters $w$ and $b$ while $\alpha_i$ is still subject to inequality constraints, which is hard to do directly. So we use Lagrangian duality and swap the positions of the minimum and the maximum, which gives

$$\max_{\alpha_i \ge 0} \min_{w, b} L(w, b, \alpha) = d^*$$

For this exchange to be exact, i.e. for $d^* = p^*$ to hold, two conditions need to be met:

① the optimization problem is a convex optimization problem;

② the KKT conditions are satisfied.

The optimization problem here is clearly convex, so the first condition is satisfied; satisfying the second condition means requiring

$$\begin{cases} \alpha_i \ge 0 \\ y_i (w \cdot x_i + b) - 1 \ge 0 \\ \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0 \end{cases}$$

To obtain the specific form of the dual problem, set the partial derivatives of $L(w, b, \alpha)$ with respect to $w$ and $b$ to 0, which gives

$$w = \sum_{i=1}^{N} \alpha_i y_i x_i$$

$$\sum_{i=1}^{N} \alpha_i y_i = 0$$

Substituting these two equations into the Lagrangian objective function and eliminating $w$ and $b$ gives

$$\begin{aligned} L(w, b, \alpha) &= \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i y_i \left( \left( \sum_{j=1}^{N} \alpha_j y_j x_j \right) \cdot x_i + b \right) + \sum_{i=1}^{N} \alpha_i \\ &= -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N} \alpha_i \end{aligned}$$

that is,

$$\min_{w, b} L(w, b, \alpha) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N} \alpha_i$$

Maximizing $\min_{w, b} L(w, b, \alpha)$ with respect to $\alpha$ is the dual problem:

$$\max_{\alpha} \;\; -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N} \alpha_i$$

$$\text{s.t.} \;\; \sum_{i=1}^{N} \alpha_i y_i = 0$$

$$\alpha_i \ge 0, \quad i = 1, 2, \ldots, N$$

Adding a minus sign to the objective converts this into an equivalent minimization problem:

$$\min_{\alpha} \;\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i$$

$$\text{s.t.} \;\; \sum_{i=1}^{N} \alpha_i y_i = 0$$

$$\alpha_i \ge 0, \quad i = 1, 2, \ldots, N$$
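
Before moving on, a small worked instance of this dual problem (the classic three-point example used in reference [1], not data taken from the text above): take positive points $x_1 = (3, 3)^T$, $x_2 = (4, 3)^T$ and negative point $x_3 = (1, 1)^T$, so $y_1 = y_2 = +1$, $y_3 = -1$. The dual objective becomes

$$\frac{1}{2} \left( 18\alpha_1^2 + 25\alpha_2^2 + 2\alpha_3^2 + 42\alpha_1\alpha_2 - 12\alpha_1\alpha_3 - 14\alpha_2\alpha_3 \right) - \alpha_1 - \alpha_2 - \alpha_3$$

subject to $\alpha_1 + \alpha_2 - \alpha_3 = 0$ and $\alpha_i \ge 0$. Eliminating $\alpha_3 = \alpha_1 + \alpha_2$ and minimizing shows the minimum is attained on the boundary $\alpha_2 = 0$, at $\alpha_1^* = \alpha_3^* = \frac{1}{4}$, $\alpha_2^* = 0$. Using the formulas for $w^*$ and $b^*$ derived below, this gives $w^* = \left( \frac{1}{2}, \frac{1}{2} \right)^T$ and $b^* = -2$, i.e. the separating hyperplane $\frac{1}{2} x^{(1)} + \frac{1}{2} x^{(2)} - 2 = 0$, with $x_1$ and $x_3$ as the support vectors.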

Our optimization problem has now been reduced to the form above. For this problem there is an efficient optimization algorithm, the sequential minimal optimization (SMO) algorithm. The details of solving this optimization problem with SMO are not expanded here; a detailed derivation is left to the next article.

Through such an optimization algorithm we can obtain $\alpha^*$, and from $\alpha^*$ we can then solve for $w^*$ and $b^*$, which achieves our original goal: finding the maximum-margin hyperplane, i.e. the "decision plane".

The preceding derivation assumed that the KKT conditions hold; the KKT conditions are as follows:

$$\begin{cases} \alpha_i^* \ge 0 \\ y_i (w^* \cdot x_i + b^*) - 1 \ge 0 \\ \alpha_i^* \left( y_i (w^* \cdot x_i + b^*) - 1 \right) = 0 \end{cases}$$

In addition, from the previous derivation the following two equations hold:

$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i$$

$$\sum_{i=1}^{N} \alpha_i^* y_i = 0$$

It can be seen that among the $\alpha_i^*$ there is at least one $\alpha_j^* > 0$ (by contradiction: if all of them were zero, then $w^* = 0$, a contradiction), and for this $j$ we have

$$y_j (w^* \cdot x_j + b^*) - 1 = 0$$

So we can obtain

$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i$$

$$b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x_j)$$

For any training sample $(x_i, y_i)$ we always have either $\alpha_i^* = 0$ or $y_i (w^* \cdot x_i + b^*) = 1$. If $\alpha_i^* = 0$, the sample does not appear in the final expression for the model parameters. If $\alpha_i^* > 0$, then necessarily $y_i (w^* \cdot x_i + b^*) = 1$, so the corresponding sample point lies exactly on the maximum-margin boundary: it is a support vector. This reflects an important property of the SVM: after training is complete, most training samples do not need to be kept, and the final model depends only on the support vectors.
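
The hard-margin dual can also be handed to a generic quadratic-programming solver instead of SMO. Below is a minimal sketch (assuming NumPy and cvxopt are installed; the three-point data set is the same toy example as the worked instance above, so the recovered values should come out close to $w^* = (0.5, 0.5)$ and $b^* = -2$):

import numpy as np
from cvxopt import matrix, solvers

solvers.options["show_progress"] = False

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])
N = len(y)

# Dual problem:  min 1/2 a^T P a - 1^T a   s.t.  y^T a = 0,  a_i >= 0
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))            # encodes -a_i <= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)
alpha = np.array(sol["x"]).ravel()

# Recover w* and b* from the multipliers (support vectors have alpha_i > 0).
w = ((alpha * y)[:, None] * X).sum(axis=0)
j = int(np.argmax(alpha > 1e-6))  # index of one support vector
b_star = y[j] - ((alpha * y) * (X @ X[j])).sum()
print(alpha, w, b_star)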

The derivation so far assumed that the training data are linearly separable, but in practice data that are exactly linearly separable almost never occur. To deal with this, the concept of the "soft margin" is introduced, i.e. some points are allowed to violate the constraint

$$y_i (w \cdot x_i + b) \ge 1$$

Using the hinge loss, the original optimization problem is rewritten as

$$\min_{w, b, \xi_i} \;\; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

$$\text{s.t.} \;\; y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad i = 1, 2, \ldots, N$$

$$\xi_i \ge 0, \quad i = 1, 2, \ldots, N$$

where $\xi_i$ is a "slack variable", $\xi_i = \max(0,\, 1 - y_i (w \cdot x_i + b))$, i.e. the hinge loss function. Each sample has a corresponding slack variable, which characterizes the degree to which that sample violates the constraint. $C$ is called the penalty parameter; the larger its value, the heavier the penalty on misclassification. Consistent with the approach in the linearly separable case, the Lagrangian is built with Lagrange multipliers and its dual problem is then derived.
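
To make the slack variables concrete, a short NumPy check (the hyperplane and the points are made-up values; the last point deliberately violates the margin):

import numpy as np

w, b = np.array([0.5, 0.5]), -2.0
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [2.0, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Slack variable / hinge loss for each sample: xi_i = max(0, 1 - y_i (w·x_i + b))
xi = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(xi)   # zero for samples that satisfy the margin constraint

C = 1.0
objective = 0.5 * w @ w + C * xi.sum()   # soft-margin objective value at this (w, b)
print(objective)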

Based on the above discussion, the linear support vector machine learning algorithm is as follows:

Input: training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^n$, $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, N$;

Output: separating hyperplane and classification decision function

(1) Choose the penalty parameter $C > 0$, construct and solve the convex quadratic programming problem

$$\min_{\alpha} \;\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i$$

$$\text{s.t.} \;\; \sum_{i=1}^{N} \alpha_i y_i = 0$$

$$0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, N$$

and obtain the optimal solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_N^*)^T$.

(2) Compute

$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i$$

Choose a component $\alpha_j^*$ of $\alpha^*$ that satisfies the condition $0 < \alpha_j^* < C$ and compute

$$b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x_j)$$

(3) The separating hyperplane obtained is

$$w^* \cdot x + b^* = 0$$

and the classification decision function is:

$$f(x) = \operatorname{sign} (w^* \cdot x + b^*)$$
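
In practice this whole procedure is carried out by library solvers. A hedged sketch with scikit-learn (assumed installed; the random data are made up), where coef_, intercept_ and support_vectors_ play the roles of $w^*$, $b^*$ and the support vectors:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1.0)   # C is the penalty parameter of the soft margin
clf.fit(X, y)

w_star, b_star = clf.coef_[0], clf.intercept_[0]
print(w_star, b_star)
print(len(clf.support_vectors_), "support vectors out of", len(X))

# Classification decision function f(x) = sign(w*·x + b*)
x_new = np.array([[0.5, 0.5]])
print(np.sign(x_new @ w_star + b_star), clf.predict(x_new))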

 

Nonlinear SVM algorithm principle

For classification problems that are nonlinear in the input space, a nonlinear transformation can be used to convert them into a linear classification problem in some high-dimensional feature space, and a linear support vector machine is then learned in that feature space. Because in the dual problem of linear SVM learning both the objective function and the classification decision function involve only inner products between instances, it is not necessary to specify the nonlinear transformation explicitly; instead, the inner product is replaced by a kernel function. A kernel function gives the inner product between two instances after a nonlinear transformation. Specifically, $K(x, z)$ is a kernel function, or positive definite kernel, if there exists a mapping $\phi(x)$ from the input space to the feature space such that for any $x, z$ in the input space,

$$K(x, z) = \phi(x) \cdot \phi(z)$$
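
A tiny numeric check of this definition (illustrative only; the kernel $K(x, z) = (x \cdot z)^2$ and its feature map are standard textbook choices rather than anything stated above): for $x, z \in \mathbb{R}^2$, the map $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ satisfies $K(x, z) = \phi(x) \cdot \phi(z)$.

import numpy as np

def phi(x):
    # Explicit feature map for the kernel K(x, z) = (x·z)^2 in two dimensions.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print((x @ z) ** 2)      # kernel value computed in the input space
print(phi(x) @ phi(z))   # the same value via the explicit feature map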

Replacing the inner product in the dual problem of linear support vector machine learning with a kernel function $K(x, z)$, the solution obtained is a nonlinear support vector machine

$$f(x) = \operatorname{sign} \left( \sum_{i=1}^{N} \alpha_i^* y_i K(x, x_i) + b^* \right)$$

Based on the above discussion, the nonlinear support vector machine learning algorithm is as follows:

Input: training data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i \in \mathbb{R}^n$, $y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, N$;

Output: classification decision function

(1) Choose an appropriate kernel function $K(x, z)$ and penalty parameter $C > 0$, construct and solve the convex quadratic programming problem

$$\min_{\alpha} \;\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{N} \alpha_i$$

$$\text{s.t.} \;\; \sum_{i=1}^{N} \alpha_i y_i = 0$$

$$0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, N$$

and obtain the optimal solution $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_N^*)^T$.

(2) Choose a component $\alpha_j^*$ of $\alpha^*$ that satisfies the condition $0 < \alpha_j^* < C$ and compute

$$b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i K(x_i, x_j)$$

(3) classification decision function:

$$f(x) = \operatorname{sign} \left( \sum_{i=1}^{N} \alpha_i^* y_i K(x, x_i) + b^* \right)$$
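
A hedged end-to-end sketch of this algorithm with scikit-learn's SVC (assumed installed), which solves the kernelized dual above; the concentric-circles data set is a made-up example of a problem that is not linearly separable in the input space:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the input space.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma=1.0)   # Gaussian kernel with penalty parameter C
clf.fit(X, y)

print(clf.score(X, y))             # training accuracy
print(len(clf.support_vectors_))   # number of support vectors retained by the model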

 

A commonly used kernel function: the Gaussian kernel

$$K(x, z) = \exp \left( -\frac{\|x - z\|^2}{2 \sigma^2} \right)$$

The corresponding SVM is a Gaussian radial basis function classifier; in this case the classification decision function is

$$f(x) = \operatorname{sign} \left( \sum_{i=1}^{N} \alpha_i^* y_i \exp \left( -\frac{\|x - x_i\|^2}{2 \sigma^2} \right) + b^* \right)$$
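
A small sketch of evaluating the Gaussian kernel with NumPy (illustrative values only). Note that many libraries, scikit-learn included, parameterize this kernel as $\exp(-\gamma \|x - z\|^2)$, so $\gamma$ corresponds to $1 / (2\sigma^2)$.

import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(gaussian_kernel(x, z, sigma=1.0))   # exp(-5 / 2), roughly 0.082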

 

References

[1] Li Hang, Statistical Learning Methods

[2] Zhou Zhihua, Machine Learning

[3] Python 3 "Machine Learning in Action" study notes (VIII): hand-deriving the linear SVM, principles of SVM, Jack-Cui

[4] An in-depth understanding of the Lagrange multiplier method and the KKT conditions

[5] A popular introduction to support vector machines (understanding SVM at three levels)

[6] Support Vector Machines for Classification
