[Statistical Learning|Book Reading] Chapter 7 Support Vector Machine p95-p133

train of thought

The support vector machine (SVM) is a binary classification model. Its basic form is a linear classifier defined by the maximum-margin separating hyperplane in the feature space; the maximum margin is what distinguishes it from the perceptron. Support vector machines also include the kernel trick, which makes them, in effect, nonlinear classifiers. The learning strategy of the support vector machine is margin maximization, which can be formalized as solving a convex quadratic programming problem and is also equivalent to minimizing a regularized hinge loss function. The learning algorithm of the support vector machine is therefore an optimization algorithm for convex quadratic programming.


Linearly separable data: hard-margin support vector machine
Approximately linearly separable data: soft-margin support vector machine
Not linearly separable data: kernel trick + soft-margin support vector machine
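A quick way to see these three regimes in practice is with scikit-learn (a hedged sketch of my own, not something discussed in the book): a very large C approximates the hard margin, a moderate C gives the soft margin, and a nonlinear kernel adds the kernel trick. The data here are made up purely to show the three configurations.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up dataset, only to show the three configurations.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([1, 1, -1, -1])

hard_margin = SVC(kernel="linear", C=1e6).fit(X, y)   # near hard margin: slack heavily penalized
soft_margin = SVC(kernel="linear", C=1.0).fit(X, y)   # soft margin: slack variables allowed
kernel_svm  = SVC(kernel="rbf", C=1.0).fit(X, y)      # kernel trick + soft margin
print(hard_margin.support_vectors_)                   # the support vectors found for the hard-margin case
```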

When the input space is a Euclidean space or a discrete set and the feature space is a Hilbert space, the kernel function represents the inner product between the feature vectors obtained by mapping inputs from the input space to the feature space. A nonlinear support vector machine can be learned by using a kernel function, which is equivalent to implicitly learning a linear support vector machine in a high-dimensional feature space; this approach is called the kernel trick.
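A tiny numerical check (my own illustration, not from the book) of what the kernel trick buys: for the polynomial kernel $K(x,z)=(x \cdot z)^2$ in two dimensions, the explicit feature map $\phi(x)=(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ gives the same inner product, but the kernel never has to form $\phi$ explicitly.

```python
import numpy as np

def poly_kernel(x, z):
    """K(x, z) = (x · z)^2, computed without any explicit feature map."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map for this kernel in 2D: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(poly_kernel(x, z))          # 16.0
print(np.dot(phi(x), phi(z)))     # 16.0, same value via the high-dimensional inner product
```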

A fast learning algorithm for support vector machines with kernel functions is the sequential minimal optimization (SMO) algorithm.

Linearly Separable Support Vector Machines

The simplest case of the support vector machine is the linearly separable support vector machine, or hard-margin support vector machine. It is constructed under the condition that the training data are linearly separable, and its learning strategy is the maximum margin method. It can be expressed as a convex quadratic programming problem, whose primal optimization problem is:

$$\min_{w,b} \quad \frac{1}{2} \|w\|^2$$

$$s.t. \quad y_i(w \cdot x_i + b) - 1 \ge 0, \quad i=1,2,...,N$$

Solving this optimization problem gives $w^*$, $b^*$, and hence the linearly separable support vector machine: the separating hyperplane is

$$w^* \cdot x + b^* = 0$$

and the classification decision function is

$$f(x) = \mathrm{sign}(w^* \cdot x + b^*)$$
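As an illustration (not from the book), here is a minimal sketch of solving the hard-margin primal problem directly with the cvxpy library on a tiny linearly separable toy dataset; the data and all names are chosen just for the example.

```python
import cvxpy as cp
import numpy as np

# Tiny linearly separable toy data for illustration.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Primal problem: minimize (1/2)||w||^2 subject to y_i(w·x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, "b* =", b.value)
# Decision function: f(x) = sign(w*·x + b*)
```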

In the maximum margin method, the functional margin and the geometric margin are important concepts.

The optimal solution of the linearly separable support vector machine exists and is unique. The instance points lying on the margin boundaries are the support vectors, and the optimal separating hyperplane is completely determined by the support vectors.

The dual problem of this quadratic programming problem is:

$$\min_{\alpha} \quad \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i$$

$$s.t. \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$

$$\alpha_i \ge 0, \quad i=1,2,...,N$$

Usually, the linearly separable support vector machine is learned by solving the dual problem: first solve the dual problem for its optimal solution $\alpha^*$, then compute the optimal $w^*$ and $b^*$, and obtain the separating hyperplane and the classification decision function.
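The recovery step referred to above uses the standard formulas: choosing any index $j$ with $\alpha_j^* > 0$,

$$w^* = \sum_{i=1}^{N} \alpha_i^* y_i x_i, \qquad b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i (x_i \cdot x_j).$$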

Linear Support Vector Machines

In reality, training data are rarely exactly linearly separable; more often they are approximately linearly separable. In this case a linear support vector machine, or soft-margin support vector machine, is used. The linear support vector machine is the most basic form of support vector machine.
To handle noise or outliers, slack variables $\xi_i$ are introduced so that the data become "separable", giving the convex quadratic programming problem of linear support vector machine learning. The primal optimization problem is:

$$\min_{w,b,\xi} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i$$

$$s.t. \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad i=1,2,...,N$$

$$\xi_i \ge 0, \quad i=1,2,...,N$$

Solving this optimization problem gives $w^*$, $b^*$, and hence the linear support vector machine: the separating hyperplane is

$$w^* \cdot x + b^* = 0$$

and the classification decision function is

$$f(x) = \mathrm{sign}(w^* \cdot x + b^*)$$
The solution $w^*$ of the linear support vector machine is unique, but $b^*$ is not necessarily unique.
The dual problem is:

$$\min_{\alpha} \quad \frac{1}{2} \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) - \sum_{i=1}^{N} \alpha_i$$

$$s.t. \quad \sum_{i=1}^{N} \alpha_i y_i = 0$$

$$0 \le \alpha_i \le C, \quad i=1,2,...,N$$

The dual learning method of the linear support vector machine: first solve the dual problem for its optimal solution $\alpha^*$, then find the optimal solution $w^*$, $b^*$ of the primal problem, and obtain the separating hyperplane and the classification decision function.
In the solution $\alpha^*$ of the dual problem, the instance points $x_i$ with $\alpha_i^* > 0$ are called support vectors. A support vector can lie on the margin boundary, between the margin boundary and the separating hyperplane, or on the misclassified side of the separating hyperplane. The optimal separating hyperplane is fully determined by the support vectors.
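Following the book's discussion of soft-margin support vectors, the position of a support vector $x_i$ can be read off from $\alpha_i^*$ and the slack $\xi_i$ (a summary sketch):

$$\begin{aligned}
&\alpha_i^* < C \;\Rightarrow\; \xi_i = 0: && x_i \text{ lies exactly on the margin boundary};\\
&\alpha_i^* = C,\; 0 < \xi_i < 1: && x_i \text{ lies between the margin boundary and the separating hyperplane};\\
&\alpha_i^* = C,\; \xi_i = 1: && x_i \text{ lies on the separating hyperplane};\\
&\alpha_i^* = C,\; \xi_i > 1: && x_i \text{ lies on the misclassified side}.
\end{aligned}$$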
Linear support vector machine learning is also equivalent to minimizing the $L_2$-norm regularized hinge loss:

$$\sum_{i=1}^{N} [1 - y_i(w \cdot x_i + b)]_+ + \lambda \|w\|^2$$
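To make this equivalence concrete, here is a minimal sketch (my own, not from the book) of minimizing the regularized hinge loss by subgradient descent with NumPy; the learning rate, step count, and toy data are arbitrary choices for the example.

```python
import numpy as np

def hinge_svm(X, y, lam=0.01, lr=0.01, steps=1000):
    """Minimize sum_i [1 - y_i(w·x_i + b)]_+ + lam*||w||^2 by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1                 # points currently inside the margin or misclassified
        # Subgradient of the objective with respect to w and b.
        grad_w = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage with made-up data.
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.0, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hinge_svm(X, y)
print(np.sign(X @ w + b))  # predicted labels on the training points
```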

Nonlinear Support Vector Machines

A nonlinear problem in the input space can be transformed into a linear classification problem in a high-dimensional feature space via a nonlinear transformation, and a linear support vector machine can then be learned in that feature space. Since both the objective function of the dual problem of the linear support vector machine and the classification decision function involve only inner products between instances, there is no need to specify the nonlinear transformation explicitly; instead, the inner product is replaced by a kernel function.

The kernel function represents the inner product between two instances after a nonlinear transformation. Specifically, $K(x,z)$ being a kernel function, or positive definite kernel, means that there exists a mapping $\phi$ from the input space $\mathcal{X}$ to a feature space $\mathcal{H}$ such that $K(x,z) = \phi(x) \cdot \phi(z)$.

Therefore, in the dual problem of linear support vector machine learning, replacing the inner product with the kernel function $K(x,z)$ yields the nonlinear support vector machine

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{N} \alpha_i^* y_i K(x, x_i) + b^*\right)$$
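As a small illustration (a sketch under my own assumptions, not the book's code), the decision function above can be evaluated directly once $\alpha^*$, $b^*$, and a kernel are available; here a Gaussian (RBF) kernel is used and the dual variables are simply taken as given.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision(x, X_sv, y_sv, alpha_sv, b, kernel=rbf_kernel):
    """f(x) = sign(sum_i alpha_i* y_i K(x, x_i) + b*), summed over support vectors."""
    s = sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha_sv, y_sv, X_sv))
    return np.sign(s + b)

# Hypothetical support vectors and dual variables, just to show the call shape.
X_sv = np.array([[3.0, 3.0], [1.0, 1.0]])
y_sv = np.array([1.0, -1.0])
alpha_sv = np.array([0.25, 0.25])
b = 0.0
print(decision(np.array([2.5, 2.8]), X_sv, y_sv, alpha_sv, b))
```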

SMO algorithm

The sequential minimal optimization (SMO) algorithm is a fast algorithm for support vector machine learning. Its key idea is to repeatedly decompose the original quadratic programming problem into quadratic programming subproblems with only two variables and solve each subproblem analytically, until all variables satisfy the KKT conditions. In this way, the optimal solution of the original quadratic programming problem is obtained heuristically. Because each subproblem has an analytical solution, every subproblem is solved very quickly; although many subproblems must be solved, the algorithm remains efficient overall.
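The two-variable analytic update at the heart of SMO can be sketched as follows (a simplified illustration, not a full SMO implementation: the heuristic choice of the variable pair, the update of $b$, and the KKT stopping test are omitted; `K` is assumed to be a precomputed kernel matrix and `errors[k]` the prediction error $E_k = f(x_k) - y_k$).

```python
import numpy as np

def smo_two_variable_update(i, j, alpha, y, K, errors, C):
    """One analytic SMO step on the pair (alpha_i, alpha_j); returns an updated copy of alpha."""
    alpha = alpha.copy()
    # Bounds L, H keep the pair feasible for 0 <= alpha <= C and sum(alpha * y) = 0.
    if y[i] != y[j]:
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # curvature along the equality constraint
    if eta <= 0 or L == H:
        return alpha                           # skip degenerate pairs in this sketch
    # Unconstrained optimum for alpha_j, then clip to [L, H].
    alpha_j_new = np.clip(alpha[j] + y[j] * (errors[i] - errors[j]) / eta, L, H)
    # alpha_i moves so that sum(alpha * y) = 0 is preserved.
    alpha_i_new = alpha[i] + y[i] * y[j] * (alpha[j] - alpha_j_new)
    alpha[i], alpha[j] = alpha_i_new, alpha_j_new
    return alpha
```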


Origin blog.csdn.net/m0_52427832/article/details/127087982