Machine Learning Techniques 2: Dual Support Vector Machine

Note:
All figures in this article are from the "Machine Learning Techniques" course by Hsuan-Tien Lin at National Taiwan University.
Notes by: Red Stone
WeChat official account: AI Youdao

The previous lesson introduced the linear support vector machine (Linear Support Vector Machine). The goal of linear SVM is to find the "fattest" separating line between the positive and negative classes, and the method is to obtain that line by solving a quadratic programming (QP) problem. This lesson starts from another angle and studies the dual support vector machine (Dual Support Vector Machine), computing the separating line from a new perspective in order to extend the applicability of SVM.

1. Motivation of Dual SVM

First, recall that with a nonlinear SVM we usually apply a nonlinear transform to map the variables from the \(x\) domain to the \(z\) domain; in the \(z\) domain, based on the previous lesson, the problem can then be solved with the linear SVM. As we said last time, SVM achieves a large margin, which reduces the effective VC dimension and limits model complexity; on the other hand, the feature transform allows a more complex model, which reduces \(E_{in}\). The purpose of nonlinear SVM is to combine these two effects and strike a balance between them. However, after the feature transform, the QP problem in the \(z\) domain has \(\hat{d}+1\) variables. The more complex the model, the larger \(\hat{d}+1\) becomes, and the corresponding QP problem becomes very hard to solve. When \(\hat{d}\) is infinite, the problem looks intractable. Is there a way around this? One approach is to make the SVM solving procedure independent of \(\hat{d}\), which is the topic of this lesson.

By comparison, the original SVM quadratic program from the previous lesson has \(\hat{d}+1\) variables and \(N\) constraints; in this lesson we convert it into the dual problem (an 'equivalent' SVM), which is also a quadratic program, but with \(N\) variables and \(N+1\) constraints. The benefit of the dual is that its size depends only on \(N\) and not on \(\hat{d}\), so the difficulty described above for infinite \(\hat{d}\) no longer arises.

Converting the problem into its dual (the 'equivalent' SVM) involves a rather complex mathematical derivation. We will not go through the rigorous argument here, but instead derive it from simple concepts and principles.

In "Machine Learning cornerstone" Regularization course described in, minimizing \ (E_ {in} \) added during the restrictions: \ (Tw W ^ \ Leq C \) . Our method is to introduce solving Lagrangian factor \ (\ the lambda \) , converting the minimization problem for conditional and unconditional minimization problem: \ [min \-Aug of E_ {} (W) in {} of E_ = ( W) + \ FRAC {\ the lambda} ^ {N} Tw W \] , resulting \ (W \) optimal to resolve: \ [\ nabla of E_ {} in (W) + \ FRAC {2 \ the lambda } {N} w = 0 \
] Therefore, in the regularization problem, \ (\ the lambda \) is a known constant, the solution process is easy. So, for dual SVM problem, the same can be introduced \ (\ the lambda \) , non-converted Condition Condition, but \ (\ the lambda \) are unknown parameters, and the number is \ (N \) , the need for it solved.

How do we convert the constrained problem into an unconstrained one? From the previous lesson, the SVM goal is \(\min\ \frac{1}{2}w^Tw\), subject to \(y_n(w^Tz_n+b)\geq 1,\ \text{for } n=1,2,\cdots,N\). First, we introduce Lagrange multipliers \(\alpha_n\) (written \(\alpha\) to distinguish them from the \(\lambda\) used in regularization) and construct the function:
\[L(b,w,\alpha)=\frac{1}{2}w^Tw+\sum_{n=1}^N\alpha_n\big(1-y_n(w^Tz_n+b)\big)\] The first term on the right is the SVM objective, and the second term is the product of the SVM constraints and the Lagrange multipliers \(\alpha_n\). We call this the Lagrange function; it contains three sets of parameters: \(b, w, \alpha_n\).

Using the Lagrange function, the constrained SVM problem can be rewritten as an unconstrained one:
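Written out, this min-max form is:
\[SVM\ \equiv\ \min_{b,w}\Big(\max_{\text{all }\alpha_n\geq 0}\ \frac{1}{2}w^Tw+\sum_{n=1}^N\alpha_n\big(1-y_n(w^Tz_n+b)\big)\Big)\]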

This minimization now contains a maximization inside it; how should we interpret that? First, we require the Lagrange multipliers to satisfy \(\alpha_n\geq 0\). By the SVM feasibility condition, \((1-y_n(w^Tz_n+b))\leq 0\). If some \((b,w)\) violates a constraint, i.e. \((1-y_n(w^Tz_n+b))>0\) for some \(n\), then because \(\alpha_n\geq 0\) the term \(\sum_n\alpha_n(1-y_n(w^Tz_n+b))\) can be made arbitrarily large, so the inner maximum is unbounded and such \((b,w)\) will never be chosen by the outer minimization. If instead all points satisfy \((1-y_n(w^Tz_n+b))\leq 0\), then \(\sum_n\alpha_n(1-y_n(w^Tz_n+b))\leq 0\), and the inner maximum is attained when \(\sum_n\alpha_n(1-y_n(w^Tz_n+b))=0\); its value is then exactly the SVM objective \(\frac{1}{2}w^Tw\). Therefore this construction does turn the constrained SVM into an equivalent unconstrained form.

2. Lagrange Dual SVM

Now we have converted the SVM problem into a min-max form involving the Lagrange multipliers \(\alpha_n\). Since \(\alpha_n\geq 0\), for any fixed \(\alpha'\) with \(\alpha_n'\geq 0\) the following inequality must hold:
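In symbols:
\[\min_{b,w}\Big(\max_{\text{all }\alpha_n\geq 0}L(b,w,\alpha)\Big)\ \geq\ \min_{b,w}L(b,w,\alpha')\]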

Because this holds for every fixed \(\alpha'\), it still holds after maximizing the right-hand side over \(\alpha'\):
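Taking that maximum:
\[\min_{b,w}\Big(\max_{\text{all }\alpha_n\geq 0}L(b,w,\alpha)\Big)\ \geq\ \max_{\text{all }\alpha_n'\geq 0}\Big(\min_{b,w}L(b,w,\alpha')\Big)\]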

The inequality above describes what happens when we swap min and max in the SVM problem; the right-hand side is called the Lagrange dual problem. It is a lower bound on the SVM problem, and our next goal is to work with this lower bound.
The relation \(\geq\) by itself is only weak duality. For QP (quadratic programming) problems, if the following three conditions hold:

  • the objective function is convex (convex primal)
  • the problem is feasible, i.e. a solution exists (feasible primal)
  • the constraints are linear (linear constraints)

then the relation becomes strong duality: \(\geq\) becomes \(=\). That is, there must exist a solution \((b,w,\alpha)\) satisfying the conditions at which both sides are attained, and the SVM solution can therefore be obtained from the right-hand form.
After this derivation, the solution of the dual SVM problem has been converted into a form with no constraints on \(b\) and \(w\):
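Concretely:
\[\max_{\text{all }\alpha_n\geq 0}\ \Big(\min_{b,w}\ \frac{1}{2}w^Tw+\sum_{n=1}^N\alpha_n\big(1-y_n(w^Tz_n+b)\big)\Big)\]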

The expression inside the parentheses above is the minimization of the Lagrange function \(L(b,w,\alpha)\). Following the usual first-order idea (as in gradient descent, the gradient is zero at a minimum), we first set the partial derivative of \(L(b,w,\alpha)\) with respect to \(b\) to zero:
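That is:
\[\frac{\partial L(b,w,\alpha)}{\partial b}=-\sum_{n=1}^N\alpha_ny_n=0\]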

In other words, the optimal solution must satisfy \(\sum_{n=1}^N\alpha_ny_n=0\). We can therefore substitute this condition into the max problem (as a constraint alongside \(\alpha_n\geq 0\)) and simplify:
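With the \(b\) term dropped, the problem reads:
\[\max_{\text{all }\alpha_n\geq 0,\ \sum_{n} y_n\alpha_n=0}\ \Big(\min_{w}\ \frac{1}{2}w^Tw+\sum_{n=1}^N\alpha_n-\sum_{n=1}^N\alpha_ny_nw^Tz_n\Big)\]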

This eliminates \(b\) from the SVM expression and simplifies the problem somewhat. Next, again using the fact that the gradient vanishes at the minimum, set the gradient of \(L(b,w,\alpha)\) with respect to \(w\) to zero:
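Namely:
\[\nabla_w L(b,w,\alpha)=w-\sum_{n=1}^N\alpha_ny_nz_n=0\]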

This gives \[w=\sum_{n=1}^{N}\alpha_n y_n z_n\] In other words, the optimal solution must satisfy \(w=\sum_{n=1}^N\alpha_ny_nz_n\). Substituting this condition in as well and simplifying:
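Substituting \(w\) back in, the objective simplifies to:
\[\max_{\text{all }\alpha_n\geq 0,\ \sum_n y_n\alpha_n=0,\ w=\sum_n\alpha_ny_nz_n}\ -\frac{1}{2}\Big\|\sum_{n=1}^N\alpha_ny_nz_n\Big\|^2+\sum_{n=1}^N\alpha_n\]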

This eliminates \(w\) from the SVM expression and simplifies the problem further. At this point there are three conditions:

  • all \(\alpha_n \geq 0\)
  • \(\sum_{n=1}^N \alpha_n y_n=0\)
  • \(w=\sum_{n=1}^{N}\alpha_n y_n z_n\)

The SVM thus reduces to an optimization problem over \(\alpha_n\) alone: find the \(\alpha_n\) that maximizes \(-\frac{1}{2}\|\sum_{n=1}^N\alpha_ny_nz_n\|^2+\sum_{n=1}^N\alpha_n\) subject to the three conditions above.
To summarize, the SVM optimization has been converted into a problem that depends only on \(\alpha_n\), namely the maximization above under the listed conditions.

The conditions that the optimal solution must satisfy are known as the Karush-Kuhn-Tucker (KKT) conditions:
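Spelled out, they are:

  • primal feasibility: \(y_n(w^Tz_n+b)\geq 1\)
  • dual feasibility: \(\alpha_n\geq 0\)
  • dual-inner optimality: \(\sum_{n=1}^N y_n\alpha_n=0,\ \ w=\sum_{n=1}^N\alpha_ny_nz_n\)
  • primal-inner optimality (complementary slackness): \(\alpha_n\big(1-y_n(w^Tz_n+b)\big)=0\)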

In the next part, we will use the KKT conditions to compute the optimal \(\alpha\), and from it obtain \(b\) and \(w\).

3. Solving Dual SVM

We now have the simplified version of the dual SVM; next we polish it a bit further. First, convert the max problem into a min problem, then rearrange the conditions and derive:
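The result is:
\[\min_{\alpha}\ \frac{1}{2}\sum_{n=1}^N\sum_{m=1}^N\alpha_n\alpha_my_ny_mz_n^Tz_m-\sum_{n=1}^N\alpha_n\]
\[\text{subject to}\quad \sum_{n=1}^N y_n\alpha_n=0;\qquad \alpha_n\geq 0,\ \ n=1,2,\cdots,N\]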

Clearly this is a convex QP problem with \(N\) variables \(\alpha_n\) and \(N+1\) constraints. Following the QP recipe from the previous lesson, we only need to identify the corresponding values of \(Q, p, A, c\) and solve it with a software package.
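As a sketch of this mapping (the exact encoding depends on the solver): \(Q_D\) has entries \(q_{n,m}=y_ny_mz_n^Tz_m\); the linear term is \(p=-1_N\), a vector of \(N\) entries equal to \(-1\); the constraints are the single equality \(\sum_{n=1}^N y_n\alpha_n=0\) plus the \(N\) inequalities \(\alpha_n\geq 0\), which together give the \(N+1\) constraints mentioned above.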

The solving procedure is straightforward, but note that \(q_{n,m}=y_ny_mz^T_nz_m\) is mostly nonzero, i.e. the matrix is dense. When \(N\) is large, say \(N=30000\), computing and storing the corresponding \(Q_D\) becomes very expensive. So in practice, special methods are needed to handle the matrix \(Q_D\) of the dual SVM problem; we will not go into those details here.

Once the \(\alpha_n\) are obtained, \(w\) and \(b\) can be computed from the earlier KKT conditions. First use the condition \(w=\sum\alpha_ny_nz_n\) to get \(w\); then use the complementary slackness condition \(\alpha_n(1-y_n(w^Tz_n+b))=0\): pick any point with \(\alpha_n\neq 0\), i.e. \(\alpha_n>0\), so that \(1-y_n(w^Tz_n+b)=0\), which gives \(b=y_n-w^Tz_n\).
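As an illustration only (not part of the original lecture), here is a minimal Python sketch of these two steps using the general-purpose cvxopt QP solver; the helper name dual_svm_fit, the data layout, and the choice of cvxopt are assumptions made for this example:

```python
import numpy as np
from cvxopt import matrix, solvers

def dual_svm_fit(Z, y):
    """Hard-margin dual SVM sketch. Z: (N, d_hat) transformed inputs; y: (N,) labels in {-1, +1}."""
    N = Z.shape[0]
    # q_{n,m} = y_n y_m z_n^T z_m  (dense N x N matrix)
    Q_D = (np.outer(y, y) * (Z @ Z.T)).astype(float)
    P = matrix(Q_D)
    q = matrix(-np.ones(N))                      # linear term: -1^T alpha
    G = matrix(-np.eye(N))                       # -alpha_n <= 0, i.e. alpha_n >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))   # equality: sum_n y_n alpha_n = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

    # Recover (w, b) from the KKT conditions.
    w = (alpha * y) @ Z                          # w = sum_n alpha_n y_n z_n
    sv = int(np.argmax(alpha))                   # an index with alpha_n > 0 (a support vector)
    b_opt = y[sv] - w @ Z[sv]                    # b = y_s - w^T z_s
    return alpha, w, b_opt
```

Note that this sketch builds the dense \(N\times N\) matrix \(Q_D\) explicitly, so it runs into exactly the storage and computation issue described above when \(N\) is large.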

It is worth noting that when computing \(b\), any point with \(\alpha_n>0\) satisfies \(y_n(w^Tz_n+b)=1\). The equation \(y_n(w^Tz_n+b)=1\) says precisely that the point lies on the SVM boundary, i.e. on the fat boundary. In other words, points with \(\alpha_n>0\) must lie on the fat boundary; such points are the Support Vectors. This is a very interesting property.

4. Messages behind Dual SVM

Recall that in the previous lesson we called the points lying on the margin boundary support vector candidates. Earlier in this lesson we showed that points with \(\alpha_n>0\) must lie on that boundary; these points are called support vectors (note: no longer just candidates). In other words, not every point on the boundary is necessarily a support vector, but every point with \(\alpha_n>0\) certainly is.

The SVs are exactly the points with \(\alpha_n>0\). From the formulas for \(w\) and \(b\) derived in the previous part, we see that \(w\) and \(b\) are determined only by the SVs, i.e. the points with \(\alpha_n>0\), which reduces the computation. This matches what we saw in the previous lesson: the separating line is determined only by the points on the "fat" boundary. In other words, the sample points fall into two classes: the support vectors, from which the fattest hyperplane can be computed, and the non-support-vectors, which have no influence on the fattest hyperplane we obtain.

Looking back, let us compare the formulas for \(w\) in SVM and in PLA:
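For reference, the two formulas being compared (assuming PLA starts from \(w=0\), so that \(\beta_n\) counts how many times point \(n\) was used in a mistake correction) are:
\[w_{SVM}=\sum_{n=1}^{N}\alpha_n y_n z_n \qquad\qquad w_{PLA}=\sum_{n=1}^{N}\beta_n y_n z_n\]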

The two are similar in form. \(w_{SVM}\) is determined by the SVs on the boundary of the fattest hyperplane, while \(w_{PLA}\) is determined by all the points that were misclassified during training. Both \(w_{SVM}\) and \(w_{PLA}\) are linear combinations of the data points \(y_nz_n\); in that sense they are representatives of the original data.

To summarize, this lesson and the previous one introduced two forms of SVM: the Primal Hard-Margin SVM and the Dual Hard-Margin SVM. The Primal Hard-Margin SVM has \(\hat{d}+1\) parameters and \(N\) constraints; when \(\hat{d}+1\) is large, it is hard to solve. The Dual Hard-Margin SVM has \(N\) parameters and \(N+1\) constraints; when the amount of data \(N\) is large, the computation likewise becomes harder. Both forms yield \(w\) and \(b\) and hence the fattest hyperplane. In practice, when \(N\) is not too large, the Dual SVM is the usual choice.

The purpose of introducing the Dual SVM in this lesson was to avoid depending on \(\hat{d}\) during the computation, keeping the dependence on \(N\) only. But does the Dual SVM really remove the dependence on \(\hat{d}\)? Actually, it does not. In computing \(q_{n,m}=y_ny_mz_n^Tz_m\), the \(z\) vectors bring \(\hat{d}\) back in; the complexity is merely hidden inside the computation. So our goal has not truly been achieved. In the next lesson we will continue to study how to remove the dependence on \(\hat{d}\).

5. Summary

This lesson introduced another form of SVM: the Dual SVM. The motivation is to remove the dependence of the computation on \(\hat{d}\). The derivation of the Dual SVM introduces Lagrange multipliers \(\alpha\) and converts the SVM into a new unconstrained form. Then, using QP, we obtain the Lagrange multipliers \(\alpha\) of the optimal solution. Next, via the KKT conditions, we compute the corresponding \(w\) and \(b\), and finally obtain the fattest hyperplane. In the next lesson, we will address the dependence on \(\hat{d}\) in the Dual SVM computation.


Origin www.cnblogs.com/SweetZxl/p/10985380.html