Machine Learning (7): Support Vector Machines (SVM)


1 Knowledge Review

1.1 Gradient descent

Derivative : the derivative of a function at a point describes the rate of change of the function in the vicinity of that point. Geometrically, it is the slope of the tangent line to the function's curve at that point. The larger the value of the derivative, the faster the function changes at that point.

Gradient : the gradient is a vector pointing in the direction along which the directional derivative of a function at a point is largest, i.e., the direction along which the function changes fastest at that point; its magnitude (the norm of the gradient vector) is that maximal rate of change. When the function is one-dimensional, the gradient is simply the derivative.

The gradient descent method (Gradient Descent, GD) is an iterative algorithm commonly used to minimize a convex function (Convex Function) in the unconstrained case. Because a convex function has only one extreme point, the local minimum found is also the global minimum of the function.
The idea of gradient descent is to take the negative gradient direction at the current position as the search direction, because that is the direction of steepest descent at the current position; for this reason gradient descent is also known as the "steepest descent method". The closer gradient descent gets to the target value, the smaller the change in the variable. The update is computed as follows:

  x_{k+1} = x_k − α·∇f(x_k)

α is called the step size or learning rate (learning rate); it controls the magnitude of the change of the variable x in each iteration.
Convergence condition : the iteration ends when the change in the value of the objective function becomes very small, or when the maximum number of iterations is reached.
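The update and convergence rule above can be sketched in a few lines of Python; the quadratic test function, starting point, and learning rate here are illustrative assumptions, not from the original text.

```python
# A minimal sketch of gradient descent on the convex function
# f(x) = (x - 3)^2, whose minimum is at x = 3.

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10000):
    """Iterate x <- x - alpha * grad(x) until the change is tiny."""
    x = x0
    for _ in range(max_iter):
        step = alpha * grad(x)
        x -= step
        if abs(step) < tol:  # convergence: change in x is very small
            break
    return x

# f(x) = (x - 3)^2  ->  f'(x) = 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # ≈ 3.0
```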

1.2 Lagrange multipliers

The Lagrange multiplier method is a way to solve optimization problems in which the objective function is subject to equality constraints; the parameter α is called the Lagrange multiplier and is required to satisfy α ≠ 0. For min f(x) subject to h_i(x) = 0, the Lagrangian is:

  L(x, α) = f(x) + Σ_i α_i·h_i(x)

1.2.1 The dual problem

In optimization, the objective function f(x) comes in various forms. If both the objective function and the constraints are linear functions of the variable x, the problem is called linear programming ; if the objective function is quadratic, the optimization problem is called quadratic programming ; if the objective function or the constraints are nonlinear, the optimization problem is called nonlinear optimization . Every linear programming problem has a corresponding dual problem . The dual problem has the following properties:

  1. The dual of the dual is the primal problem;
  2. Regardless of whether the primal problem is convex, the dual problem is always a convex optimization problem;
  3. The dual problem gives a lower bound on the primal problem;
  4. When certain conditions are satisfied, the solutions of the primal problem and the dual problem are exactly equivalent.
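Properties 3 and 4 can be checked numerically on a small assumed example (not from the text): for the primal problem min x² subject to x ≥ 1 (optimum 1), the Lagrange dual function works out in closed form to g(β) = β − β²/4, which lower-bounds the primal optimum for every β ≥ 0 and matches it exactly at β = 2.

```python
primal_opt = 1.0  # optimum of: min x^2  s.t.  x >= 1

def g(beta):
    # dual function: min over x of x^2 + beta*(1 - x), attained at x = beta/2
    return beta - beta ** 2 / 4

# property 3: every dual value is a lower bound on the primal optimum
assert all(g(b / 10) <= primal_opt + 1e-12 for b in range(0, 101))
# property 4: at beta = 2 the bound is tight (strong duality)
print(g(2.0))  # 1.0
```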

1.3 KKT conditions

The KKT conditions are a generalization of the Lagrange multiplier method; their major application is solving optimization problems in which the objective function is subject to inequality constraints. The KKT conditions are the conditions that must be satisfied under inequality constraints. For min f(x) subject to h_i(x) = 0 and g_j(x) ≤ 0, the generalized Lagrangian is:

  L(x, α, β) = f(x) + Σ_i α_i·h_i(x) + Σ_j β_j·g_j(x),  β_j ≥ 0
An intuitive understanding of the KKT conditions (see the reference link):

  • A feasible solution must lie within the constraint region of g(x); from the figure, a feasible x satisfies either g(x) < 0 or g(x) = 0:
    • when the feasible solution x lies in the interior region g(x) < 0, directly minimizing f(x) yields the solution;
    • when the feasible solution x lies on the boundary g(x) = 0, the problem is directly equivalent to an equality-constrained problem.
  • When the feasible solution lies in the interior of the constraint region, setting β = 0 eliminates the constraint.
  • Regarding the value of the parameter β: under an equality constraint, it suffices for the gradients of the constraint function and the objective function to be parallel; under an inequality constraint, if β ≠ 0 the solution lies on the boundary of the feasible region. In that case the feasible solution should be as close as possible to the unconstrained solution, which is why the solution sits on the boundary; the negative gradient of the objective function at the solution then points away from the feasible region, toward the unconstrained solution, so the negative gradient direction of the objective function is the same as the gradient direction of the constraint function. From this it follows that β > 0.

1.3.1 Summary of the KKT conditions

  1. The stationarity condition of the Lagrangian, a necessary condition for an optimal feasible solution;
  2. The complementary slackness condition, which converts the inequality constraints into constraints on the multipliers;
  3. The original equality constraints;
  4. The original inequality constraints;
  5. The nonnegativity condition that the inequality multipliers must satisfy.

    ∇_x L(x, α, β) = 0
    β_j·g_j(x) = 0
    h_i(x) = 0
    g_j(x) ≤ 0
    β_j ≥ 0
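As a sanity check, the five conditions can be verified numerically on an assumed one-variable example (not from the text): minimize f(x) = x² subject to g(x) = 1 − x ≤ 0, whose constrained optimum is x* = 1 with multiplier β* = 2.

```python
# Numeric check of the KKT conditions for: min x^2  s.t.  1 - x <= 0
f_grad = lambda x: 2 * x      # gradient of the objective
g = lambda x: 1 - x           # inequality constraint, g(x) <= 0
g_grad = lambda x: -1.0       # gradient of the constraint

x_star, beta_star = 1.0, 2.0  # known constrained optimum and multiplier

# 1. stationarity: grad f + beta * grad g = 0
assert abs(f_grad(x_star) + beta_star * g_grad(x_star)) < 1e-12
# 2. complementary slackness: beta * g(x) = 0
assert abs(beta_star * g(x_star)) < 1e-12
# (3. no equality constraints in this example)
# 4. primal feasibility: g(x) <= 0
assert g(x_star) <= 0
# 5. dual feasibility: beta >= 0
assert beta_star >= 0
print("KKT conditions satisfied")
```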

1.4 Solving optimization problems

An optimization problem generally refers to finding, for a particular function, its global minimum over a specified range. It generally falls into the following three cases (note: the solutions found by these methods may well be local minima; only when the function is convex is the solution found guaranteed to be the global minimum):

Unconstrained problems : typically solved by gradient descent, Newton's method, or coordinate descent.
Equality-constrained problems : typically solved by the Lagrange multiplier method.
Inequality-constrained problems : typically solved via the KKT conditions.

1.5 Review of distance formulas

  • Point-to-line/plane distance formula:
    • Given a point p(x0, y0) and a line f(x, y) = Ax + By + C, the distance from p to the line is:

      d = |Ax0 + By0 + C| / √(A² + B²)

    • Extending to a multi-dimensional space, for a hyperplane f(X) = θX + b, the distance from a single point X0 to the hyperplane is:

      d = |θX0 + b| / ‖θ‖

      In this formula, the numerator |θX0 + b| is the functional distance ; dividing it by the norm in the denominator gives the geometric distance .
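Both distances can be computed directly; the hyperplane coefficients and the point below are assumed example values.

```python
import math

def distances(theta, b, x0):
    """Return (functional distance, geometric distance) of x0 to theta.x + b = 0."""
    functional = abs(sum(t * x for t, x in zip(theta, x0)) + b)
    geometric = functional / math.sqrt(sum(t * t for t in theta))
    return functional, geometric

# line 3x + 4y - 5 = 0 and point (1, 1): numerator |3 + 4 - 5| = 2, norm = 5
func_d, geom_d = distances([3.0, 4.0], -5.0, [1.0, 1.0])
print(func_d, geom_d)  # 2.0 0.4
```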

1.6 Perceptron model

The perceptron is one of the oldest classification algorithms. Its principle is simple and the generalization ability of the model is relatively weak, but the perceptron model is the foundation of SVMs, neural networks, and deep learning algorithms. The idea of the perceptron is simple: imagine a class with many students, divided into boys and girls; the perceptron model tries to find a straight line that separates all the boys from all the girls. In a high-dimensional space, the perceptron model looks for a hyperplane that separates the two classes. The premise of the perceptron model is that the data is linearly separable .

  1. There are m samples, each with an n-dimensional feature vector and a binary output category y, as follows:

    (x1, y1), (x2, y2), ..., (xm, ym),  y ∈ {+1, −1}

  2. The goal is to find a hyperplane, namely:

    θ·x = 0  (the bias absorbed into θ via a constant feature)

  3. One category of samples satisfies θ·x > 0; the other category satisfies θ·x < 0.
  4. The perceptron model:

    y = sign(θ·x)

  5. Correct classification: yθx > 0; misclassification: yθx < 0. We can therefore define the loss function as the total distance from all misclassified samples (the set M, out of the m samples) to the hyperplane, to be minimized:

    L(θ) = −Σ_{i∈M} y_i·θ·x_i / ‖θ‖

    Because both the numerator and the denominator contain θ, scaling the numerator by a factor of N scales the denominator by the same factor, i.e., numerator and denominator are proportional. So we can fix either the numerator or the denominator to 1 and minimize the other; fixing the denominator ‖θ‖ = 1, the loss function simplifies to:

    L(θ) = −Σ_{i∈M} y_i·θ·x_i

  6. The loss function can be solved directly with gradient descent, but because M, the set of misclassified sample points , is not fixed, batch gradient descent (BGD) cannot be used; only stochastic gradient descent (SGD) or mini-batch gradient descent (MBGD) applies. SGD is generally used for the perceptron model; the update for a misclassified sample is:

    θ ← θ + α·y_i·x_i
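The SGD training loop described in step 6 can be sketched as follows; the toy dataset, learning rate, and epoch limit are assumptions for illustration, with the bias absorbed into θ through a constant feature.

```python
import random

def perceptron_sgd(X, y, alpha=1.0, epochs=100, seed=0):
    """Train a perceptron with SGD: update only on misclassified points."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        errors = 0
        for i in order:
            score = sum(t * xi for t, xi in zip(theta, X[i]))
            if y[i] * score <= 0:  # misclassified (or on the plane)
                errors += 1
                # SGD update: theta <- theta + alpha * y_i * x_i
                theta = [t + alpha * y[i] * xi for t, xi in zip(theta, X[i])]
        if errors == 0:  # converged: all points correctly classified
            break
    return theta

# toy linearly separable data; the leading 1 is the constant bias feature
X = [[1, 2, 1], [1, 3, 2], [1, -2, -1], [1, -3, -2]]
y = [1, 1, -1, -1]
theta = perceptron_sgd(X, y)
preds = [1 if sum(t * xi for t, xi in zip(theta, row)) > 0 else -1 for row in X]
print(preds)  # [1, 1, -1, -1]
```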

2 SVM

The support vector machine (Support Vector Machine, SVM) is itself a binary classification algorithm and an extension of the perceptron model. Today's SVM algorithms support both linear classification and nonlinear classification , can also be applied directly to regression, and, via OvR or OvO, can be used for multi-class classification as well. Setting aside ensemble learning algorithms and any particular dataset, SVM can be considered an especially strong classification algorithm.

2.1 Linearly separable SVM

In the perceptron model, the algorithm finds a separating hyperplane in the data so that as much of the data as possible falls on the two sides of the plane, achieving classification; but in real data there may be multiple hyperplanes that meet this requirement.
In the perceptron model, multiple separating hyperplanes can be found, and the optimization hopes that all points are as far from the hyperplane as possible. In fact, points far enough from the hyperplane are essentially always correctly classified, so pushing them further is not meaningful; what matters are the points close to the hyperplane, since these are the ones most easily misclassified. In other words, it suffices to make the points that are close to the hyperplane as far away from it as possible .

  • Linearly separable (Linearly Separable): if a hyperplane can be found that separates the two groups of data in the dataset, the dataset is called linearly separable.
  • Linearly inseparable (Linear Inseparable): if no hyperplane can be found that separates the two groups of data, the dataset is called linearly inseparable.
  • Separating hyperplane (Separating Hyperplane): the line/plane that splits the dataset is called the separating hyperplane.
  • Margin (Margin): the distance from a data point to the separating hyperplane is called the margin.
  • Support vectors (Support Vector): the points closest to the separating hyperplane are called support vectors.
  • The distance from a support vector to the hyperplane wᵀx + b = 0 is:

      functional distance: y·(wᵀx + b)
      geometric distance: y·(wᵀx + b) / ‖w‖

    Note: in SVM, the functional distance from a support vector to the hyperplane is conventionally set to 1.
  • The SVM model requires all classified points to lie beyond their own class's support vectors, while requiring the support vectors to be as far as possible from the hyperplane. Expressed mathematically:

      max 2/‖w‖  s.t. y_i·(wᵀx_i + b) ≥ 1, i = 1, 2, ..., m

  • The above is optimized into the SVM loss function:

      min ½‖w‖²  s.t. y_i·(wᵀx_i + b) ≥ 1, i = 1, 2, ..., m

    Using the KKT conditions, this objective function with its constraints is converted into a Lagrangian, and thus into an unconstrained optimization function:

      L(w, b, β) = ½‖w‖² − Σ_i β_i·[y_i·(wᵀx_i + b) − 1],  β_i ≥ 0

    After the Lagrange multipliers are introduced, the optimization target becomes:

      min_{w,b} max_{β≥0} L(w, b, β)

    By Lagrangian duality, this target is converted into the equivalent dual problem, so the optimization target becomes:

      max_{β≥0} min_{w,b} L(w, b, β)

    For this optimization function, we can therefore first minimize over w and b, and then maximize over the Lagrange multipliers β.

  • First find the values of w and b that minimize the function L; this extremum is obtained directly by setting the partial derivatives of L with respect to w and b to zero:

      ∂L/∂w = 0  ⇒  w = Σ_i β_i·y_i·x_i
      ∂L/∂b = 0  ⇒  Σ_i β_i·y_i = 0

    Substituting the solved w and b back into the optimization function L, the optimized function is defined as:

      ψ(β) = Σ_i β_i − ½·Σ_i Σ_j β_i·β_j·y_i·y_j·x_iᵀx_j

  • After minimizing over w and b, the resulting optimization function depends only on β, so we can directly maximize it to obtain the values of β, and from those finally obtain w and b:

      max_β Σ_i β_i − ½·Σ_i Σ_j β_i·β_j·y_i·y_j·x_iᵀx_j
      s.t. Σ_i β_i·y_i = 0, β_i ≥ 0

  • Suppose an optimal solution β* exists; from the relations between w, b, and β, the corresponding values of w and b can be computed (in practice, b is usually taken as the mean of the values computed from all support vectors):

      w* = Σ_i β_i*·y_i·x_i
      b* = y_s − w*ᵀx_s, averaged over the support vectors (x_s, y_s)

    Here (x_s, y_s) are the support vectors; by the dual complementarity (complementary slackness) condition of the KKT conditions, a support vector must satisfy:

      y_s·(wᵀx_s + b) = 1
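The derivation above can be worked through on an assumed two-point dataset (not from the text): x1 = (0, 0) with y1 = −1 and x2 = (2, 0) with y2 = +1. The constraint Σβ_i·y_i = 0 forces β1 = β2 = β, the dual objective reduces to 2β − 2β², and simple gradient ascent recovers β* = 0.5, giving w = (1, 0) and b = −1 (the separating line x = 1).

```python
X = [(0.0, 0.0), (2.0, 0.0)]
y = [-1.0, 1.0]

# maximize the reduced dual 2*beta - 2*beta^2 by gradient ascent
beta = 0.0
for _ in range(1000):
    beta += 0.01 * (2 - 4 * beta)  # derivative of the reduced dual

# w = sum_i beta_i * y_i * x_i  (here beta_1 = beta_2 = beta)
w = [sum(beta * y[i] * X[i][d] for i in range(2)) for d in range(2)]
# b from a support vector: b = y_s - w . x_s
b = y[1] - sum(w[d] * X[1][d] for d in range(2))
print([round(v, 3) for v in w], round(b, 3))  # [1.0, 0.0] -1.0
```

Both points are support vectors here: each satisfies y·(wᵀx + b) = 1 exactly.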

2.1.1 Algorithm flow

  • Input: m linearly separable samples {(x1, y1), (x2, y2), ..., (xm, ym)}, where x is an n-dimensional feature vector and y is a binary output taking the value +1 or −1. The SVM model outputs the parameters w and b and the classification decision function.
    1. Construct the constrained optimization problem:

      max_β Σ_i β_i − ½·Σ_i Σ_j β_i·β_j·y_i·y_j·x_iᵀx_j
      s.t. Σ_i β_i·y_i = 0, β_i ≥ 0

    2. Use the SMO algorithm to find the optimal solution β* of the optimization above;
    3. Find the set S of all support vectors:

      S = {(x_i, y_i) : β_i* > 0}

    4. Update the values of the parameters w and b:

      w* = Σ_i β_i*·y_i·x_i
      b* = (1/|S|)·Σ_{s∈S} [y_s − w*ᵀx_s]

    5. Construct the final classifier:

      f(x) = sign(w*ᵀx + b*)
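The final classifier in step 5 is just the sign of the decision function; a minimal sketch with assumed parameter values (w = (1, 0), b = −1, i.e., the separating line x = 1):

```python
def svm_classify(w, b, x):
    """f(x) = sign(w . x + b)."""
    score = sum(wd * xd for wd, xd in zip(w, x)) + b
    return 1 if score > 0 else -1

print(svm_classify([1.0, 0.0], -1.0, [3.0, 2.0]))  # 1
print(svm_classify([1.0, 0.0], -1.0, [0.0, 5.0]))  # -1
```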

2.1.2 Algorithm summary

  1. The data is required to be linearly separable;
  2. A purely linearly separable SVM model may be inaccurate when predicting on data with outliers;
  3. For linearly separable data, the SVM classifier works very well.

2.2 The soft-margin SVM model

The linearly separable SVM requires the data to be linearly separable before a separating hyperplane can be found, but sometimes a linear dataset contains a small number of outliers, and because of these outliers the dataset cannot be linearly separated. Put plainly: the normal data is itself linearly separable, but the presence of outlier points makes the dataset linearly inseparable.
When outliers in linear data prevent the direct use of the linear SVM separating model, the problem can be solved by introducing the concept of a soft margin.

Hard margin: the distance measure in the linearly separable SVM can be regarded as a hard margin; the linearly separable SVM requires the functional distance to be at least 1, and the hard-margin maximization condition is:

  min ½‖w‖²  s.t. y_i·(wᵀx_i + b) ≥ 1, i = 1, 2, ..., m

Soft margin: the SVM introduces a slack factor (ξ) for every sample in the training set, requiring the functional distance plus the slack factor to be greater than or equal to 1. Compared with the hard margin, this relaxes the requirement on the distance from a sample to the hyperplane:

  y_i·(wᵀx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

The larger the slack factor (ξ), the closer the sample point is to the hyperplane; a slack factor greater than 1 means the sample is allowed to be misclassified. Adding slack factors therefore has a cost: an overly large slack factor may cause the model to misclassify, so the final objective function becomes:

  min ½‖w‖² + C·Σ_i ξ_i
  s.t. y_i·(wᵀx_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, ..., m

Note: C > 0 in the function is the penalty parameter, a hyperparameter similar to the coefficient on an L1/L2 norm. A larger C means a larger penalty for misclassification, and a smaller C a smaller penalty; the value of C needs to be tuned.
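The slack factors can be computed explicitly for a candidate hyperplane as ξ_i = max(0, 1 − y_i·(wᵀx_i + b)). A small sketch with assumed data, hyperplane, and C, where only the margin-violating middle point incurs a nonzero slack:

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (0.5*||w||^2 + C*sum(slacks), slacks) for a candidate hyperplane."""
    margin_term = 0.5 * sum(v * v for v in w)
    slacks = [max(0.0, 1 - yi * (sum(wd * xd for wd, xd in zip(w, xi)) + b))
              for xi, yi in zip(X, y)]
    return margin_term + C * sum(slacks), slacks

X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]  # middle point violates the margin
y = [1.0, 1.0, -1.0]
obj, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, y, C=1.0)
print([round(s, 2) for s in slacks])  # [0.0, 0.5, 0.0] -> only the violator pays
```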

  • As with the linearly separable SVM, construct the Lagrangian corresponding to the soft-margin maximization constrained problem:

      L(w, b, ξ, β, μ) = ½‖w‖² + C·Σ_i ξ_i − Σ_i β_i·[y_i·(wᵀx_i + b) − 1 + ξ_i] − Σ_i μ_i·ξ_i,  β_i ≥ 0, μ_i ≥ 0

  • The optimization objective function is thus converted into:

      min_{w,b,ξ} max_{β,μ} L(w, b, ξ, β, μ)

  • The optimization target likewise satisfies the KKT conditions, so Lagrangian duality converts the optimization problem into the equivalent dual problem:

      max_{β,μ} min_{w,b,ξ} L(w, b, ξ, β, μ)

  • First minimize the optimization function over w, b, and ξ by setting the partial derivatives of L with respect to w, b, and ξ to zero, which yields the relations between w, b, ξ and β, μ:

      ∂L/∂w = 0  ⇒  w = Σ_i β_i·y_i·x_i
      ∂L/∂b = 0  ⇒  Σ_i β_i·y_i = 0
      ∂L/∂ξ = 0  ⇒  C − β_i − μ_i = 0

  • Substituting the values of w, b, and ξ into L eliminates w, b, and ξ from the optimization function; the optimized function is defined as:

      ψ(β) = Σ_i β_i − ½·Σ_i Σ_j β_i·β_j·y_i·y_j·x_iᵀx_j

  • The final optimized objective/loss function is essentially the same as in the linearly separable SVM model, apart from the different constraint, which means the SMO algorithm can also be used to solve it:

      max_β Σ_i β_i − ½·Σ_i Σ_j β_i·β_j·y_i·y_j·x_iᵀx_j
      s.t. Σ_i β_i·y_i = 0, 0 ≤ β_i ≤ C

2.2.1 Algorithm flow

Input: m samples {(x1, y1), (x2, y2), ..., (xm, ym)}, where x is an n-dimensional feature vector and y is a binary output taking the value +1 or −1. The SVM model outputs the parameters w and b and the classification decision function.

  1. Choose a penalty parameter C > 0 and construct the constrained optimization problem:

      max_β Σ_i β_i − ½·Σ_i Σ_j β_i·β_j·y_i·y_j·x_iᵀx_j
      s.t. Σ_i β_i·y_i = 0, 0 ≤ β_i ≤ C

  2. Use the SMO algorithm to find the optimal solution β* of the optimization above;
  3. Find the set S of all support vectors:

      S = {(x_i, y_i) : 0 < β_i* < C}

  4. Update the values of the parameters w and b:

      w* = Σ_i β_i*·y_i·x_i
      b* = (1/|S|)·Σ_{s∈S} [y_s − w*ᵀx_s]

  5. Construct the final classifier:

      f(x) = sign(w*ᵀx + b*)

2.2.2 Algorithm summary

  1. It can solve classification model building for linear data with outliers;
  2. Introducing the penalty coefficient (and slack factors) improves generalization ability, i.e., robustness;
  3. A smaller penalty coefficient means that more misclassified samples are tolerated when building the model, so the model's accuracy will be relatively low; a larger penalty coefficient means that fewer misclassified samples are tolerated, so the model's accuracy will be higher.


Origin www.cnblogs.com/tankeyin/p/12155730.html