02-31 Linear Support Vector Machine


Linear support vector machine

Before introducing the linear support vector machine, recall that the linearly separable support vector machine has one weakness: it does nothing about outliers. Outliers can make the data linearly inseparable, or, when an outlier happens to be chosen as a support vector, they can degrade the generalization ability of the model.

# Example: outliers make the data linearly inseparable
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

x1 = [2, 2.5, 3.2, 6.5]
x11 = [1, 4.5, 5, 6]
x2 = [1.2, 1.4, 1.5, 1.2]
x22 = [1, 1.5, 1.3, 1]

plt.scatter(x1, x2, s=50, color='b')
plt.scatter(x11, x22, s=50, color='r')
plt.vlines(3.9, 0.8, 2, colors="g", linestyles="-",
           label='$w*x+b=0$', alpha=0.2)
plt.text(1.1, 1.1, s='Outlier A', fontsize=15, color='k',
         ha='center', fontproperties=font)
plt.text(6.3, 1.3, s='Outlier B', fontsize=15, color='k',
         ha='center', fontproperties=font)
plt.legend()
plt.show()

[Figure: outliers A and B make the data set linearly inseparable]

As the figure shows, because of the outliers A and B the data set is no longer linearly separable, so the linearly separable support vector machine cannot classify it.

# Example: an outlier degrades the generalization ability of the model
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn import svm
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

np.random.seed(8)  # make the random data reproducible

# construct linearly separable data points
array = np.random.randn(20, 2)
X = np.r_[array-[3, 3], array+[3, 3]]
y = [0]*20+[1]*20

# build the svm model
clf = svm.SVC(kernel='linear')
clf.fit(X, y)

# build the mesh grid for the contour plot
x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
x1, x2 = np.meshgrid(np.linspace(x1_min, x1_max),
                     np.linspace(x2_min, x2_max))

# get the weight vector w: w_0*x_1+w_1*x_2+b=0
w = clf.coef_[0]
# add 1 so that the -1 contour can be drawn: [-1,0,1] + 1 = [0,1,2]
f = w[0]*x1 + w[1]*x2 + clf.intercept_[0] + 1

# plot H2, i.e. wx+b=-1
plt.contour(x1, x2, f, [0], colors='k', linestyles='--', alpha=0.1)
plt.text(2, -4, s=r'$H_2={\omega}x+b=-1$', fontsize=10, color='r', ha='center')

# plot the separating hyperplane, i.e. wx+b=0
plt.contour(x1, x2, f, [1], colors='k', alpha=0.1)
plt.text(2.5, -2, s=r'$\omega{x}+b=0$', fontsize=10, color='r', ha='center')
plt.text(2.5, -2.5, s='separating hyperplane', fontsize=10,
         color='r', ha='center', fontproperties=font)

# plot H1, i.e. wx+b=1
plt.contour(x1, x2, f, [2], colors='k', linestyles='--')
plt.text(3, 0, s=r'$H_1=\omega{x}+b=1$', fontsize=10, color='r', ha='center')

# scatter plot of the data points
plt.scatter(X[0:20, 0], X[0:20, 1], cmap=plt.cm.Paired, marker='x')
plt.text(1, 1.8, s='support vector', fontsize=10, color='gray',
         ha='center', fontproperties=font)

plt.scatter(X[20:40, 0], X[20:40, 1], cmap=plt.cm.Paired, marker='o')
plt.text(-1.5, -0.5, s='support vector', fontsize=10,
         color='gray', ha='center', fontproperties=font)

# plot a hypothetical outlier and hypothetical margin boundaries
# (illustration only, not meaningful modeling code)
plt.scatter(-1.5, 3, marker='x', c='b')
plt.text(-1.5, 3.1, s='Outlier A', fontsize=10,
         color='r', ha='center', fontproperties=font)
x_test = np.linspace(-3, 4, 666)
y_test = -(w[0]+0.12)*x_test - (w[1]-0.002)*x_test+clf.intercept_[0]+1.8
y_test_test = -(w[0]+0.12)*x_test - (w[1]-0.002)*x_test+clf.intercept_[0]+2.3

plt.plot(x_test, y_test, linestyle='--')
plt.plot(x_test, y_test_test)

plt.xlim(x1_min-1, x1_max+1)
plt.ylim(x2_min-1, x2_max+1)
plt.show()

[Figure: the outlier pushes the separating hyperplane from the dashed line to the solid line, narrowing the margin]

As the figure shows, the outlier pushes the separating hyperplane from the dashed line to the solid line; the outlier clearly does serious damage to the generalization ability of the model.

The linear support vector machine discussed here fixes the problems caused by outliers by replacing the hard margin maximization of the linearly separable support vector machine with soft margin maximization.

1. Linear support vector machine learning objectives

  1. The difference between hard margin maximization and soft margin maximization
  2. The definition of the linear support vector machine
  3. The hinge loss function
  4. The procedure of the linear support vector machine

2. Linear support vector machine explained

2.1 Hard margin maximization and soft margin maximization

2.1.1 Hard margin maximization

The optimization problem of the linearly separable support vector machine introduced earlier uses hard margin maximization; its objective function is
\[ \begin{align} & \underbrace{\min}_{\omega,b} {\frac{1}{2}}{||\omega||}^2 \\ & s.t. \quad y_i(\omega{x_i}+b)\geq1, \quad i=1,2,\ldots,m \end{align} \]
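As a quick numerical illustration (my own sketch, not part of the original derivation), the hard margin primal problem above is a small quadratic program and can be handed to a general-purpose solver; here scipy's SLSQP is used on the same kind of synthetic, linearly separable data as in the figures, and the variable layout theta = [w_0, w_1, b] is an assumption of this sketch.

# solve the hard margin primal QP directly (illustration only)
import numpy as np
from scipy.optimize import minimize

np.random.seed(8)
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = np.array([-1.0] * 20 + [1.0] * 20)

def objective(theta):
    # theta = [w_0, w_1, b]; minimize (1/2)||w||^2
    w = theta[:2]
    return 0.5 * w @ w

# constraints y_i(w x_i + b) - 1 >= 0 for every sample
constraints = [{'type': 'ineq',
                'fun': lambda theta: y * (X @ theta[:2] + theta[2]) - 1}]
res = minimize(objective, np.zeros(3), constraints=constraints, method='SLSQP')
print(res.x)  # [w_0, w_1, b] of the maximum margin separating hyperplane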

2.1.2 Soft margin maximization

As mentioned above, outliers can make the data linearly inseparable, which means some sample points can no longer satisfy the constraint that the functional margin be greater than or equal to \(1\). To solve this problem, a slack variable \(\xi_i\geq0\) is introduced for each sample point \((x_i,y_i)\), so that the functional margin plus the slack variable is greater than or equal to 1, and the constraint becomes
\[ y_i(wx_i+b)\geq1-\xi_i \]
Compared with hard margin maximization, the requirement on the functional distance from a sample to the separating hyperplane is relaxed: before, it had to be greater than or equal to \(1\); now it only needs to be greater than or equal to \(1\) after adding a slack variable that is greater than or equal to \(0\). At the same time, each slack variable \(\xi_i\) incurs a cost \(\xi_i\), so the objective function becomes
\[ {\frac{1}{2}}{||\omega||}^2 + C\sum_{i=1}^m\xi_i \]
where \(C>0\) is called the penalty parameter. The larger \(C\) is, the heavier the penalty on misclassified samples and the narrower the margin; the smaller \(C\) is, the lighter the penalty on misclassified samples and the wider the margin.

The objective function now expresses two wishes: make \({\frac{1}{2}}{||\omega||}^2\) as small as possible while also keeping the number of misclassified samples as small as possible, with \(C\) acting as the regularization penalty coefficient that balances the two. Compared with hard margin maximization, this idea is called soft margin maximization, and the primal problem of the linear support vector machine is therefore
\[ \begin{align} & \underbrace{\min}_{\omega,b,\xi} {\frac{1}{2}}{||\omega||}^2 + C\sum_{i=1}^m\xi_i \\ & s.t. \quad y_i(\omega{x_i}+b)\geq1-\xi_i, \quad i=1,2,\ldots,m \\ & \quad\quad \xi_i\geq0, \quad i=1,2,\ldots,m \end{align} \]
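To get a feel for the penalty parameter \(C\), here is a small sketch of my own (not from the derivation above; the deliberately flipped label is an artificial outlier) that fits scikit-learn's linear SVC for several values of C and prints the geometric margin width \(2/||\omega||\); as C grows, margin violations are penalized more heavily and the margin should get narrower.

# effect of the penalty parameter C on the margin width (illustration only)
import numpy as np
from sklearn import svm

np.random.seed(8)
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = np.array([0] * 20 + [1] * 20)
y[0] = 1  # flip one label to create an outlier, so the slack variables matter

for C in (0.01, 1.0, 100.0):
    clf = svm.SVC(kernel='linear', C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])  # geometric margin width 2/||w||
    print('C=%-6s margin width=%.3f  n_support=%s' % (C, margin, clf.n_support_))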

2.2 Definition of the linear support vector machine

For a given linearly inseparable training data set, solve the primal problem of the linear support vector machine, i.e. the convex quadratic programming problem of soft margin maximization. Suppose the solution of this problem is \(w^*\) and \(b^*\); the separating hyperplane is
\[ w^*x+b^* = 0 \]
and the classification decision function is
\[ f(x) = sign(w^*x+b^*) \]
This model is called the linear support vector machine.
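The decision function is literally the sign of an affine score. As a sanity check (a sketch of mine, not from the original article), the predictions of a fitted scikit-learn linear SVC can be reproduced from its coef_ and intercept_ attributes:

# reproduce the decision function f(x) = sign(w*x + b*) from a fitted model
import numpy as np
from sklearn import svm

np.random.seed(8)
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = np.array([-1] * 20 + [1] * 20)

clf = svm.SVC(kernel='linear', C=10.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

manual_pred = np.sign(X @ w + b)                     # f(x) = sign(w*x + b*)
print(np.array_equal(manual_pred, clf.predict(X)))   # True: same decision rule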

2.3 Optimizing the soft margin maximization objective function

The primal problem of the linear support vector machine is
\[ \begin{align} & \underbrace{\min}_{\omega,b,\xi} {\frac{1}{2}}{||\omega||}^2 + C\sum_{i=1}^m\xi_i \\ & s.t. \quad y_i(\omega{x_i}+b)\geq1-\xi_i, \quad i=1,2,\ldots,m \\ & \quad\quad \xi_i\geq0, \quad i=1,2,\ldots,m \end{align} \]
From the primal problem the Lagrangian of the linear support vector machine can be written as
\[ L(\omega,b,\xi,\alpha,\mu) = {\frac{1}{2}}{||w||}^2 + C\sum_{i=1}^m\xi_i - \sum_{i=1}^m\alpha_i(y_i(wx_i+b)-1+\xi_i) - \sum_{i=1}^m\mu_i\xi_i \]
where \(\mu_i\geq0, \alpha_i\geq0\) are the Lagrange multipliers.

The objective function we now need to optimize is
\[ \underbrace{\min}_{\omega,b,\xi} \underbrace{\max}_{\mu_i\geq0,\alpha_i\geq0} L(\omega,b,\alpha,\xi,\mu) \]
Since this optimization objective satisfies the KKT conditions, the problem above can be transformed, through Lagrange duality, into the equivalent dual problem
\[ \underbrace{\max}_{\mu_i\geq0,\alpha_i\geq0} \underbrace{\min}_{\omega,b,\xi} L(\omega,b,\alpha,\xi,\mu) \]
The dual problem is a minimax problem of the Lagrangian, so we first minimize \(L(\omega,b,\alpha,\xi,\mu)\) with respect to \(\omega,b,\xi\), i.e.
\[ \begin{align} & \nabla_wL(\omega,b,\alpha,\xi,\mu) = w-\sum_{i=1}^m\alpha_iy_ix_i = 0 \\ & \nabla_bL(\omega,b,\alpha,\xi,\mu) = -\sum_{i=1}^m\alpha_iy_i = 0 \\ & \nabla_{\xi_i}L(\omega,b,\alpha,\xi,\mu) = C-\alpha_i-\mu_i = 0 \end{align} \]
which gives
\[ \begin{align} & w = \sum_{i=1}^m\alpha_iy_ix_i \\ & \sum_{i=1}^m\alpha_iy_i = 0 \\ & C-\alpha_i-\mu_i = 0 \end{align} \]
Substituting the three equations above back into the Lagrangian (note: the derivation is similar to that of the linearly separable support vector machine and is not repeated here; only the result is given) yields
\[ \underbrace{\min}_{\omega,b,\xi} L(\omega,b,\alpha,\xi,\mu) = -{\frac{1}{2}}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j(x_ix_j) + \sum_{i=1}^m\alpha_i \]
Maximizing \(\underbrace{\min}_{\omega,b,\xi} L(\omega,b,\alpha,\xi,\mu)\) with respect to \(\alpha\), i.e. substituting it into the dual problem, gives
\[ \begin{align} & \underbrace{\max}_{\alpha} {-{\frac{1}{2}}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j(x_ix_j) + \sum_{i=1}^m\alpha_i} \\ & s.t. \quad \sum_{i=1}^m\alpha_iy_i = 0 \\ & \quad\quad C-\alpha_i-\mu_i = 0 \\ & \quad\quad \alpha_i\geq0, \quad i=1,2,\ldots,m \\ & \quad\quad \mu_i\geq0, \quad i=1,2,\ldots,m \end{align} \]
Since \(C-\alpha_i-\mu_i=0\) and \(\mu_i\geq0\), we have \(0\leq\alpha_i\leq{C}\). Turning the maximization into a minimization, the objective function becomes
\[ \begin{align} & \underbrace{\min}_{\alpha} {{\frac{1}{2}}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j(x_ix_j) - \sum_{i=1}^m\alpha_i} \\ & s.t. \quad \sum_{i=1}^m\alpha_iy_i = 0 \\ & \quad\quad 0\leq\alpha_i\leq{C} \end{align} \]
This is the soft margin maximization problem of the linear support vector machine. Compared with the hard margin maximization problem of the linearly separable support vector machine, it only adds the constraint \(0\leq\alpha_i\leq{C}\). The \(\alpha\) that minimizes this expression can be found with the SMO algorithm; denoting the value obtained by SMO as \(\alpha^*\), the solution \(\omega^*\) and \(b^*\) of the primal optimization problem can then be recovered from \(\alpha^*\).
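As an aside (a sketch of my own on top of scikit-learn, not part of the derivation): for a linear kernel, SVC stores \(y_i\alpha_i^*\) of the support vectors in its dual_coef_ attribute, so the relation \(w=\sum_i\alpha_iy_ix_i\) can be checked numerically.

# check w* = sum_i alpha_i^* y_i x_i using scikit-learn's dual coefficients
import numpy as np
from sklearn import svm

np.random.seed(8)
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = [0] * 20 + [1] * 20

clf = svm.SVC(kernel='linear', C=10.0).fit(X, y)

# dual_coef_ holds y_i * alpha_i^* for each support vector
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True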

The remaining derivation is the same as for the linearly separable support vector machine and is not repeated. One point worth mentioning: the article on the linearly separable support vector machine noted that you can obtain as many values of \(b^*\) as there are support vectors. For the linearly separable support vector machine these \(b^*\) are all identical, so nothing special is needed; for the linear support vector machine, because the slack variables affect each support vector differently, the \(b^*\) obtained from different support vectors differ, and the final \(b^*\) is usually taken as their average.
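A small sanity check of this averaging rule (my own sketch on synthetic data; the 1e-8 tolerance is an arbitrary choice): pick the support vectors with \(0<\alpha_i^*<C\), compute \(b^*=y_s-w^*x_s\) for each, and compare the average with scikit-learn's intercept_.

# average b* over the free support vectors (0 < alpha < C) and compare with intercept_
import numpy as np
from sklearn import svm

np.random.seed(8)
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = np.array([-1] * 20 + [1] * 20)
C = 10.0

clf = svm.SVC(kernel='linear', C=C).fit(X, y)
w = clf.coef_[0]
alpha = np.abs(clf.dual_coef_[0])        # alpha_i^* of each support vector
sv_x, sv_y = clf.support_vectors_, y[clf.support_]

free = alpha < C - 1e-8                  # free support vectors sit on the margin boundary
b_each = sv_y[free] - sv_x[free] @ w     # y_s(w x_s + b) = 1  =>  b = y_s - w x_s
print(b_each.mean(), clf.intercept_[0])  # the average is (numerically) close to intercept_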

2.4 Support vectors

2.4.1 Support vectors under hard margin maximization

Support vectors under hard margin maximization are relatively simple: they are the points satisfying \(y_j(w^*x_j+b^*)-1=0\). The article on the linearly separable support vector machine gave the explanation: by the KKT complementary slackness condition, sample points \((x_j,y_j)\) with \(\alpha^*>0\) satisfy \(y_j(w^*x_j+b^*)-1=0\), and the margin boundaries \(H_1\) and \(H_2\) are \(w^*x_j+b^*=1\) and \(w^*x_j+b^*=-1\), so the support vectors must lie on one of the margin boundaries.
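To see this numerically (a sketch of mine: with separable data and a very large C, the soft margin model behaves essentially like a hard margin one), every support vector of the fitted model should satisfy \(|w^*x_j+b^*|\approx1\):

# support vectors of a (numerically) hard margin SVM lie on the margin boundaries
import numpy as np
from sklearn import svm

np.random.seed(8)
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = [0] * 20 + [1] * 20

clf = svm.SVC(kernel='linear', C=1e5).fit(X, y)  # huge C ~ hard margin

margins = clf.decision_function(clf.support_vectors_)  # w*x_j + b* for each support vector
print(np.round(np.abs(margins), 3))                     # all values are approximately 1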

2.4.2 Support vectors under soft margin maximization

Under soft margin maximization the situation becomes more complicated, because a slack variable is introduced for each sample \((x_i,y_i)\). The distance from the \(i\)-th sample \(x_i\) to its corresponding margin boundary is \({\frac{\xi_i}{||w||}}\). From the KKT complementary slackness condition of soft margin maximization, \({\alpha_i}^*(y_i(wx_i+b)-1+\xi_i)=0\), we can conclude the following (a code sketch that sorts support vectors into these cases is given after the figure below):

  1. If \({\alpha_i}^*=0\), then \(\xi_i=0\); the sample lies on the margin boundary or has already been classified correctly, i.e. all the points in the figure that are far from the margin boundaries.
  2. If \(0<{\alpha_i}^*<C\), then \(\xi_i=0\) and the support vector \(x_i\) falls exactly on a margin boundary, i.e. the two support vectors in the figure.
  3. If \({\alpha_i}^*=C\), the point may be an outlier, and we need to check \(\xi_i\):

    1. If \(0<{\xi_i}<1\), the point \(x_i\) is classified correctly but lies between the separating hyperplane and the margin boundary, i.e. point 3 in the figure.
    2. If \(\xi_i=1\), the point \(x_i\) lies on the separating hyperplane and cannot be classified correctly.
    3. If \(\xi_i>1\), the point \(x_i\) lies on the misclassified side of the separating hyperplane, i.e. it cannot be classified correctly, which corresponds to points 1 and 2 in the figure.
# Example: support vectors under soft margin maximization
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn import svm
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

np.random.seed(8)  # make the random data reproducible

# construct linearly separable data points
array = np.random.randn(20, 2)
X = np.r_[array-[3, 3], array+[3, 3]]
y = [0]*20+[1]*20

# build the svm model
clf = svm.SVC(kernel='linear', C=10.0)
clf.fit(X, y)

# build the mesh grid for the contour plot
x1_min, x1_max = X[:, 0].min(), X[:, 0].max()
x2_min, x2_max = X[:, 1].min(), X[:, 1].max()
x1, x2 = np.meshgrid(np.linspace(x1_min, x1_max),
                     np.linspace(x2_min, x2_max))

# get the weight vector w: w_0*x_1+w_1*x_2+b=0
w = clf.coef_[0]
# add 1 so that the -1 contour can be drawn: [-1,0,1] + 1 = [0,1,2]
f = w[0]*x1 + w[1]*x2 + clf.intercept_[0] + 1

# plot H2, i.e. wx+b=-1
plt.contour(x1, x2, f, [0], colors='k', linestyles='--')

# plot the separating hyperplane, i.e. wx+b=0
plt.contour(x1, x2, f, [1], colors='k')

# plot H1, i.e. wx+b=1
plt.contour(x1, x2, f, [2], colors='k', linestyles='--')

# scatter plot of the data points
plt.scatter(X[0:20, 0], X[0:20, 1], cmap=plt.cm.Paired, marker='x')
plt.text(1, 1.8, s='support vector', fontsize=10, color='gray',
         ha='center', fontproperties=font)

plt.scatter(X[20:40, 0], X[20:40, 1], cmap=plt.cm.Paired, marker='o')
plt.text(-1.5, -0.5, s='support vector', fontsize=10,
         color='gray', ha='center', fontproperties=font)


# plot hypothetical outliers (illustration only, for explanation)
plt.annotate(r'${\frac{\xi_1}{||w||}}$', xytext=(0.5, 2.8), xy=(-1.2, -2.2),
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3"), fontsize=20)
plt.scatter(-1.2, -2.2, marker='o', c='orange')

plt.annotate(r'${\frac{\xi_2}{||w||}}$', xytext=(1.8, 1.3), xy=(
    1.2, -1.5), arrowprops=dict(arrowstyle="->", connectionstyle="arc3"), fontsize=20)
plt.scatter(1.2, -1.5, marker='o', c='orange')

plt.annotate(r'${\frac{\xi_3}{||w||}}$', xytext=(3, -0.2), xy=(
    3, -2), arrowprops=dict(arrowstyle="->", connectionstyle="arc3"), fontsize=20)
plt.scatter(3, -2, marker='o', c='orange')


plt.xlim(x1_min-1, x1_max+1)
plt.ylim(x2_min-1, x2_max+1)
plt.show()

[Figure: support vectors on the margin boundaries and the slack distances \(\xi_1/||w||\), \(\xi_2/||w||\), \(\xi_3/||w||\) of three hypothetical outliers]
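The following sketch makes the case analysis above concrete. It is my own illustration, not from the original article: the two clusters are deliberately placed closer together than in the figure so that bound support vectors (\(\alpha_i^*=C\)) actually occur, \(\xi_i\) is recovered as \([1-y_i(wx_i+b)]_+\), and each support vector is sorted into one of the cases listed in 2.4.2.

# sort the support vectors of a soft margin SVM by their alpha and xi values
import numpy as np
from sklearn import svm

np.random.seed(8)
array = np.random.randn(20, 2)
# clusters closer together than before, so the margin actually contains points
X = np.r_[array - [1, 1], array + [1, 1]]
y = np.array([-1] * 20 + [1] * 20)
C = 10.0

clf = svm.SVC(kernel='linear', C=C).fit(X, y)
alpha = np.abs(clf.dual_coef_[0])                            # alpha_i^* of each support vector
sv_x, sv_y = clf.support_vectors_, y[clf.support_]

xi = np.maximum(0, 1 - sv_y * clf.decision_function(sv_x))   # xi_i = [1 - y_i(w x_i + b)]_+

for a, x in zip(alpha, xi):
    if a < C - 1e-6:
        print('0 < alpha < C, xi=%.3f -> exactly on a margin boundary' % x)
    elif x < 1:
        print('alpha = C, 0 <= xi < 1 -> inside the margin, correctly classified')
    else:
        print('alpha = C, xi >= 1 -> on the wrong side of the separating hyperplane')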

2.5 Hinge loss function

The linear support vector machine has another interpretation: minimize the objective function
\[ \sum_{i=1}^m[1-y_i(wx_i+b)]_{+} + \lambda{||w||}^2 \]
where \(\sum_{i=1}^m[1-y_i(wx_i+b)]_{+}\) is called the empirical loss and measures how well the model fits the data, and \(\lambda{||w||}^2\) is the regularization term.

The function \(L(y(wx+b))=[1-y_i(wx_i+b)]_{+}\) is called the hinge loss function, where the subscript \(+\) means
\[ {[z]}_+ = \begin{cases} z \quad z>0 \\ 0 \quad z\leq0 \end{cases} \]
That is, if a point is classified correctly and its functional margin \(y_i(wx_i+b)\) is greater than 1, then \(z\leq0\) and the loss is 0; otherwise the loss is \(1-y_i(wx_i+b)\). Note that point 3 in the figure above is classified correctly and yet its loss is not 0, so the hinge loss function places a higher demand on learning.
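As a quick check of the formula (a sketch of mine on the synthetic data used earlier): the average of \([1-y_i(wx_i+b)]_{+}\) computed by hand matches scikit-learn's hinge_loss metric evaluated on the decision function.

# compute the hinge loss by hand and compare with sklearn.metrics.hinge_loss
import numpy as np
from sklearn import svm
from sklearn.metrics import hinge_loss

np.random.seed(8)
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = np.array([-1] * 20 + [1] * 20)

clf = svm.SVC(kernel='linear', C=10.0).fit(X, y)
decision = clf.decision_function(X)                # w x_i + b for every sample

manual = np.maximum(0, 1 - y * decision).mean()    # mean of [1 - y_i(w x_i + b)]_+
print(manual, hinge_loss(y, decision))             # the two values agree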

In fact the objective function above is equivalent to the soft margin maximization objective: regard \([1-y_i(wx_i+b)]_{+}\) as \(\xi_i\), so that the objective function can be written as
\[ \underbrace{\min}_{\omega,b} \quad \sum_{i=1}^m\xi_i+\lambda{||\omega||}^2 \]
If we take \(\lambda={\frac{1}{2C}}\), this becomes
\[ \underbrace{\min}_{\omega,b} \quad {\frac{1}{C}}({\frac{1}{2}}{||\omega||}^2+C\sum_{i=1}^m\xi_i) \]
which, apart from the constant factor \({\frac{1}{C}}\), is exactly the soft margin maximization objective
\[ \underbrace{\min}_{\omega,b} \quad {\frac{1}{2}}{||\omega||}^2+C\sum_{i=1}^m\xi_i \]
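This equivalence can be observed in scikit-learn (a sketch of mine, not part of the article): LinearSVC with loss='hinge' minimizes the hinge-loss form of the objective, while SVC with a linear kernel solves the soft margin form; with the same data and the same C the two fits should be close, though not identical, because LinearSVC also regularizes the intercept.

# the hinge-loss view (LinearSVC) versus the soft margin view (SVC) of the same problem
import numpy as np
from sklearn.svm import SVC, LinearSVC

np.random.seed(8)
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = [0] * 20 + [1] * 20
C = 10.0

svc = SVC(kernel='linear', C=C).fit(X, y)                      # 1/2||w||^2 + C*sum(xi)
lin = LinearSVC(loss='hinge', C=C, max_iter=100000).fit(X, y)  # hinge-loss form

print(svc.coef_[0], svc.intercept_[0])
print(lin.coef_[0], lin.intercept_[0])  # close, but not identical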

3. Linear support vector machine procedure

3.1 Input

A training set of \(m\) samples \(T=\{(x_1,y_1),(x_2,y_2),\cdots,(x_m,y_m)\}\), where \(x_i\) is an \(n\)-dimensional feature vector and \(y_i\) is a binary output taking the value \(1\) or \(-1\).

3.2 Output

The parameters \(w^*\) and \(b^*\) of the separating hyperplane and the classification decision function.

3.3 Procedure

  1. Construct the constrained optimization problem
    \[ \begin{align} & \underbrace{\min}_{\alpha} {\frac{1}{2}} \sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j(x_ix_j) - \sum_{i=1}^m\alpha_i \\ & s.t. \sum_{i=1}^m \alpha_iy_i =0 \\ & \quad 0\leq\alpha_i\leq{C}, \quad i=1,2,\ldots,m \end{align} \]
  2. Use the SMO algorithm to find the \(\alpha^*\) that minimizes the expression above.
  3. Compute \(w^*\):
    \[ w^* = \sum_{i=1}^m \alpha_i^*y_ix_i \]
  4. Find all \(S\) support vectors, i.e. the samples \((x_s,y_s)\) that satisfy \(0<{\alpha_i}^*<C\); for each support vector compute its \(b_s^*\) from \(y_s(\sum_{i=1}^S{\alpha_i}^*y_ix_ix_s+b_s^*)-1=0\), and take the average of all of them as the final value \(b^*={\frac{1}{S}}\sum_{s=1}^Sb_s^*\).
  5. The separating hyperplane is
    \[ w^*x+b^* = 0 \]
  6. The classification decision function is
    \[ f(x) = sign(w^*x+b^*) \]
    The procedure of the linear support vector machine is roughly the same as that of the linearly separable support vector machine; because the linear support vector machine uses slack variables, the biggest difference between the two lies in how the value of \(b^*\) is handled. (A runnable sketch of these steps is given after this list.)
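Here is a minimal end-to-end sketch of the procedure above on toy data. It is my own illustration and makes two substitutions relative to the text: scipy's general-purpose SLSQP solver stands in for SMO in step 2, and the 1e-5 tolerance used to pick the free support vectors is an arbitrary choice.

# the full procedure on toy data: solve the dual, recover w* and b*, compare with sklearn
import numpy as np
from scipy.optimize import minimize
from sklearn import svm

np.random.seed(8)
array = np.random.randn(20, 2)
X = np.r_[array - [3, 3], array + [3, 3]]
y = np.array([-1.0] * 20 + [1.0] * 20)
C = 10.0

# step 1: dual objective (1/2) alpha^T Q alpha - sum(alpha), with Q_ij = y_i y_j x_i.x_j
Q = (y[:, None] * X) @ (y[:, None] * X).T
def dual_objective(alpha):
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

# step 2: SLSQP stands in for SMO; constraints sum(alpha_i y_i) = 0 and 0 <= alpha_i <= C
res = minimize(dual_objective, np.zeros(len(y)), method='SLSQP',
               bounds=[(0.0, C)] * len(y),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
alpha = res.x

# step 3: w* = sum_i alpha_i^* y_i x_i
w = ((alpha * y)[:, None] * X).sum(axis=0)

# step 4: average b* over the free support vectors (0 < alpha < C)
free = (alpha > 1e-5) & (alpha < C - 1e-5)
b = np.mean(y[free] - X[free] @ w)

# steps 5-6: compare the recovered hyperplane with sklearn's
clf = svm.SVC(kernel='linear', C=C).fit(X, y)
print(w, clf.coef_[0])
print(b, clf.intercept_[0])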

4. Advantages and disadvantages of the linear support vector machine

4.1 Advantages

  1. It handles data sets that become linearly inseparable because of outliers.

4.2 Disadvantages

  1. It only supports binary classification; multi-class problems require auxiliary strategies such as OvR.
  2. It only copes with the non-linearity caused by outliers; it does not truly solve the problem of classifying non-linear data.

5. Summary

The linear support vector machine solves a big problem of the original support vector machine: the original formulation cannot cope with outliers, which cause great trouble for model optimization by narrowing the maximum margin and hurting the model's performance.

Although the linear support vector machine tolerates the linear inseparability caused by outliers, it does not fundamentally solve the problem that it still cannot handle genuinely linearly inseparable data. The kernel functions introduced next will turn the support vector machine into the king of classifier models: the non-linear support vector machine.
