Support Vector Machines: Machine Learning Principles and Practice with sklearn

1. Problem Scenario

Question: how should the linearly separable and the linearly inseparable data sets shown in the figure below be classified?

Ideas:

  • (1) For a linearly separable data set, find an optimal separating hyperplane
  • (2) For a linearly inseparable data set, transform it into a linearly separable one by some method

These two questions frame the following summary of SVM.

2. How to find the optimal separating hyperplane

In general, when a training set is linearly separable, there are infinitely many separating hyperplanes that classify the data correctly; the perceptron, for example, can produce infinitely many of them. To obtain a unique optimal separating hyperplane, the support vector machine maximizes the margin.

2.1 Certainty of a classification prediction

In the figure above, points A, B and C are three examples, all on the positive side of the separating hyperplane. Point A is far from the hyperplane, so predicting it as the positive class can be done with high certainty; point C is close to the hyperplane, so the same prediction is much less certain; point B lies between A and C, and the certainty of predicting it as positive is in between.

In general, the farther the data points of a training set are from the separating hyperplane, the more certain the classification. Once the hyperplane \(w^Tx + b = 0\) is fixed, the distance of a data point from it can be measured by the functional margin and the geometric margin.

2.2 Functional margin

For a given training set T and hyperplane (w, b), the functional margin of the hyperplane (w, b) with respect to a sample point \((x_i, y_i)\) is defined as: \[\overline{\gamma}_i = y_i(w\bullet{x_i}+b)\]

The functional margin of the hyperplane (w, b) with respect to the training set T is defined as the minimum of the functional margins over all sample points \((x_i, y_i)\) in T: \[\overline{\gamma} = \min\limits_{i=1,...,N}\overline{\gamma}_i\]

The functional margin expresses both the correctness and the certainty of a classification prediction, but it is not sufficient by itself for choosing the separating hyperplane, because rescaling w and b changes it without changing the hyperplane.

2.3 Geometric margin

For a given training set T and hyperplane (w, b), the geometric margin of the hyperplane (w, b) with respect to a sample point \((x_i, y_i)\) is defined as: \[\gamma_i = \frac{y_i(w\bullet{x_i}+b)}{||w||}\]

The geometric margin of the hyperplane (w, b) with respect to the training set T is defined as the minimum of the geometric margins over all sample points \((x_i, y_i)\) in T: \[\gamma = \min\limits_{i=1,...,N}\gamma_i\]

2.4 Relationship between the functional and geometric margins

From the definitions above, the functional and geometric margins are related by: \[\gamma_i = \frac{\overline{\gamma}_i}{||w||}\]

\[\gamma = \frac{\overline{\gamma}}{||w||}\]
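
To make these two definitions concrete, here is a minimal numpy sketch that computes both margins for a small data set; the toy points and the hyperplane parameters below are made up for illustration, not taken from the original:

import numpy as np

# toy 2D data set: two positive and two negative points (made-up values)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y = np.array([1, 1, -1, -1])

# an arbitrary separating hyperplane w.x + b = 0 (also made up)
w = np.array([1.0, 1.0])
b = -4.0

functional_margins = y * (X @ w + b)                          # y_i(w.x_i + b)
geometric_margins = functional_margins / np.linalg.norm(w)    # divided by ||w||

print('functional margin of the set:', functional_margins.min())
print('geometric margin of the set :', geometric_margins.min())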

2.5 Hard margin maximization

The basic idea of support vector machine learning is to find the separating hyperplane that both classifies the training set correctly and has the largest geometric margin. In other words, it not only separates the positive and negative examples, but also separates the hardest examples (the points closest to the hyperplane) with a sufficiently large degree of certainty. The hard margin described here is contrasted with the soft margin introduced later.

Finding the separating hyperplane with the maximum geometric margin can be expressed as the following constrained optimization problem: \[\max\limits_{w,b}\quad\gamma\]

\[s.t.\quad\frac{y_i(w\bullet{x_i}+b)}{||w||}\geq\gamma,\quad{i=1,2,...,N}\]

Using the relationship between the functional and geometric margins, this can be equivalently rewritten as: \[\max\limits_{w,b}\quad\frac{\overline{\gamma}}{||w||}\]

\[s.t.\quad y_i(w\bullet{x_i}+b)\geq\overline{\gamma},\quad{i=1,2,...,N}\]

Since rescaling w and b proportionally rescales the functional margin \(\overline{\gamma}\) without changing the hyperplane, we may fix \(\overline{\gamma} = 1\). Maximizing \(\frac{1}{||w||}\) is then equivalent to minimizing \(\frac{1}{2}||w||^2\), which gives: \[\min\limits_{w,b}\quad\frac{1}{2}||w||^2\]

\[s.t.\quad y_i(w\bullet{x_i}+b)\geq 1,\quad{i=1,2,...,N}\]

From the solution \(w^*, b^*\) the separating hyperplane is: \[w^{*}\bullet x + b^{*} = 0\]

and the classification decision function is: \[f(x) = sign(w^{*}\bullet x + b^{*})\]

Solving via the Lagrangian dual. The Lagrangian is: \[L(w,b,a) = \frac{1}{2}||w||^2 - \sum_{i=1}^na_i[y_i(w\bullet x_i+b)-1] \qquad (1)\]
Taking the partial derivative with respect to w: \[\frac{\partial L}{\partial w} = w - \sum_{i=1}^na_iy_ix_i = 0 \qquad (2)\]
Taking the partial derivative with respect to b: \[\frac{\partial L}{\partial b} = \sum_{i=1}^na_iy_i = 0 \qquad (3)\]
Substituting (2) and (3) into (1) gives the dual problem: \[maxL(a) = -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^na_ia_jy_iy_jx_ix_j + \sum_{i=1}^na_i\]

\[s.t. \quad \sum_{i=1}^na_iy_i = 0\]

\[a_i \geq 0\]
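
As a numerical sanity check of this dual problem, the following sketch solves it with scipy's generic constrained optimizer on a made-up 2D toy set and recovers \(w^*\) and \(b^*\) from the optimal multipliers. This is only an illustration of the formulation; a dedicated QP solver (e.g. SMO, as used inside libsvm) is what real implementations use.

import numpy as np
from scipy.optimize import minimize

# toy linearly separable 2D data (made-up values)
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0], [0.5, 2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i.x_j

def neg_dual(a):
    # negative of the dual objective, so a minimizer can be used
    return 0.5 * a @ G @ a - a.sum()

constraints = [{'type': 'eq', 'fun': lambda a: a @ y}]   # sum_i a_i y_i = 0
bounds = [(0, None)] * len(y)                            # a_i >= 0
res = minimize(neg_dual, x0=np.zeros(len(y)), bounds=bounds, constraints=constraints)

a = res.x
w = (a * y) @ X                    # w* = sum_i a_i y_i x_i, from equation (2)
sv = np.argmax(a)                  # any point with a_i > 0 is a support vector
b = y[sv] - w @ X[sv]              # b* from y_i(w.x_i + b) = 1 at a support vector
print('a* =', np.round(a, 3), ' w* =', np.round(w, 3), ' b* =', round(b, 3))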

2.6 Soft margin maximization

A linearly separable data set can be divided by hard margin maximization, but when the data are not linearly separable some sample points cannot satisfy the constraint that the functional margin be at least 1. To handle this, a slack variable \(\xi_i \geq 0\) is introduced for each sample point \((x_i, y_i)\), so that the functional margin plus the slack variable is at least 1. The constraint becomes: \[y_i(w\bullet x_i+b)\geq 1-\xi_i\]

At the same time, each slack variable \(\xi_i\) incurs a cost \(\xi_i\), and the original objective \(\frac{1}{2}||w||^2\) becomes \(\frac{1}{2}||w||^2 + C\sum_{i=1}^n{\xi_i}\).

Here C is the penalty coefficient, usually chosen for the application at hand: a large C penalizes misclassification heavily, a small C penalizes it lightly.

The learning problem of the linear support vector machine on linearly inseparable data is the following convex quadratic program: \[\min\limits_{w,b,\xi}\quad\frac{1}{2}||w||^2 + C\sum_{i=1}^n{\xi_i}\]

\[s.t.\quad y_i(w\bullet{x_i}+b)\geq 1 - \xi_i,\quad{i=1,2,...,N}\]

\[\xi_i \geq 0,\quad i = 1,2,...,N\]

From the solution \(w^*, b^*\) the separating hyperplane is: \[w^{*}\bullet x + b^{*} = 0\]

and the classification decision function is: \[f(x) = sign(w^{*}\bullet x + b^{*})\]

The Lagrangian dual problem is:
\[maxL(a) = -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^na_ia_jy_iy_jx_ix_j + \sum_{i=1}^na_i\]

\[s.t. \quad \sum_{i=1}^na_iy_i = 0\]

\[a_i \geq 0\]

\[\mu_i \geq 0\]

\[C-a_i-\mu_i = 0\]
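
For reference, these conditions follow from the Lagrangian of the soft-margin problem, with multipliers \(a_i \geq 0\) for the margin constraints and \(\mu_i \geq 0\) for the constraints \(\xi_i \geq 0\). This derivation is a standard sketch, not spelled out in the original text: \[L(w,b,\xi,a,\mu) = \frac{1}{2}||w||^2 + C\sum_{i=1}^n\xi_i - \sum_{i=1}^na_i[y_i(w\bullet x_i+b)-1+\xi_i] - \sum_{i=1}^n\mu_i\xi_i\]

Setting the partial derivative with respect to each \(\xi_i\) to zero gives \[\frac{\partial L}{\partial \xi_i} = C - a_i - \mu_i = 0\] which together with \(a_i \geq 0\) and \(\mu_i \geq 0\) implies \(0 \leq a_i \leq C\).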

2.7 Support vectors and margin boundaries

In the linearly separable case, the sample points of the training set closest to the separating hyperplane are called support vectors. A support vector is a point at which the constraint holds with equality, i.e. \[y_i(w\bullet{x_i}+b) - 1 = 0\] or, in the soft-margin case, \[y_i(w\bullet x_i+b) - (1-\xi_i) = 0\] For positive examples with \(y_i = +1\), the support vectors lie on the hyperplane \[H_1: w^Tx + b = 1\] and for negative examples with \(y_i = -1\), they lie on the hyperplane \[H_2: w^Tx + b = -1\] \(H_1\) and \(H_2\) are parallel, and no example point falls between them. They bound a long band; the separating hyperplane is parallel to them and lies halfway between them. The distance between \(H_1\) and \(H_2\) is called the margin; it depends only on the normal vector \(w\) of the separating hyperplane and equals \(\frac{2}{||w||}\). \(H_1\) and \(H_2\) are called the margin boundaries, as shown below:

Only the support vectors play a role in determining the separating hyperplane; the other example points do not. Moving a support vector changes the solution, but moving other example points outside the margin boundaries, or even removing them, leaves the solution unchanged. Because the support vectors play the decisive role in determining the separating hyperplane, this classifier is called a support vector machine. The number of support vectors is generally small, so a support vector machine is determined by very few "very important" training samples.
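
A minimal sklearn sketch illustrating this point (linear kernel, made-up two-blob data): refitting on the support vectors alone reproduces essentially the same hyperplane.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# two made-up, well-separated Gaussian blobs
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel='linear', C=1.0).fit(X, y)
print('support vector indices:', clf.support_)

# refit using only the support vectors: the hyperplane is (almost) unchanged
clf_sv = SVC(kernel='linear', C=1.0).fit(X[clf.support_], y[clf.support_])
print('w, b from all points      :', clf.coef_, clf.intercept_)
print('w, b from support vectors :', clf_sv.coef_, clf_sv.intercept_)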

3. How to transform a linearly inseparable data set into a linearly separable one

3.1 Why a data set may not be linearly separable

(1) The data set is inherently not linearly separable

(2) Noise in the data set, or mistakes made when class labels were assigned manually, make the data set linearly inseparable

3.2 Common methods

Common methods for transforming a linearly inseparable data set into a linearly separable one:

For cause (2):

  • Amend the model by adding the penalty coefficient C. The amended model can "tolerate" misclassified points, while the penalty term keeps the amount of misclassification as small as reasonably possible

For cause (1):

  • (1) Add similarity features computed with a similarity function
  • (2) Use a kernel function (polynomial kernel, Gaussian RBF kernel, ...) to map the original low-dimensional feature space into a higher-dimensional feature space in which the data set becomes linearly separable

3.3 The kernel trick in support vector machines

Note that in the dual problem of the linear support vector machine, both the objective function and the decision function involve only inner products between input examples. The inner product \(x_ix_j\) in the dual objective can therefore be replaced by a kernel function \[K(x_i,x_j) = \phi (x_i)\bullet \phi(x_j)\] and the dual objective becomes \[maxL(a) = -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^na_ia_jy_iy_jK(x_i,x_j) + \sum_{i=1}^na_i\]

Likewise, the inner product in the classification decision function can be replaced by the kernel function: \[f(x) = sign(\sum_{i=1}^na_i^*y_iK(x_i,x)+b^*)\]
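
A small numpy sketch (made-up vectors, not from the original) verifying the identity \(K(x,z) = \phi(x)\bullet\phi(z)\) for the degree-2 polynomial kernel \(K(x,z) = (x\bullet z)^2\), whose explicit feature map in two dimensions is \(\phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)\):

import numpy as np

def phi(x):
    # explicit feature map of the degree-2 polynomial kernel (x.z)^2 in 2D
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly_kernel(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])    # made-up vectors
z = np.array([3.0, -1.0])

print('kernel value    :', poly_kernel(x, z))         # (1*3 + 2*(-1))^2 = 1
print('phi(x) . phi(z) :', np.dot(phi(x), phi(z)))    # same value, computed in feature space

The kernel computes the same inner product without ever constructing the higher-dimensional features, which is exactly what makes kernels attractive when the implicit feature space is large or infinite-dimensional (as for the RBF kernel).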

4 Training an SVM with the sklearn framework

  • SVMs are particularly well suited to small but complex data sets (samples < 100k)
  • Hard margin classification has two main problems:
    • (1) the data must be linearly separable
    • (2) it is very sensitive to outliers, which can lead to poor generalization or make it impossible to find a hard margin at all
  • Soft margin classification fixes both problems: the goal is to find a good balance between keeping the street as wide as possible and limiting margin violations (instances that end up on the street, or even on the wrong side)
  • In sklearn's SVM classes, this balance is controlled by the hyperparameter C: the smaller C is, the wider the street but the more margin violations. If the SVM model is overfitting, try lowering C to regularize it (see the sketch below)
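
A minimal sketch of this effect of C, using the iris data that section 4.1.3 below also uses (the two C values are arbitrary): a smaller C gives a wider street (larger \(\frac{1}{||w||}\)) but more margin violations.

import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris['data'])
y = (iris['target'] == 2).astype(int)          # binary task: virginica vs the rest

for C in (0.01, 100):
    clf = LinearSVC(C=C, loss='hinge', dual=True, max_iter=100000).fit(X, y)
    # signed functional margin of every sample; values below 1 violate the margin
    margins = clf.decision_function(X) * np.where(y == 1, 1, -1)
    print('C=%-6s street half-width=%.3f  margin violations=%d'
          % (C, 1 / np.linalg.norm(clf.coef_), (margins < 1).sum()))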

4.1 Linearly separable data: the LinearSVC class

4.1.1 Important LinearSVC parameters

  • penalty: string, 'l1' or 'l2', default='l2'
  • loss: string, 'hinge' or 'squared_hinge', default='squared_hinge'; 'hinge' is the standard SVM loss function
  • dual: bool, default=True; when n_samples > n_features, prefer dual=False. The primal and dual problems of the SVM have the same solution
  • tol: float, default=1e-4, tolerance for the stopping criterion
  • C: float, default=1.0, penalty coefficient for the slack variables
  • multi_class: default='ovr'; this parameter normally does not need to be changed
  • See the source code for more details

4.1.2 The hinge loss function

The hinge loss is the function max(0, 1-t): when t >= 1 the function equals 0; when t < 1 its derivative is -1.
import numpy as np
import matplotlib.pyplot as plt

def hinge(x):
    # hinge loss f(t) = max(0, 1 - t)
    if x >= 1:
        return 0
    else:
        return 1 - x

x = np.linspace(-2, 4, 20)
y = [hinge(i) for i in x]
ax = plt.subplot(111)
plt.ylim([-1, 2])
ax.plot(x, y, 'r-')
plt.text(0.5, 1.5, r'f(t) = max(0,1-t)', fontsize=20)
plt.show()

4.1.3 A LinearSVC example

from sklearn import datasets
import pandas as pd
iris = datasets.load_iris()
print(iris.keys())
print('labels:',iris['target_names'])
features,labels = iris['data'],iris['target']
print(features.shape,labels.shape)

# Examine the data set
print('-------feature_names:',iris['feature_names'])
iris_df = pd.DataFrame(features)
print('-------info:',iris_df.info())
print('--------describe:',iris_df.describe())
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
labels: ['setosa' 'versicolor' 'virginica']
(150, 4) (150,)
-------feature_names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
0    150 non-null float64
1    150 non-null float64
2    150 non-null float64
3    150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB
-------info: None
--------describe:                0           1           2           3
count  150.000000  150.000000  150.000000  150.000000
mean     5.843333    3.057333    3.758000    1.199333
std      0.828066    0.435866    1.765298    0.762238
min      4.300000    2.000000    1.000000    0.100000
25%      5.100000    2.800000    1.600000    0.300000
50%      5.800000    3.000000    4.350000    1.300000
75%      6.400000    3.300000    5.100000    1.800000
max      7.900000    4.400000    6.900000    2.500000
# Preprocess the data
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC
from scipy.stats import uniform

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(features)
print(X.mean(axis=0))
print(X.std(axis=0))
# Encode the labels
encoder = LabelEncoder()
Y = encoder.fit_transform(labels)

# Hyperparameter tuning
svc = LinearSVC(loss='hinge',dual=True)
param_distributions = {'C':uniform(0,10)}
rscv_clf =RandomizedSearchCV(estimator=svc, param_distributions=param_distributions,cv=3,n_iter=20,verbose=2)
rscv_clf.fit(X,Y)
print(rscv_clf.best_params_)
[-1.69031455e-15 -1.84297022e-15 -1.69864123e-15 -1.40924309e-15]
[1. 1. 1. 1.]
Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV] C=8.266733168092582 .............................................
[CV] .............................. C=8.266733168092582, total=   0.0s
[CV] C=8.266733168092582 .............................................
[CV] .............................. C=8.266733168092582, total=   0.0s
[CV] C=8.266733168092582 .............................................
[CV] .............................. C=8.266733168092582, total=   0.0s
[CV] C=8.140498369662586 .............................................
[CV] .............................. C=8.140498369662586, total=   0.0s
...
...
...
[CV] .............................. C=9.445168322251103, total=   0.0s
[CV] C=9.445168322251103 .............................................
[CV] .............................. C=9.445168322251103, total=   0.0s
[CV] C=2.100443613273717 .............................................
[CV] .............................. C=2.100443613273717, total=   0.0s
[CV] C=2.100443613273717 .............................................
[CV] .............................. C=2.100443613273717, total=   0.0s
[CV] C=2.100443613273717 .............................................
[CV] .............................. C=2.100443613273717, total=   0.0s
{'C': 3.2357870215300046}
# Model evaluation
y_prab = rscv_clf.predict(X)
result = np.equal(y_prab,Y).astype(np.float32)
print('accuracy:',np.sum(result)/len(result))
accuracy: 0.9466666666666667
from sklearn.metrics import accuracy_score,precision_score,recall_score

print('accuracy_score:',accuracy_score(y_prab,Y))
print('precision_score:',precision_score(y_prab,Y,average='micro'))
accuracy_score: 0.9466666666666667
precision_score: 0.9466666666666667
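
As an optional extra check in the same session, a per-class report can also be printed (a small sketch continuing with the y_prab, Y and iris objects defined above):

from sklearn.metrics import classification_report

print(classification_report(Y, y_prab, target_names=iris['target_names']))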

5 Appendix

5.1 Nonlinear SVM classification: the SVC class

The SVC class supports both linear and nonlinear classification through its kernel parameter. Its parameters and attributes are described below, followed by a short usage sketch after the parameter list.

5.1.1 SVC parameters

  • C: penalty coefficient, float, default=1.0
  • kernel: string, default='rbf'; kernel choice, must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable
  • degree: only meaningful when kernel='poly'; the degree of the polynomial kernel
  • gamma: float, default='auto', kernel coefficient
  • coef0: float, optional (default=0.0), independent term in the kernel function; only significant for 'poly' and 'sigmoid', it influences how much the model is affected by high-degree versus low-degree terms
  • shrinking: bool, default=True
  • probability: bool, default=False
  • tol: tolerance for the stopping criterion
  • cache_size:
  • class_weight: class label weights
  • verbose: logging verbosity
  • max_iter: maximum number of iterations
  • decision_function_shape: 'ovo' or 'ovr', default='ovr'
  • random_state:
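
A minimal SVC usage sketch with the Gaussian RBF kernel on the iris data from section 4.1.3 (C left at its default value, untuned):

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris['data'])
y = iris['target']

clf = SVC(kernel='rbf', C=1.0)        # gamma left at its default value
clf.fit(X, y)
print('support vectors per class:', clf.n_support_)
print('training accuracy:', clf.score(X, y))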

5.1.2 SVC attributes

  • support_:
  • support_vectors_:
  • n_support_:
  • dual_coef_:
  • coef_:
  • intercept_:
  • fit_status_:
  • probA_:
  • probB_:

5.1.3 Choosing a kernel

With so many kernel functions available, how do you decide which one to use? A rule of thumb is to always try the linear kernel first (LinearSVC is much faster than SVC(kernel='linear')), especially when the training set is very large or has very many features. If the training set is not too large, also try the Gaussian RBF kernel; it works well in most cases. If you still have time and compute to spare, use cross-validation and grid search to experiment with a few other kernels, especially ones specialized for your data set's structure.
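
A small sketch following that rule of thumb, comparing a linear kernel and the RBF kernel with 5-fold cross-validation on the iris data (hyperparameters untuned):

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris['data'])
y = iris['target']

candidates = [('LinearSVC', LinearSVC(dual=True, max_iter=100000)),
              ('SVC linear', SVC(kernel='linear')),
              ('SVC rbf', SVC(kernel='rbf'))]
for name, clf in candidates:
    scores = cross_val_score(clf, X, y, cv=5)
    print('%-11s mean accuracy: %.3f' % (name, scores.mean()))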

5.2 The GridSearchCV class

5.2.1 GridSearchCV parameters

  • estimator: an estimator inheriting from BaseEstimator
  • param_grid: dict; keys are parameter names, values are the candidate values to test for each parameter
  • scoring: default=None
  • fit_params:
  • n_jobs: number of jobs to run in parallel; None means 1 job, -1 means all processors, default=None
  • cv: cross-validation strategy, None or an integer; None means the default 3-fold, an integer specifies the number of folds in a (stratified) KFold
  • verbose: logging verbosity
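
A minimal GridSearchCV sketch for SVC on the iris data (the candidate values below are illustrative, not tuned recommendations):

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris['data'])
y = iris['target']

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=SVC(kernel='rbf'), param_grid=param_grid, cv=3)
grid.fit(X, y)
print('best_params_:', grid.best_params_)
print('best_score_ :', grid.best_score_)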

5.2.2 GridSearchCV attributes

  • cv_results_: dict of numpy(masked) ndarray
  • best_estimator_:
  • best_score_: Mean cross-validated score of the best_estimator
  • best_params_:
  • best_index_: int,The index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting
  • scorer_:
  • n_splits_: The number of cross-validation splits (folds/iterations)
  • refit_time_: float

5.3 The RandomizedSearchCV class

5.3.1 RandomizedSearchCV parameters

  • estimator: an estimator inheriting from BaseEstimator
  • param_distributions: dict; keys are parameter names (strings), values are distributions or lists of parameters to try. Distributions must provide an ``rvs`` method for sampling (such as those from scipy.stats.distributions). If a list is given, it is sampled uniformly
  • n_iter: number of parameter settings sampled, default=10
  • scoring: default=None
  • fit_params:
  • n_jobs: number of jobs to run in parallel; None means 1 job, -1 means all processors, default=None
  • cv: cross-validation strategy, None or an integer; None means the default 3-fold, an integer specifies the number of folds in a (stratified) KFold
  • verbose: logging verbosity

5.3.2 RandomizedSearchCV attributes

  • cv_results_: dict of numpy(masked) ndarray
  • best_estimator_:
  • best_score_: Mean cross-validated score of the best_estimator
  • best_params_:
  • best_index_: int,The index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting
  • scorer_:
  • n_splits_: The number of cross-validation splits (folds/iterations)
  • refit_time_: float



Origin www.cnblogs.com/xiaobingqianrui/p/11107042.html