Support vector machines (SVM) in Python, explained in plain language

Compared with logistic regression, the SVM algorithm achieves better accuracy in many cases. The classic SVM only handles binary classification, but the kernel trick (kernel functions) lets it handle data that is not linearly separable, and schemes such as one-vs-rest extend it to multi-class tasks.

This article only covers how SVM and the kernel trick work in principle, then introduces the role of each parameter of the SVM in sklearn and ends with a hands-on demo, all as accessibly as possible. As for the formula derivations, there are already plenty of articles online, so they are not expanded on here.

1.SVM overview

A support vector machine finds, in an N-dimensional space, the hyperplane that separates the data most clearly. See the figure below:

As the figure shows, the two-dimensional plane contains two kinds of points, red and blue. There are many ways to separate them: each of the green lines in the figure splits the data into two parts.

What SVM does is find the best line (in two-dimensional space), or the best hyperplane (in higher-dimensional space), to separate the data. "Best" here means the one with the largest margin.

As for how the largest margin is found, put simply: the sum of the distances from the two classes of data to the hyperplane is called the margin, and the goal is to find the hyperplane that makes this margin as large as possible.

In the end this becomes an optimization problem: maximize the margin.
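To make the margin concrete, here is a minimal sketch on made-up toy data. It assumes a linear kernel and a very large C (to approximate a hard margin), and relies on the standard result that the margin width of a linear SVM is 2 / ||w||, where w is the weight vector in `coef_`. The data points are invented for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable groups of points (toy data, assumed).
# The closest points of the two classes are 3 units apart along the x axis.
X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C leaves almost no room for margin violations.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# For a linear SVM the margin width is 2 / ||w||.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)
print("margin width:", margin_width)
```

On this data the printed width should come out close to 3, the gap between the two groups.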

2.SVM kernel trick

The kernel trick mainly addresses the cases that a linear SVM cannot classify, that is, data that is not linearly separable.

For example, data like the following:

(Figure: kernel trick data)

In this case we can transform the data with a kernel function. For example, we manually define a new point and compute the Euclidean distance from every data point to it, which gives us a new feature. Points that are close to the new point are assigned to one class, and the rest to the other class. That is the kernel function.

(Figure: Gaussian kernel)

This is the most basic, and also a fairly intuitive, introduction. After reading it, doesn't it look a bit like Sigmoid? Both transform the data with a function and use the result to decide the class. In fact, Sigmoid is itself a kind of kernel function, and the approach described above is the Gaussian kernel.
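The idea of "distance to a hand-picked point" can be sketched in a few lines. This is only an illustration under assumptions: the data comes from sklearn's `make_circles`, and the landmark at the origin, `gamma = 1.0` and the threshold `0.65` are values chosen here by hand, not anything prescribed by SVM.

```python
import numpy as np
from sklearn.datasets import make_circles

# Inner circle = class 1, outer circle = class 0.
# No straight line separates them in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

landmark = np.array([0.0, 0.0])  # hand-picked point (assumption)
gamma = 1.0                      # width of the Gaussian (assumption)

# Euclidean distance from every sample to the landmark.
dist = np.linalg.norm(X - landmark, axis=1)

# Gaussian similarity: close to 1 near the landmark, close to 0 far away.
similarity = np.exp(-gamma * dist ** 2)

# One threshold on the new feature is now enough to split the classes.
threshold = 0.65  # chosen by eye for this data (assumption)
pred = (similarity > threshold).astype(int)
accuracy = (pred == y).mean()
print("accuracy with one landmark feature:", accuracy)
```

A real SVM does not need a hand-tuned threshold; the point is only that the new feature makes the two classes separable.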

A few additional points:

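  • 1. The figure above has only one point, but in practice there can be any number of them, which is why SVM is said to map the data into a higher-dimensional space. The distance to one point gives one dimension, two points give two dimensions, three points give three dimensions, and so on.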
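  • 2. The red point in the example above was picked by hand, which is not possible in practice. With the Gaussian kernel in an SVM, the training samples themselves play the role of these points, and training works out which of them (the support vectors) actually matter.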
  • 3. The example above corresponds to the Gaussian kernel. Other common kernels include the polynomial kernel, the Sigmoid kernel, and so on.
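The "one new dimension per point" idea can be sketched with `rbf_kernel` from `sklearn.metrics.pairwise`. The choice of the first three samples as points, the data set and `gamma = 1.0` are assumptions made for this illustration. The second half checks that passing the full Gaussian kernel matrix with `kernel="precomputed"` gives the same predictions as `kernel="rbf"`.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)
gamma = 1.0  # assumed kernel coefficient

# Three reference points -> every sample gets three new features.
landmarks = X[:3]
features = rbf_kernel(X, landmarks, gamma=gamma)
print(features.shape)

# Using every training sample as a reference point gives the full kernel matrix.
K = rbf_kernel(X, X, gamma=gamma)

clf_pre = SVC(kernel="precomputed").fit(K, y)
clf_rbf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

# Both models solve the same problem, so their predictions should agree.
same = np.array_equal(clf_pre.predict(K), clf_rbf.predict(X))
print("same predictions:", same)
```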

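OK, that is a first introduction to the kernel trick (kernel functions). More advanced material is not expanded on here, since there are already plenty of tutorials online.

Next we move on to using SVM in sklearn.

3.SVM parameters in sklearn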

def SVC(C=1.0, 
             kernel='rbf', 
             degree=3, 
             gamma='auto_deprecated',
             coef0=0.0, 
             shrinking=True, 
             probability=False,
             tol=1e-3, 
             cache_size=200, 
             class_weight=None,
             verbose=False, 
             max_iter=-1, 
             decision_function_shape='ovr',
             random_state=None)
 
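- C: similar to the regularization coefficient in logistic regression. It must be a positive float and defaults to 1.0. The smaller the value, the stronger the regularization. In other words, the smaller the value, the better the trained model generalizes, but the more easily it underfits.
- kernel: choice of kernel function. This one is more involved and is covered later.
- degree: degree of the polynomial, only takes effect when the kernel is the polynomial kernel ("poly"). int, default 3.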
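- gamma: kernel coefficient, only takes effect for the Gaussian, polynomial and Sigmoid kernels ("rbf", "poly", "sigmoid"). float. In the version shown above the default behaves like "auto" (that is, 1 / n_features). Since scikit-learn 0.22 the default is "scale", that is, 1 / (n_features * X.var()).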
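- coef0: independent term of the kernel function, only takes effect for the polynomial and Sigmoid kernels ("poly", "sigmoid"). float, default 0.0. The independent term is simply the constant term.
- shrinking: whether to use the shrinking heuristic, bool, default True. The shrinking heuristic can speed up the optimization. As the FAQ says, it sometimes helps and sometimes does not. I think it is a matter of run time rather than of convergence.
- probability: whether to enable probability estimates, bool, default False. When enabled, the probability of each sample belonging to each class is estimated, but this uses considerably more computing resources, so use it with care!
- tol: threshold for stopping the iterative solver, float, default 1e-3. Logistic regression has the same parameter and it does the same job.
- cache_size: size of the kernel cache used while training, float, default 200, in MB.
- class_weight: class weights, the same as in logistic regression, so I am reusing that description: it can be a dict or the string "balanced". The default is None, which means no special handling. With "balanced" the weights are computed automatically: the more samples a class has, the lower its weight, and vice versa. You can also pass in your own dict. For a 0/1 binary problem, for example, {0:0.1,1:0.9} gives class 0 a weight of 0.1 and class 1 a weight of 0.9. This is useful for problems where the samples are extremely imbalanced, such as network attacks: most traffic is normal and only a small part is attack traffic, but the attack traffic is very important and must be detected reliably. That is when to set this parameter.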
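- verbose: whether to print the detailed training process, bool, default False (no output). Unlike logistic regression, SVC has no "solver" parameter, so the solver restriction that applies there does not apply here.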
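- max_iter: maximum number of iterations, int, default -1 (no limit). Note that tol above also controls when iteration stops, but max_iter is a hard limit: once it is reached the solver stops, whether or not the tol criterion has been met.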
- decision_function_shape: choice of multi-class scheme, either "ovo" or "ovr"; "None" can also be chosen. The default is "ovr". The differences are explained below.
- random_state: random seed.
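As a rough illustration of how C changes the model, the sketch below trains the same linear SVM with a small and a large C and counts the support vectors. The synthetic data from `make_blobs` and the helper name `count_support_vectors` are assumptions made for this example. A smaller C means stronger regularization and a wider margin, so more points end up as support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic data, assumed for illustration).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [2, 2]],
                  cluster_std=1.0, random_state=0)


def count_support_vectors(C):
    """Train a linear SVC with the given C and return its number of support vectors."""
    clf = SVC(C=C, kernel="linear", tol=1e-3, max_iter=-1, class_weight=None)
    clf.fit(X, y)
    # n_support_ holds the number of support vectors per class.
    return int(clf.n_support_.sum())


n_sv_small_c = count_support_vectors(0.01)   # strong regularization
n_sv_large_c = count_support_vectors(100.0)  # weak regularization
print("support vectors, C=0.01:", n_sv_small_c)
print("support vectors, C=100 :", n_sv_large_c)
```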

sklearn-SVM parameters, kernel selection

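kernel: choice of kernel function, a string. The options are "linear", "poly", "rbf", "sigmoid", "precomputed" and a user-defined kernel function. The default is "rbf". The kernels are described below:
"linear": linear kernel, the most basic one. It is fast to compute, but it cannot map the data from a low-dimensional space to a higher-dimensional one.
"poly": polynomial kernel. It raises the dimensionality so that data that was not linearly separable becomes linearly separable.
"rbf": Gaussian kernel. It can map the data to an infinite-dimensional space; the drawback is the relatively high computational cost.
"sigmoid": Sigmoid kernel, closely related to the Sigmoid function from logistic regression (sklearn computes it as tanh(gamma * <x, x'> + coef0)). Using it is similar to using a one-layer neural network.
"precomputed": you provide an already computed kernel matrix and sklearn does not compute it again. This is probably not used often.
"user-defined kernel": sklearn uses the kernel function you provide.
After all that, here is a not very rigorous recommendation:
Many samples, many features, binary classification: linear kernel
Many samples, many features, multi-class classification: polynomial kernel
Few samples, many features, binary or multi-class classification: Gaussian kernel
Few samples, few features, binary or multi-class classification: Gaussian kernel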

Of course, in normal practice the kernel is usually chosen with cross-validation. The list above is only a rough recommendation.
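Choosing the kernel by cross-validation might look like the sketch below. It uses `GridSearchCV` on the full iris data set; the candidate values in `param_grid` and the 5-fold setting are assumptions picked for illustration, not recommended defaults.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate kernels and C values to compare (illustrative choices).
param_grid = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "C": [0.1, 1, 10],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(SVC(gamma="scale"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

`search.best_estimator_` then holds a model refitted on all the data with the winning combination.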

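sklearn-SVM parameters, multi-class schemes

This was already covered in the logistic regression article, but it is worth going over again here.

The original SVM is a binary classifier, but some tasks clearly need multi-class classification. Is there a way to make SVM do that? There certainly is, and more than one.

In fact, a binary classifier is easy to generalize to the multi-class case, for example by always treating one class as positive and everything else as 0.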

For example, with three classes A, B and C, we can treat A as the positive data and B and C as the negative data, so that a binary method solves the multi-class problem. This is the most commonly used approach, one-vs-rest, abbreviated OvR. It is also easy to carry over to other binary classification models (of course, other algorithms may have better multi-class methods of their own).

Another multi-class scheme is Many-vs-Many (MvM): it takes the samples of some of the classes as one group and the samples of some other classes as the other group, and trains a binary classifier between the two groups.

It sounds incredible, but it can indeed be done. Suppose the data has three classes A, B and C.

We take A and B as the positive data and C as the negative data and train one model. Then we take A and C as the positive data and B as the negative data and train another model. Finally we take B and C as the positive data and A as the negative data and train a third model.

With these three models we can do multi-class classification. This is only an example, of course; in actual use there are better MvM methods. Space is limited, so they are not expanded on here.

The most commonly used MvM scheme is One-vs-One (OvO). OvO is a special case of MvM: each time, the samples of two classes are selected and a binary classifier is trained on them.

Comparing the two multi-class schemes: OvR is usually simpler and faster, but its model accuracy is not as high as MvM's. MvM is the opposite: higher accuracy, but slower than OvR.
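The difference in the number of binary models can be seen with the generic wrappers in `sklearn.multiclass`. This is a sketch on synthetic four-class data from `make_blobs` (an assumption for illustration): with 4 classes, OvR trains 4 classifiers and OvO trains 4 * 3 / 2 = 6.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic data with 4 classes (assumed for illustration).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

# Wrap a binary linear SVM in each multi-class scheme.
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

n_ovr = len(ovr.estimators_)  # one classifier per class
n_ovo = len(ovo.estimators_)  # one classifier per pair of classes
print("OvR classifiers:", n_ovr)
print("OvO classifiers:", n_ovo)

# In SVC itself, decision_function_shape sets the shape of decision_function.
svc_ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
svc_ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
shape_ovr = svc_ovr.decision_function(X).shape
shape_ovo = svc_ovo.decision_function(X).shape
print(shape_ovr, shape_ovo)
```

According to the scikit-learn documentation, SVC always trains its models one-vs-one internally, and "ovr" only changes the shape of the decision function that is returned.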

4.sklearn SVM in practice

We use the iris data set again, but this time we classify only two of the flower species. First prepare the data:

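import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import svm, datasets

# Load the iris data set
iris = datasets.load_iris()
# Keep only the first two features so the result can be plotted in 2-D
tem_X = iris.data[:, :2]
tem_Y = iris.target
new_data = pd.DataFrame(np.column_stack([tem_X, tem_Y]))
# Filter out one of the flower species (label 1)
new_data = new_data[new_data[2] != 1.0]
# Build X and Y (Y is 1-D, which is what fit() expects)
X = new_data[[0, 1]].values
Y = new_data[2].values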

Then train on the data and generate the final plot:


# Fit an SVM model with a linear kernel
clf = svm.SVC(kernel='linear')
clf.fit(X, np.ravel(Y))

# Get the separating hyperplane: w[0]*x + w[1]*y + intercept = 0
w = clf.coef_[0]
# Slope of the separating line
a = -w[0] / w[1]
# 50 evenly spaced x values from -2 to 10 (num=50 is the default)
xx = np.linspace(-2, 10)
# Equation of the separating line in two dimensions
yy = a * xx - (clf.intercept_[0]) / w[1]
print("yy=", yy)

# Margin boundaries: the parallels where the decision function equals -1 and +1.
# The support vectors lie on or inside these two lines.
print("support_vectors_=", clf.support_vectors_)
yy_down = a * xx - (clf.intercept_[0] + 1) / w[1]
yy_up = a * xx - (clf.intercept_[0] - 1) / w[1]

# Plot the separating line and the two margin boundaries
plt.plot(xx, yy, 'k-')
plt.plot(xx, yy_down, 'k--')
plt.plot(xx, yy_up, 'k--')

# Circle the support vectors (an edge color is needed for hollow markers to show)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80,
            facecolors='none', edgecolors='k')

# Plot all samples, colored by class
plt.scatter(X[:, 0], X[:, 1], c=np.ravel(Y), cmap=plt.cm.Paired)

plt.axis('tight')
plt.show()

The final SVM classification result looks like this:
(Figure: iris classification results)

That's all ~

Origin www.cnblogs.com/listenfwind/p/11919487.html