Machine Learning Trio (Series VII): A Practical Guide to Support Vector Machines (with Code)

Disclaimer: This is an original article by the blogger and may not be reproduced without permission. Source: https://blog.csdn.net/x454045816/article/details/79009864
Welcome to follow our WeChat public account "smart algorithm"; let's learn and progress together.
To view the previous articles in this series, reply "machine learning" to the public account.
In the previous six installments we studied the logistic regression algorithm; by following the public account "smart algorithm" you can read the whole series.
The keyword for downloading this article's code is given at the end; reply with it to the public account to get the code.
The logistic regression algorithm from last time and the support vector machine we discuss today are in fact somewhat similar: both evolved from the perceptron. The support vector machine (SVM) is a powerful and widely applicable machine learning algorithm, capable of linear and nonlinear classification, linear and nonlinear regression, and even outlier detection. It is arguably one of the most widely used machine learning algorithms. This article looks at support vector machines from a practical point of view.


I. Linear Support Vector Machines
Let us explain the basic idea of support vector machines with a few figures. The figure below shows a classification of the iris dataset. The two kinds of flowers can easily be separated by a straight line because the dataset is linearly separable. The left panel shows three possible classifiers: the dashed line has no way of separating the two classes, while the other two lines do separate them, but their decision boundaries pass so close to the instances that they will almost certainly perform poorly on new instances. By contrast, the solid line in the right panel is the decision boundary learned by an SVM: it not only separates the two classes, but also stays as far away as possible from the closest training instances. An SVM can be thought of as searching for the decision boundary that is farthest from the nearest instances of each class, which is why it is also called a large margin classifier.
Notice in the figure above that adding more training instances far away from the decision boundary does not affect the boundary at all: it is entirely determined by the instances of each class closest to the edge of the margin (circled in the figure). These instances are called support vectors.


II. Soft Margin Maximization
If all instances can be strictly placed on the correct sides of the margin, the SVM is computed by hard margin maximization; the SVMs discussed above are of this kind. This raises two problems: first, a hard margin classifier only works when the data is linearly separable; second, it is quite sensitive to outliers. In the figure below:
The left panel contains one extra outlier, which makes it impossible to find any hard margin at all, so the SVM above cannot be used. In the right panel, a single outlier pushes the decision boundary to the right, seriously hurting the model's ability to generalize.
To avoid these problems we need a more flexible model that can strike a balance between keeping the margin as large as possible and limiting margin violations, so that the dataset becomes classifiable. The SVM obtained this way is computed by soft margin maximization.
In Scikit-Learn, this balance is controlled by the hyperparameter C: the smaller C is, the wider the margin, but at the cost of more margin violations. The figure below shows two different soft margin SVMs trained on a linearly inseparable dataset.
The left panel shows an SVM trained with a small value of C: the margin is large, but many samples end up inside it. The right panel shows an SVM trained with a large value of C: the margin is smaller and there are fewer margin violations.
In addition, if your trained soft margin SVM is overfitting, you can regularize it by reducing the value of C.
Below we use an SVM to classify the Iris dataset, as in the sketch that follows.
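The original post shows the training code as a screenshot; a minimal sketch consistent with the text (the choice of petal features and the Iris virginica target are our assumptions) might look like this:

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Load iris and set up a binary task: petal length and petal width
# as features, Iris virginica as the positive class.
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                    # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)   # 1 if Iris virginica, else 0

# Soft margin linear SVM; feature scaling matters a lot for SVMs.
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)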
Calling svm_clf.predict([[5.5, 1.7]]) returns class 1. Unlike the logistic regression covered in the previous installment, the SVM does not output a predicted probability; it directly outputs the predicted class. You could also build the SVM model with Scikit-Learn's SVC(kernel="linear", C=1), but it is much slower, especially on large datasets, so it is not recommended. Another option is to build the SVM with SGDClassifier, using SGDClassifier(loss="hinge", alpha=1/(m*C)); this trains a linear SVM with the regularized stochastic gradient descent method from Series V. An SVM trained this way does not converge as fast as LinearSVC, but it is useful for huge datasets that do not fit in memory and for online classification tasks.
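For reference, the two alternatives just mentioned could be written as follows (m is the number of training instances, X from the snippet above; these are sketches of the call signatures, not tuned settings):

from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

m, C = len(X), 1

# Alternative 1: SVC with a linear kernel (slow on large datasets).
svc_clf = SVC(kernel="linear", C=C)

# Alternative 2: linear SVM trained by stochastic gradient descent,
# handy for out-of-core or online learning.
sgd_clf = SGDClassifier(loss="hinge", alpha=1/(m*C))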


III. Linearly Inseparable Support Vector Machines
Although linear SVMs are very efficient and work surprisingly well on many datasets, many datasets are simply not linearly separable, and even tuning the hyperparameter C still gives poor results. One way to handle such nonlinear datasets is to add more features, such as polynomial features; in some cases this can turn a nonlinear dataset into a linearly separable one. In the figure below, the left panel shows a simple dataset with a single feature X1; this dataset is not linearly separable. But after adding a second feature X2 = (X1)², the dataset becomes two-dimensional and linearly separable.
So before training an SVM on a nonlinear dataset, we first add polynomial features. Let us see how this works on an example: the figure below shows a moon-shaped dataset, generated by

from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

The dataset is clearly not linearly separable, so let us add polynomial features and then train the SVM, as in the sketch below.
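The original shows the training code as an image; a minimal sketch consistent with the text (degree 3 and the C and loss settings are our assumptions) is:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Add polynomial features, scale them, then train a linear SVM.
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])
polynomial_svm_clf.fit(X, y)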
The resulting classification is shown below:
Adding polynomial features before training is a simple but effective preprocessing step, and it works with all kinds of machine learning algorithms. However, a low-degree polynomial cannot handle very complex datasets, while a high-degree one generates a huge number of features and makes the trained model very slow. Fortunately, SVMs offer the kernel trick to solve this problem: with the kernel trick we get the same result as if we had added the polynomial features, even very high-degree ones, without actually adding them. Let us try it.
In the code, degree selects the polynomial degree, and coef0 controls how much the model is influenced by the polynomial, i.e. the r in the figure below. Let us compare the effect of a 3rd-degree and a 10th-degree polynomial kernel, as in the sketch below.
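A minimal sketch of the kernelized version (C=5 and coef0=1 are illustrative choices on our part; degree and coef0 are the hyperparameters discussed above):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 3rd-degree polynomial kernel on the moons data; for the 10th-degree
# model, set degree=10 (and a larger coef0, e.g. 100).
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)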
Clearly the right panel is overfitting, so you could reduce the polynomial degree; conversely, if the model is underfitting, you can increase the degree. The optimal degree and r can be found by grid search.


IV. The Gaussian Kernel
Another way to handle nonlinear problems is to use a similarity function, computing how much each instance resembles a chosen landmark and adding the result to the training set as a new feature. For example, take the earlier dataset with the single feature X1 and add two landmarks at x1 = -2 and x1 = 1; then define the similarity function as the Gaussian radial basis function with γ = 0.3.
The Gaussian kernel measures the similarity between x and a landmark ℓ, mapping it into the range 0 to 1: φ(x, ℓ) = exp(−γ‖x − ℓ‖²). Now we compute the new features. For example, the instance x1 = -1 is at distance 1 from the first landmark and distance 2 from the second, so its new features are x2 = exp(−0.3 × 1²) ≈ 0.74 and x3 = exp(−0.3 × 2²) ≈ 0.30. The instance x1 = -1 is thus transformed into the point (0.74, 0.30). Transforming all the points this way gives the right panel of the figure below, where the dataset has become linearly separable.
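A quick numeric check of these two values (plain NumPy, nothing SVM-specific):

import numpy as np

gamma = 0.3
landmarks = np.array([-2.0, 1.0])
x1 = -1.0

# Gaussian similarity to each landmark: exp(-gamma * distance^2)
features = np.exp(-gamma * (x1 - landmarks) ** 2)
print(features)   # approximately [0.74, 0.30]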
But how should the landmarks be chosen? A simple approach is to create a landmark at the location of every instance in the dataset. This creates many dimensions and increases the chance that the transformed dataset will be linearly separable. The downside is that a training set with m instances and n features is transformed into one with m instances and m features; once the training set is very large, you end up with an equally large number of features, increasing the computation time.
Just as with polynomial features, the SVM algorithm can achieve this effect directly through the Gaussian kernel trick. Again using the crescent-shaped dataset, the method, sketched below, is as follows:
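A minimal sketch (gamma=5 and C=0.001 are one illustrative hyperparameter pair; the figures compare several such pairs):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Gaussian (RBF) kernel SVM on the moons data.
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)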
Different classifiers can be obtained by setting different values of gamma and C. Like C, gamma acts as a regularization hyperparameter: if the model is overfitting or underfitting, adjust these two values to reach the best result.


V. Regression
At the beginning of this article we mentioned that SVMs can solve both linear and nonlinear regression problems. The method works by reversing the objective we train the model with: instead of, as in classification, looking for the largest possible margin between two classes while keeping misclassifications to a minimum, SVM regression tries to fit as many instances as possible inside the margin around the regression line. We trained an SVM regression model on some random linear data, shown below:
The hyperparameter ε in the figure above controls the width of the regression margin. Adding more training instances inside the margin does not affect the model's predictions, so the SVM regression model is said to be ε-insensitive. In Scikit-Learn, LinearSVR is used to train the regression model, as in the sketch below, which corresponds to the left panel of the figure above.
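A minimal sketch (the random linear data and the ε value are illustrative choices on our part):

import numpy as np
from sklearn.svm import LinearSVR

# Some random linear data (illustrative).
np.random.seed(42)
m = 50
X = 2 * np.random.rand(m, 1)
y = (4 + 3 * X + np.random.randn(m, 1)).ravel()

# epsilon sets the width of the regression margin.
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)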
For nonlinear regression tasks, you can likewise use a kernelized SVM model, as in the sketch below.
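A minimal sketch with a 2nd-degree polynomial kernel (the quadratic data, degree, C and ε are our illustrative choices):

import numpy as np
from sklearn.svm import SVR

# Some random quadratic data (illustrative).
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1) - 1
y = (0.2 + 0.1 * X + 0.5 * X**2 + np.random.randn(m, 1) / 10).ravel()

# Kernelized SVM regression; a large C means little regularization.
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)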
Different models are obtained by setting different values of C and ε.
The left panel uses little regularization (a large C), while the right panel uses much stronger regularization (a small C).


VI. Summary
This article studied support vector machines from a practical point of view, covering classification of both linearly separable and linearly inseparable data. The linearly separable case involves hard margin classification; the linearly inseparable case involves soft margin classification and the polynomial and Gaussian kernel tricks. We also used SVMs for regression: thanks to the ε-insensitive property of SVM regression, we can set the width of the margin. SVMs can also be applied to outlier detection. This article does not cover the theory behind SVMs; if you are interested, see the public account for more on support vector machines.


(For a better understanding of the relevant material, you are welcome to join the intelligent algorithm community: send "community" to the "smart algorithm" public account to join our WeChat and QQ algorithm groups.)
Keyword to reply for this article's code: svm_code
