Support Vector Machines (SVM) with sklearn and Python

Copyright notice: this is an original post by the author; please credit the original article when reposting. https://blog.csdn.net/yuanlulu/article/details/81011849

Code

The code is adapted from someone else's code (http://ihoge.cn/2018/SVWSVC.html), with model saving and extra print statements added.

This example mainly demonstrates how to use three different kernel functions: linear, Gaussian (RBF), and polynomial.

The data is generated automatically; the generating interface is make_blobs.

from sklearn import svm
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np
import joblib  # sklearn.externals.joblib was removed in newer sklearn versions; use the standalone joblib package

# generate test data
X, y = make_blobs(n_samples=100, centers=3, random_state=0, cluster_std=0.8)

# construct the SVM classifier instances
clf_linear = svm.SVC(C=1.0, kernel='linear')
clf_poly = svm.SVC(C=1.0, kernel='poly', degree=3)
clf_rbf = svm.SVC(C=1.0, kernel='rbf', gamma=0.5)
clf_rbf2 = svm.SVC(C=1.0, kernel='rbf', gamma=0.1)

plt.figure(figsize=(10, 10), dpi=144)

clfs = [clf_linear, clf_poly, clf_rbf, clf_rbf2]
titles = [  'Linear Kernel',
            'Polynomial Kernel with Degree=3',
            'Gaussian Kernel with gamma=0.5',
            'Gaussian Kernel with gamma=0.1']

# train and predict
for i, clf in enumerate(clfs):
    clf.fit(X, y)
    print("{}'s score:{}".format(titles[i], clf.score(X, y)))
    out = clf.predict(X)
    print("out's shape:{}, out:{}".format(out.shape, out))
    # plotting (plot_hyperplane is not defined in this excerpt):
    # plt.subplot(2, 2, i+1)
    # plot_hyperplane(clf, X, y, title=titles[i])

# references: http://scikit-learn.org/stable/modules/model_persistence.html
# http://sofasofa.io/forum_main_post.php?postid=1001002
# save each trained model to a file on disk
for i, clf in enumerate(clfs):
    joblib.dump(clf, str(i) + '.pkl')

# load model from file and test
for i in range(len(clfs)):
    clf = joblib.load(str(i)+'.pkl')
    print( "{}'s score:{}".format( titles[i], clf.score( X, y ) ) )

As a support vector classifier, SVC has three commonly used methods:
- fit: train the model
- predict: predict labels for samples
- score: evaluate the mean accuracy

sklearn models can be saved and loaded with joblib, which works directly with files on disk.
There is also pickle, which can serialize a model to an in-memory object instead. For saving and loading models, see the official model-persistence documentation linked in the comments above.
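As an example, here is a minimal sketch of in-memory persistence with pickle (the variable names are illustrative, and the data is the same make_blobs setup used above):

import pickle

from sklearn import svm
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, random_state=0, cluster_std=0.8)
clf = svm.SVC(C=1.0, kernel='linear').fit(X, y)

# serialize the trained model to an in-memory bytes object
model_bytes = pickle.dumps(clf)

# restore it later without touching the filesystem
clf_restored = pickle.loads(model_bytes)
print(clf_restored.score(X, y))  # should match the original model's score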

The output of the kernel-comparison code above is:

Linear Kernel's score:0.98
out's shape:(100,), out:[1 0 1 0 0 0 2 2 1 0 0 0 1 0 2 1 2 0 2 2 2 2 2 0 1 1 1 1 2 2 0 1 1 0 2 0 0
 1 1 2 2 1 1 0 0 0 1 1 2 2 0 1 0 1 2 2 1 1 0 1 1 2 2 2 2 1 0 2 1 0 2 0 0 1
 1 0 0 0 2 1 0 0 1 0 1 0 0 0 1 0 1 1 2 2 2 2 0 0 2 2]
Polynomial Kernel with Degree=3's score:0.95
out's shape:(100,), out:[1 0 1 0 2 0 2 2 1 0 0 0 1 0 2 1 2 2 2 2 2 2 2 2 1 1 1 1 2 2 0 1 1 0 2 0 0
 1 1 2 2 1 1 0 0 0 1 1 2 2 0 1 0 1 2 2 1 1 0 1 1 2 2 2 2 1 0 2 1 0 2 0 0 1
 1 0 0 0 2 1 0 0 1 0 1 0 0 0 1 0 1 1 2 2 2 2 0 0 2 2]
Gaussian Kernel with gamma=0.5's score:0.98
out's shape:(100,), out:[1 0 1 0 0 0 2 2 1 0 0 0 1 0 2 1 2 0 2 2 2 2 2 0 1 1 1 1 2 2 0 1 1 0 2 0 0
 1 1 2 2 1 1 0 0 0 1 1 2 2 0 1 0 1 2 2 1 1 0 1 1 2 2 2 2 1 0 2 1 0 2 0 0 1
 1 0 0 0 2 1 0 0 1 0 1 0 0 0 1 0 1 1 2 2 2 2 0 0 2 2]
Gaussian Kernel with gamma=0.1's score:0.96
out's shape:(100,), out:[1 0 1 0 0 0 2 2 1 0 0 0 1 0 2 1 2 0 2 2 2 2 2 0 1 1 1 1 2 2 0 1 1 0 2 0 0
 1 1 2 2 1 1 0 0 0 1 1 2 2 0 1 0 1 2 2 1 1 0 1 1 2 2 2 2 1 0 2 1 0 2 1 2 1
 1 0 0 0 2 1 0 0 1 0 1 0 0 0 1 0 1 1 2 2 2 2 0 0 2 2]

Linear Kernel's score:0.98
Polynomial Kernel with Degree=3's score:0.95
Gaussian Kernel with gamma=0.5's score:0.98
Gaussian Kernel with gamma=0.1's score:0.96

Hyperparameter search

With GridSearchCV, you can search the model's hyperparameter space and select the optimal hyperparameters.

The search can run over several hyperparameters at once or over a single one.

In my tests, however, searching over C ran into problems, so I only searched gamma. The code below also draws on 曹永昌's code.

from sklearn import svm
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

X, y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.8)

X_train = X[:350]
y_train = y[:350]
X_test = X[350:]
y_test = y[350:]

thresholds = np.linspace(0, 0.001, 100)  # candidate gamma values (gamma should normally be positive)
C_nums = np.linspace(0.1, 0.02, 5)       # candidate C values (unused here, see the note above)
#param_grid = {'gamma': thresholds, 'C': C_nums}
param_grid = {'gamma': thresholds}
#param_grid = {'C': C_nums}

clf = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
clf.fit(X_train, y_train)
print("best param: {0}\nbest score: {1}".format(clf.best_params_,
                                                clf.best_score_))
y_pred = clf.predict(X_test)
print("y_pred:{}".format(y_pred))
print("y_test:{}".format(y_test))

print("查准率:",metrics.precision_score(y_pred, y_test))
print("召回率:",metrics.recall_score(y_pred, y_test))
print("F1:",metrics.f1_score(y_pred, y_test))

The output is:

best param: {'gamma': 0.00047474747474747476}
best score: 0.9857142857142858
y_pred:[0 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0
 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 1 0 0
 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 1
 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0
 1 0]
y_test:[0 1 1 0 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0
 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 1 0 0
 1 0 0 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1
 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0
 1 0]
precision: 1.0
recall: 0.9642857142857143
F1: 0.9818181818181818

As you can see, the best gamma found is 0.00047474747474747476, with a corresponding best score of 0.9857142857142858.

The same search could be done by writing the loop yourself, though it is a bit more tedious, as sketched below.
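A minimal sketch of the manual equivalent, reusing thresholds, X_train and y_train from the code above:

from sklearn.model_selection import cross_val_score

best_gamma, best_score = None, -1.0
for gamma in thresholds[thresholds > 0]:  # skip the degenerate gamma=0 candidate
    scores = cross_val_score(svm.SVC(kernel='rbf', gamma=gamma), X_train, y_train, cv=5)
    if scores.mean() > best_score:
        best_gamma, best_score = gamma, scores.mean()
print("best gamma: {}, best score: {}".format(best_gamma, best_score))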

API overview

A brief overview of the interfaces used in the code above.

make_blobs

Generates Gaussian-distributed cluster samples. See the official documentation.

Parameters:

  • n_samples : int, optional (default=100). The total number of points, equally divided among clusters.

  • n_features : int, optional (default=2). The number of features for each sample.

  • centers : int or array of shape [n_centers, n_features], optional (default=3). The number of centers to generate, or the fixed center locations.

  • cluster_std : float or sequence of floats, optional (default=1.0). The standard deviation of the clusters.

  • center_box : pair of floats (min, max), optional (default=(-10.0, 10.0)). The bounding box for each cluster center when centers are generated at random.

  • shuffle : boolean, optional (default=True). Shuffle the samples.

  • random_state : int, RandomState instance or None, optional (default=None). If int, random_state is the seed used by the random number generator; if a RandomState instance, it is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Returns:

  • X : array of shape [n_samples, n_features]. The generated samples.

  • y : array of shape [n_samples]. The integer labels for cluster membership of each sample.
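A minimal sketch checking these return values:

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, random_state=0)
print(X.shape)  # (100, 2) -- n_features defaults to 2
print(y.shape)  # (100,)
print(set(y))   # {0, 1, 2} -- one integer label per cluster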

svm.SVC

This interface creates an SVM classifier instance.

The parameter descriptions below are adapted from another post.

C: the penalty parameter C of C-SVC. Default is 1.0.

A larger C penalizes the slack variables more heavily, pushing them toward 0 and increasing the cost of misclassification, so the model tries to classify the entire training set correctly; training accuracy is then high, but generalization is weak. A smaller C reduces the penalty for misclassification, tolerating some errors as noise, which gives stronger generalization.
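A minimal sketch of this trade-off (the data, split, and parameter values are illustrative):

from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=200, centers=2, random_state=0, cluster_std=2.0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = svm.SVC(C=C, kernel='rbf', gamma=0.5).fit(X_tr, y_tr)
    # a large C tends to raise the training score but can hurt the test score
    print("C={}: train={:.2f}, test={:.2f}".format(
        C, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))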

kernel: the kernel function. Default is 'rbf'; options include 'linear', 'poly', 'rbf' and 'sigmoid'. The formulas can be checked against sklearn's pairwise kernel helpers, as sketched after this list.

  • linear – linear kernel: u'*v
  • poly – polynomial kernel: (gamma*u'*v + coef0)^degree
  • rbf – RBF (Gaussian) kernel: exp(-gamma*|u-v|^2)
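A minimal sketch verifying the rbf and poly formulas (the vectors and parameter values are arbitrary):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

u = np.array([[1.0, 2.0]])
v = np.array([[3.0, 0.5]])
gamma, coef0, degree = 0.5, 1.0, 3

# rbf: exp(-gamma * |u - v|^2)
manual_rbf = np.exp(-gamma * np.sum((u - v) ** 2))
print(manual_rbf, rbf_kernel(u, v, gamma=gamma)[0, 0])

# poly: (gamma * u'*v + coef0)^degree
manual_poly = (gamma * u.dot(v.T)[0, 0] + coef0) ** degree
print(manual_poly, polynomial_kernel(u, v, degree=degree, gamma=gamma, coef0=coef0)[0, 0])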

degree: the degree of the 'poly' kernel. Default is 3; ignored by the other kernels.

gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The default 'auto' uses 1/n_features (newer sklearn versions default to 'scale' instead).

coef0: the constant term of the kernel function. Only significant for 'poly' and 'sigmoid'.

probability: whether to enable probability estimates. Default is False.

shrinking: whether to use the shrinking heuristic. Default is True.

tol: tolerance for the stopping criterion. Default is 1e-3.

cache_size: size of the kernel cache, in MB. Default is 200.

class_weight: per-class weights, passed as a dict. Sets the parameter C of class i to weight * C (the C of C-SVC).

verbose: whether to enable verbose output.

max_iter: maximum number of iterations; -1 means no limit.

decision_function_shape: 'ovo' or 'ovr'. Default is 'ovr'.

random_state: int seed used when shuffling the data for probability estimates.
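To tie a few of these parameters together, here is a minimal sketch (the parameter values are arbitrary examples):

from sklearn import svm
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, random_state=0)

clf = svm.SVC(C=1.0, kernel='rbf', gamma='auto',
              probability=True,            # enables predict_proba (training is slower)
              class_weight={0: 2.0},       # weight mistakes on class 0 twice as heavily
              decision_function_shape='ovr',
              tol=1e-3, max_iter=-1, random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))            # per-class probabilities, shape (3, 3)
print(clf.decision_function(X[:3]).shape)  # (3, 3) with 'ovr' and 3 classes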
