Using support vector machines (SVM) in Python (with examples and source code)

From: http://www.cnblogs.com/luyaoblog/p/6775342.html

In addition to the svm algorithm in Matlab's PRTools toolbox, support vector machines can also be used for classification in Python, because the sklearn (scikit-learn) library integrates an SVM implementation. The code in this article was run in PyCharm.

First, import the sklearn algorithm package

The scikit-learn library implements all of the basic machine learning algorithms. For details, refer to the official documentation: http://scikit-learn.org/stable/auto_examples/index.html#support-vector-machines .

Many algorithms are integrated in sklearn; the import statements are as follows:

Logistic regression: from sklearn.linear_model import LogisticRegression

Naive Bayes: from sklearn.naive_bayes import GaussianNB

K-nearest neighbors: from sklearn.neighbors import KNeighborsClassifier

Decision tree: from sklearn.tree import DecisionTreeClassifier

Support vector machine: from sklearn import svm
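All of these expose the same fit/predict interface. As a quick sketch (not from the original post), the imports can be exercised together like this:

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

# Every estimator below offers the same fit(X, y) / predict(X) interface.
models = [LogisticRegression(), GaussianNB(), KNeighborsClassifier(),
          DecisionTreeClassifier(), svm.SVC()]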

Second, using SVC in sklearn

(1) Use loadtxt in numpy to read in the data file

How to use loadtxt():
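For reference, a simplified excerpt of the signature (recent NumPy versions):

np.loadtxt(fname, dtype=float, comments='#', delimiter=None, converters=None, skiprows=0, usecols=None, ...)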

  

fname: file path, e.g. C:/Dataset/iris.txt.

dtype: data type, e.g. float, str, etc.

delimiter: delimiter, e.g. ','.

converters: a dictionary mapping column indices to conversion functions, e.g. {1: fun} applies the function fun to the second column (index 1).

usecols: which columns of data to read.
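Before the iris example, here is a minimal, self-contained sketch of these parameters on a small in-memory file (hypothetical data, purely to show the call; note that some NumPy versions pass bytes rather than str to converters when reading from a file on disk):

import io
import numpy as np

buf = io.StringIO("1.0,2.0,yes\n3.0,4.0,no\n")   # two toy rows
flag = {'yes': 1.0, 'no': 0.0}                   # conversion table for column 2

data = np.loadtxt(
    buf,
    dtype=float,                        # parse every retained column as float
    delimiter=',',                      # fields are comma-separated
    converters={2: lambda s: flag[s]},  # convert the third column with a function
    usecols=(0, 2),                     # keep only the first and third columns
)
print(data)  # [[1. 1.]
             #  [3. 0.]]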

Take the Iris data set as an example:

The original Iris data set downloaded from the UCI repository looks like the excerpt below: the first four columns are feature columns, and the fifth column is the class column, with three classes: Iris-setosa, Iris-versicolor, and Iris-virginica.
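For reference, the first three lines of the raw iris.data file:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa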

  

When importing the data set with numpy's loadtxt function, the data type dtype is set to float, but the fifth column is clearly not a floating point value.

Therefore we need one extra step: map the fifth column to floating point values with a conversion function, passed through the converters parameter of loadtxt().

  First, we have to write a conversion function:

def iris_type(s):
    # Map the class name in column 5 to a number.
    # Note: some NumPy versions pass bytes here; decode first if needed.
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

Next, read in the data. In converters={4: iris_type}, the 4 refers to the fifth column (columns are 0-indexed):

path = u'D:/f盘/python/学习/iris.data'  # path to the data file
data = np.loadtxt(path, dtype=float, delimiter=',', converters={4: iris_type})

The result is a 150×5 floating point array, with the class names in the fifth column replaced by 0, 1, or 2 (original screenshot omitted).

(2) Divide the Iris data into a training set and a test set

x, y = np.split(data, (4,), axis=1)
x = x[:, :2]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.6)

1. np.split(data, indices, axis): axis=1 splits along columns (a horizontal split), axis=0 along rows (a vertical split); here (4,) cuts the array after the fourth column, giving the features x and the labels y.

2. x = x[:, :2]: only the first two feature columns are kept, so that the plots later are two-dimensional and more intuitive; training is done on these two features.

3. sklearn.model_selection.train_test_split randomly splits the data into a training set and a test set: train_test_split(train_data, train_target, test_size=..., random_state=0)

Parameter explanation:

train_data: the sample features to be split

train_target: the sample labels to be split

test_size: the proportion of samples to put in the test set; if an integer, the absolute number of test samples

random_state: the random seed.

Random seed: the seed identifies a particular stream of random numbers, so that an experiment can be repeated with the same randomness. If you pass random_state=1 (or any other fixed value) every time, you get the same split as long as the other parameters are unchanged; if you pass None (or omit the parameter), the split is different on every run. Random number generation depends on the seed, following two rules: different seeds produce different random numbers; the same seed produces the same random numbers, even across different instances.
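A minimal sketch of this behaviour (toy data, not the iris set):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

a = train_test_split(X, y, random_state=1, train_size=0.6)
b = train_test_split(X, y, random_state=1, train_size=0.6)

print(np.array_equal(a[0], b[0]))  # True: same seed -> same split
# With random_state=None (the default), the split changes from run to run.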

(3) Train the SVM classifier

# clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr')
clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')
clf.fit(x_train, y_train.ravel())

When kernel='linear', a linear kernel is used. The larger C is, the better the fit on the training data, but overfitting may occur (default C=1).

When kernel='rbf' (the default), a Gaussian kernel is used. The smaller gamma is, the smoother and more continuous the decision boundary; the larger gamma is, the more "scattered" the boundary becomes and the better it fits the training data, but overfitting may occur.

When decision_function_shape='ovr', one-vs-rest classification is used: each class is separated from all the remaining classes together.

When decision_function_shape='ovo', one-vs-one classification is used: a binary classifier is trained for every pair of classes, and the multi-class result is obtained by voting over these pairwise decisions.
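As a rough illustration of the overfitting note above, here is a sketch using sklearn's built-in copy of the iris data, restricted to the first two features as in this article:

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X = X[:, :2]  # first two features only, as in this article
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1, train_size=0.6)

for gamma in (1, 20, 200):
    clf = svm.SVC(C=0.8, kernel='rbf', gamma=gamma).fit(X_tr, y_tr)
    # As gamma grows, training accuracy tends to rise while test accuracy
    # stalls or drops: the overfitting pattern described above.
    print(gamma, clf.score(X_tr, y_tr), clf.score(X_te, y_te))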

(4) Compute the accuracy of the SVC classifier

print(clf.score(x_train, y_train))  # training accuracy
y_hat = clf.predict(x_train)
show_accuracy(y_hat, y_train, 'training set')
print(clf.score(x_test, y_test))    # test accuracy
y_hat = clf.predict(x_test)
show_accuracy(y_hat, y_test, 'test set')

The result is two accuracy values, for the training set and the test set (original screenshot omitted).

If you want to inspect the decision function, you can call decision_function():

print('decision_function:\n', clf.decision_function(x_train))
print('\npredict:\n', clf.predict(x_train))

The result is the score matrix followed by the predicted labels (original screenshot omitted).

With 'ovr', decision_function returns one column per class; the value in each column is a signed score (a kind of distance) for that class, and the class with the largest score is the prediction.
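Continuing with clf and x_train from above, a quick sketch (not in the original post) of how these scores relate to predict:

import numpy as np

scores = clf.decision_function(x_train)  # shape (n_samples, 3): one column per class with 'ovr'
print(scores.shape)
# The column with the largest score should be the predicted class.
print(np.all(scores.argmax(axis=1) == clf.predict(x_train).ravel()))
# With decision_function_shape='ovo' the shape would instead be
# (n_samples, n_classes * (n_classes - 1) / 2) pairwise scores.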

(5) Draw the image

1. Determine the axis ranges; the x and y axes represent the two features respectively:

x1_min, x1_max = x[:, 0].min(), x[:, 0].max()  # range of column 0
x2_min, x2_max = x[:, 1].min(), x[:, 1].max()  # range of column 1
x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]  # generate grid sample points
grid_test = np.stack((x1.flat, x2.flat), axis=1)  # points to classify
# print('grid_test =\n', grid_test)
grid_hat = clf.predict(grid_test)      # predicted class for each grid point
grid_hat = grid_hat.reshape(x1.shape)  # reshape to match the grid

The mgrid function is used here; what it does is briefly introduced below:

Suppose the objective function is F(x, y) = x + y, with x ranging over 1~3 and y over 4~6. Building the grid for the plot involves four main steps:

[Step 1: expand x] (replicated to the right):

[1 1 1]
[2 2 2]
[3 3 3]

[Step 2: expand y] (replicated downward):

[4 5 6]
[4 5 6]
[4 5 6]

[Step 3: pair up (xi, yi)]:

[(1,4) (1,5) (1,6)]
[(2,4) (2,5) (2,6)]
[(3,4) (3,5) (3,6)]

[Step 4: substitute (xi, yi) into F(x, y) = x + y]

So here x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j] builds two 200×200 grids in exactly this way; the imaginary step 200j means "take 200 points, endpoints included" (original screenshot of the arrays omitted).
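A minimal sketch reproducing the 1~3 / 4~6 walk-through above:

import numpy as np

x1, x2 = np.mgrid[1:3:3j, 4:6:3j]  # the step 3j means "3 points, endpoints included"
print(x1)  # [[1. 1. 1.]
           #  [2. 2. 2.]
           #  [3. 3. 3.]]
print(x2)  # [[4. 5. 6.]
           #  [4. 5. 6.]
           #  [4. 5. 6.]]
print(x1 + x2)  # Step 4: F(x, y) = x + y evaluated on the whole grid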

Then np.stack((x1.flat, x2.flat), axis=1) pairs the two flattened grids into one (x, y) test point per row (original screenshot omitted).
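Continuing the small sketch above:

pts = np.stack((x1.flat, x2.flat), axis=1)  # shape (9, 2): one (x, y) pair per row
print(pts[:4])  # [[1. 4.] [1. 5.] [1. 6.] [2. 4.]]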

  2. Specify the default font

mpl.rcParams['font.sans-serif'] = [u'SimHei']  # use SimHei so Chinese labels display
mpl.rcParams['axes.unicode_minus'] = False     # render the minus sign correctly

  3. Draw

cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])
plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)  # predicted regions as background
plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k', s=50, cmap=cm_dark)  # samples
plt.scatter(x_test[:, 0], x_test[:, 1], s=120, facecolors='none', zorder=10)  # circle the test-set samples
plt.xlabel(u'花萼长度', fontsize=13)  # sepal length
plt.ylabel(u'花萼宽度', fontsize=13)  # sepal width
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.title(u'鸢尾花SVM二特征分类', fontsize=15)  # iris SVM classification on two features

# plt.grid()

plt.show()

pcolormesh(x, y, z, cmap): called here with x1, x2, grid_hat and cmap=cm_light, it draws the background of predicted regions.

In scatter, edgecolors sets the edge color of the points, s sets the point size, and cmap maps the values in c to colors.

xlim sets the limits of the plot's x axis.

The final result is the decision-region plot (original screenshot omitted).

Source code:

 

# -*- coding:utf-8 -*-

from sklearn import svm

import numpy as np

import matplotlib.pyplot as plt

import matplotlib as mpl

from matplotlib import colors

from sklearn.model_selection import train_test_split

def iris_type(s):
    # Map the class name in column 5 to a number.
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

def show_accuracy(y_hat, y_test, param):
    # The original post left this as a stub (pass); a simple accuracy print-out:
    print('%s accuracy: %.3f' % (param, (y_hat.ravel() == y_test.ravel()).mean()))

 

path = 'F:\\Test\\iris.data'  # path to the data file

data = np.loadtxt(path, dtype=float, delimiter=',', converters={4: iris_type})

 

x, y = np.split(data, (4,), axis=1)

x = x[:, :2]

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, train_size=0.6)

 

# clf = svm.SVC(C=0.1, kernel='linear', decision_function_shape='ovr')

clf = svm.SVC(C=0.8, kernel='rbf', gamma=20, decision_function_shape='ovr')

clf.fit(x_train, y_train.ravel())

 

print(clf.score(x_train, y_train))  # training accuracy
y_hat = clf.predict(x_train)
show_accuracy(y_hat, y_train, 'training set')
print(clf.score(x_test, y_test))    # test accuracy
y_hat = clf.predict(x_test)
show_accuracy(y_hat, y_test, 'test set')

 

print('decision_function:\n', clf.decision_function(x_train))
print('\npredict:\n', clf.predict(x_train))

 

x1_min, x1_max = x[:, 0].min(), x[:, 0].max()  # range of column 0
x2_min, x2_max = x[:, 1].min(), x[:, 1].max()  # range of column 1
x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]  # generate grid sample points
grid_test = np.stack((x1.flat, x2.flat), axis=1)  # points to classify

 

 

mpl.rcParams['font.sans-serif'] = [u'SimHei']  # use SimHei so Chinese labels display
mpl.rcParams['axes.unicode_minus'] = False     # render the minus sign correctly

 

cm_light = mpl.colors.ListedColormap(['#A0FFA0', '#FFA0A0', '#A0A0FF'])
cm_dark = mpl.colors.ListedColormap(['g', 'r', 'b'])

 

# print('grid_test =\n', grid_test)
grid_hat = clf.predict(grid_test)      # predicted class for each grid point
grid_hat = grid_hat.reshape(x1.shape)  # reshape to match the grid

 

alpha = 0.5

plt.pcolormesh(x1, x2, grid_hat, cmap=cm_light)  # show the predicted regions
# plt.scatter(x[:, 0], x[:, 1], c=y, edgecolors='k', s=50, cmap=cm_dark)  # samples
plt.plot(x[:, 0], x[:, 1], 'o', alpha=alpha, color='blue', markeredgecolor='k')
plt.scatter(x_test[:, 0], x_test[:, 1], s=120, facecolors='none', zorder=10)  # circle the test-set samples
plt.xlabel(u'花萼长度', fontsize=13)  # sepal length
plt.ylabel(u'花萼宽度', fontsize=13)  # sepal width
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.title(u'鸢尾花SVM二特征分类', fontsize=15)  # iris SVM classification on two features

# plt.grid()

plt.show()

Source: https://blog.csdn.net/yaoxy/article/details/78878446
