Machine learning practice ---- MNIST classification with SVM in sklearn

1. Introduction

Environment: Windows 10, Python 3.6, Jupyter Notebook

References:

[Machine Learning Practice] Support Vector Machines ---- classification library and simple MNIST training

https://blog.csdn.net/u013597931/article/details/80076058

SVM study notes (2)----Handwritten digit recognition

https://blog.csdn.net/chunxiao2008/article/details/50448154

 

2. Detailed explanation of the SVM module

Basic theory ------- Algorithm library ------- Attributes ------- Methods

Basic theory:

  • The basic idea is to classify with the maximum margin.
  • To handle nonlinear problems, the kernel function maps the feature vectors into a high-dimensional space where they become linearly separable, while the computation itself is still carried out in the original low-dimensional space.
  • To allow for possible noise in the data, slack variables are also introduced (see the objective sketched below).
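
These three ideas come together in the standard soft-margin formulation (shown here only for reference, with \phi the feature map induced by the kernel, \xi_i the slack variables and C the penalty parameter):

\min_{w,b,\xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i \, (w^\top \phi(x_i) + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0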

 

These libraries are provided in the sklearn.svm module: 

(figure: classes and functions provided by sklearn.svm)

They fall roughly into these categories (the exception is l1_min_c, a helper that returns the lowest bound of the penalty parameter C for an L1-penalized model):

(figure: grouping of the estimators into classification, regression and outlier detection)

 

Main parameters of the classification library:

(figure: table of the main parameters of the classification estimators)

For the other parameters, see the official documentation:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm
or this blogger's (very good) summary:
http://www.cnblogs.com/pinard/p/6117515.html
 

Attributes:

(figure: attributes of the fitted classifiers)

 

Methods:

These are the same as for models in general: fit() for training, score() for scoring, predict() for predicting new samples, decision_function() for computing the distance from a sample to the separating hyperplane, and so on.
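
A minimal sketch of this shared interface (the toy data below is only illustrative and is not the data used later):

from sklearn import datasets, svm

## A tiny illustrative dataset
X, y = datasets.make_blobs(n_samples=50, n_features=2, centers=2, random_state=0)

clf = svm.SVC(kernel='linear')
clf.fit(X, y)                          # train the model
print(clf.score(X, y))                 # mean accuracy on (X, y)
print(clf.predict(X[:5]))              # predicted labels for new samples
print(clf.decision_function(X[:5]))    # signed distances to the separating hyperplane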

 

3. Simple implementation of SVM classification

Three different SVM classifiers are used for classification:

  • SVC:        svm.SVC(kernel='linear')
  • LinearSVC:  svm.LinearSVC()
  • NuSVC:      svm.NuSVC(kernel='linear')

 

As can be seen from the results below:

  • The separating hyperplane obtained by LinearSVC differs from that of SVC;
  • The support vectors of the NuSVC model account for about half of the samples (nu defaults to 0.5); with nu=0.01 the result is the same as for SVC. So when there is a requirement on the training error or on the fraction of support vectors, NuSVC is the one to choose.

1. Data set

## For plotting
import matplotlib.pyplot as plt
## For numerical computation
import numpy as np
## For data analysis
import pandas as pd
## For loading or generating data sets
from sklearn import datasets
## Load the svm module
from sklearn import svm
## Randomly generate a data set with 2 features and 80 samples
X, y = datasets.make_blobs(n_samples=80, n_features=2, centers=2, random_state=3)
fig=plt.figure(figsize=(10,8))
plt.xlim(-8,4)  # x-axis range
plt.ylim(-5,8)  # y-axis range
plt.scatter(X[:, 0], X[:, 1], c=y, s=30)
plt.show()

 

2. svm.SVC(kernel='linear')

## Load the SVC model with a linear kernel
model_svc=svm.SVC(kernel='linear')
model_svc.fit(X,y)

print("Indices of the support vectors in the training set",model_svc.support_)
print("Support vectors of each class:\n",model_svc.support_vectors_)
print("Number of support vectors per class",model_svc.n_support_)
print("Feature coefficients",model_svc.coef_)
print("Intercept",model_svc.intercept_)
print("Distance of each sample to the separating hyperplane:\n",model_svc.decision_function(X))

fig=plt.figure(figsize=(10,8))
plt.xlim(-8,4)  # x-axis range
plt.ylim(-5,8)  # y-axis range
## Plot the separating hyperplane and the margins
w1=model_svc.coef_[:,0]
w2=model_svc.coef_[:,1]
b=model_svc.intercept_
x1=np.linspace(-8,6,2)
x2=(w1*x1+b)/(-1*w2)
x2_up=(w1*x1+b+1)/(-1*w2)
x2_down=(w1*x1+b-1)/(-1*w2)
plt.plot(x1,x2,'k-',linewidth=0.8)
plt.plot(x1,x2_up,'k--',linewidth=0.8)
plt.plot(x1,x2_down,'k--',linewidth=0.8)
## Plot the samples and circle the support vectors
plt.scatter(X[:, 0], X[:, 1], c=y,s=30)
plt.scatter(model_svc.support_vectors_[:, 0], model_svc.support_vectors_[:, 1],s=80,facecolors='none',edgecolors='b')
plt.show()
Indices of the support vectors in the training set [ 3 77 63]
Support vectors of each class:
 [[ 0.21219196  1.74387328]
 [-1.23229972  3.89519459]
 [-2.94843418  0.3655385 ]]
Number of support vectors per class [2 1]
Feature coefficients [[-0.48951758 -0.32852537]]
Intercept [-0.32333516]
Distance of each sample to the separating hyperplane:
 [-1.96224709  1.96992652  1.61830594 -1.00011347 -3.03968748 -1.91355576
 -3.20222196  1.07605938  1.39390527  1.19794817 -3.09852679 -2.99356435
  1.83058651  2.46025289  1.84454041 -1.98203511  1.18207352 -2.21362739
 -1.93596757  1.5062249  -3.13955464 -1.41328098  2.11163776 -2.0100733
  1.23402066 -1.3997197   1.42460256  1.9676612   1.10767531  1.64961948
  1.95638419  1.51193805 -1.2642258   2.06733658  1.99862207  1.49307471
 -1.44123444 -1.54063897  2.21232256  3.39921728  1.08180429  1.72267793
 -3.1813601   1.61914905  1.59985133 -1.70286262 -1.94181226  1.59417872
  2.15236394 -2.64727844 -2.54908967 -1.45290411 -2.30745878 -2.58497233
  2.2307059  -2.6951711  -2.96443813 -1.73637146  2.20696118 -1.77028229
 -2.67467925 -1.60612382  2.59439321  0.99988654 -1.59570877  1.53629311
 -2.69403494  1.44783106 -2.07984685 -1.3734872   1.09058746  1.60125344
  1.76284029 -1.83576229 -1.90749178 -2.44163699  2.01923035 -0.99977302
  2.01835361 -1.9910022 ]

 

3. svm.LinearSVC()

## Load the LinearSVC model
model_svc=svm.LinearSVC()
model_svc.fit(X,y)
print("Feature coefficients",model_svc.coef_)
print("Intercept",model_svc.intercept_)
print("Distance of each sample to the separating hyperplane:\n",model_svc.decision_function(X))

fig=plt.figure(figsize=(10,8))
plt.xlim(-8,4)  # x-axis range
plt.ylim(-5,8)  # y-axis range
## Plot the separating hyperplane and the margins
w1=model_svc.coef_[:,0]
w2=model_svc.coef_[:,1]
b=model_svc.intercept_
x1=np.linspace(-8,6,2)
x2=(w1*x1+b)/(-1*w2)
x2_up=(w1*x1+b+1)/(-1*w2)
x2_down=(w1*x1+b-1)/(-1*w2)
plt.plot(x1,x2,'k-',linewidth=0.8)
plt.plot(x1,x2_up,'k--',linewidth=0.8)
plt.plot(x1,x2_down,'k--',linewidth=0.8)

## Plot the samples
plt.scatter(X[:, 0], X[:, 1], c=y,s=30)
plt.show()
Feature coefficients [[-0.43861872 -0.34666587]]
Intercept [-0.17687917]
Distance of each sample to the separating hyperplane:
 [-1.93588632  1.87233349  1.51364971 -0.87449188 -2.92992153 -1.76381697
 -3.07749808  1.04025334  1.29833361  1.11381917 -2.93977862 -2.79956624
  1.71631792  2.39358778  1.73195783 -1.92786837  1.09573885 -2.04452242
 -1.82100181  1.42159955 -2.9877508  -1.29033632  2.02059401 -1.88099763
  1.19084793 -1.30551704  1.40518201  1.87446768  0.99415791  1.64725845
  1.9009514   1.43403277 -1.1375335   1.94926655  1.87258846  1.44713253
 -1.30509778 -1.4745769   2.12237058  3.19811264  1.04509393  1.58901995
 -3.04486552  1.48196178  1.52057507 -1.59990501 -1.8120006   1.48499025
  2.06828345 -2.51227207 -2.3304399  -1.36943056 -2.11483449 -2.43410562
  2.09715535 -2.51291992 -2.80568943 -1.64868248  2.14688497 -1.69265193
 -2.59427138 -1.57782921  2.51736177  0.98963954 -1.54715292  1.50308216
 -2.61135208  1.38378795 -1.94731015 -1.30279915  0.96583083  1.58100889
  1.64279966 -1.796035   -1.8219179  -2.28425194  1.91261392 -0.98670045
  1.90619623 -1.79154379]

 

4. svm.NuSVC(kernel='linear')

## Load the NuSVC model; nu uses the default value 0.5
model_svc=svm.NuSVC(kernel='linear')
#model_svc=svm.NuSVC(kernel='linear',nu=0.01)
model_svc.fit(X,y)
print("Indices of the support vectors in the training set",model_svc.support_)
print("Number of support vectors per class",model_svc.n_support_)
print("Feature coefficients",model_svc.coef_)
print("Intercept",model_svc.intercept_)
print("Distance of each sample to the separating hyperplane:\n",model_svc.decision_function(X))

fig=plt.figure(figsize=(10,8))
plt.xlim(-8,4)  # x-axis range
plt.ylim(-5,8)  # y-axis range
## Plot the separating hyperplane and the margins
w1=model_svc.coef_[:,0]
w2=model_svc.coef_[:,1]
b=model_svc.intercept_
x1=np.linspace(-8,6,2)
x2=(w1*x1+b)/(-1*w2)
x2_up=(w1*x1+b+1)/(-1*w2)
x2_down=(w1*x1+b-1)/(-1*w2)
plt.plot(x1,x2,'k-',linewidth=0.8)
plt.plot(x1,x2_up,'k--',linewidth=0.8)
plt.plot(x1,x2_down,'k--',linewidth=0.8)

## Plot the samples and circle the support vectors
plt.scatter(X[:, 0], X[:, 1], c=y,s=30)
plt.scatter(model_svc.support_vectors_[:, 0], model_svc.support_vectors_[:, 1],s=80,facecolors='none',edgecolors='b')
plt.show()
Indices of the support vectors in the training set [ 0  3  5 18 21 25 32 36 37 45 46 51 57 59 61 64 69 73 74 77 79  2  7  8
  9 16 19 24 26 28 31 35 40 43 44 47 63 65 67 70 71]
Number of support vectors per class [21 20]
Feature coefficients [[-0.26852918 -0.18506518]]
Intercept [-0.07402223]
Distance of each sample to the separating hyperplane:
 [-1.00000001  1.18344728  0.98651753 -0.45373219 -1.59369374 -0.9613798
 -1.68303355  0.69021961  0.86209936  0.75377706 -1.62199582 -1.56013705
  1.10412131  1.46001562  1.11206673 -1.00846728  0.74471129 -1.12708414
 -0.97711454  0.92581127 -1.64554153 -0.68461074  1.26315788 -1.017172
  0.77771057 -0.67970603  0.886296    1.18259074  0.70066156  1.01348247
  1.17979741  0.9296235  -0.60106054  1.23592282  1.1968279   0.92205787
 -0.6989911  -0.76097669  1.31946142  1.97167963  0.69334256  1.04208873
 -1.67029696  0.98397159  0.97856961 -0.84810873 -0.97900041  0.97262943
  1.28653695 -1.3723103  -1.30974513 -0.7103885  -1.17727995 -1.33606029
  1.32568017 -1.39466306 -1.54714741 -0.86822923  1.31923904 -0.88809099
 -1.3926683  -0.80103249  1.53393157  0.65006995 -0.79334001  0.94736295
 -1.4032617   0.89512436 -1.0557987  -0.66724352  0.69008091  0.9848262
  1.0657701  -0.92815667 -0.96394481 -1.25544599  1.21013198 -0.46397868
  1.20912877 -1.        ]

 

4. MNIST data set

MNIST official website: http://yann.lecun.com/exdb/mnist/

The data sets on the official website are provided in .gz format, and decompressing .gz files under Windows is somewhat troublesome.
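
If needed, the archives can also be unpacked directly from Python with the standard gzip module instead of a separate Windows tool; a minimal sketch (the file name is just one example from the MNIST site):

import gzip
import shutil

## Decompress one downloaded MNIST archive into a plain idx file
with gzip.open('train-images-idx3-ubyte.gz', 'rb') as f_in:
    with open('train-images-idx3-ubyte', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)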

Therefore, two versions of the data set are used for training and compared:

  • the first is the decompressed official data set (idx files);
  • the other is mnist.pkl.gz.

 

1. mnist.pkl.gz

Reference: https://blog.csdn.net/chunxiao2008/article/details/50448154

Calling the SVM in the scikit-learn library directly with the default parameters, 9,435 of the 10,000 handwritten-digit test images are classified correctly.

The code below does not use the validation set: in this example the test set and the validation set would be used to judge the same thing, so there is no need to validate again with the validation set.

Import Data:

# import cPickle              # Python 2
import _pickle as cPickle     # Python 3 replacement for cPickle
import gzip
import numpy as np
from sklearn import svm
import time

def load_data():
    """
    Return the MNIST data as a tuple (training_data, validation_data, test_data).
    The training data contains 50,000 images; the validation data and the test data
    each contain 10,000 images.
    """
    f = gzip.open('MNIST_data/mnist.pkl.gz', 'rb')
    # training_data, validation_data, test_data = cPickle.load(f)   # Python 2
    training_data, validation_data, test_data = cPickle.load(f, encoding='bytes')
    f.close()
    return (training_data, validation_data, test_data)
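
As a quick sanity check of the archive layout (the shapes below assume the standard mnist.pkl.gz described in the docstring):

training_data, validation_data, test_data = load_data()
print(training_data[0].shape)   # expected (50000, 784): flattened 28x28 images
print(training_data[1].shape)   # expected (50000,): integer labels
print(test_data[0].shape)       # expected (10000, 784)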

Classification:

# Handwritten-digit classification on the MNIST data set using an SVM classifier


def svm_baseline():
    # print time.strftime('%Y-%m-%d %H:%M:%S')   # Python 2
    print(time.strftime('%Y-%m-%d %H:%M:%S'))
    training_data, validation_data, test_data = load_data()
    # Pass the parameters of the model; the defaults are used here
    clf = svm.SVC()
    # clf = svm.SVC(C=8.0, kernel='rbf', gamma=0.00,cache_size=8000,probability=False)
    # Train the model
    clf.fit(training_data[0], training_data[1])
    # Evaluate the predictions on the test set
    predictions = [int(a) for a in clf.predict(test_data[0])]
    num_correct = sum(int(a == y) for a, y in zip(predictions, test_data[1]))
    print("%s of %s test values correct." % (num_correct, len(test_data[1])))
    print(time.strftime('%Y-%m-%d %H:%M:%S'))

if __name__ == "__main__":
    svm_baseline()

Result:

2018-12-17 15:16:57
9435 of 10000 test values correct.
2018-12-17 15:29:36

 

2. t10k-images.idx3-ubyte etc.

Data:

import numpy as np
import struct
import matplotlib.pyplot as plt
import os
## Load the svm module
from sklearn import svm
### For data preprocessing
from sklearn import preprocessing
import time
path='MNIST_data'
def load_mnist_train(path, kind='train'):    
    labels_path = os.path.join(path,'%s-labels.idx1-ubyte'% kind)
    images_path = os.path.join(path,'%s-images.idx3-ubyte'% kind)
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II',lbpath.read(8))
        labels = np.fromfile(lbpath,dtype=np.uint8)
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack('>IIII',imgpath.read(16))
        images = np.fromfile(imgpath,dtype=np.uint8).reshape(len(labels), 784)
    return images, labels
def load_mnist_test(path, kind='t10k'):
    labels_path = os.path.join(path,'%s-labels.idx1-ubyte'% kind)
    images_path = os.path.join(path,'%s-images.idx3-ubyte'% kind)
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II',lbpath.read(8))
        labels = np.fromfile(lbpath,dtype=np.uint8)
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack('>IIII',imgpath.read(16))
        images = np.fromfile(imgpath,dtype=np.uint8).reshape(len(labels), 784)
    return images, labels   
train_images,train_labels=load_mnist_train(path)
test_images,test_labels=load_mnist_test(path)

Classification:

# Standardization
X=preprocessing.StandardScaler().fit_transform(train_images)  
X_train=X[0:60000]
y_train=train_labels[0:60000]

# Train the model
print(time.strftime('%Y-%m-%d %H:%M:%S'))
model_svc = svm.SVC()
model_svc.fit(X_train,y_train)
print(time.strftime('%Y-%m-%d %H:%M:%S'))

# Score and predict
x=preprocessing.StandardScaler().fit_transform(test_images)
x_test=x[0:10000]
y_pred=test_labels[0:10000]
print(model_svc.score(x_test,y_pred))
y=model_svc.predict(x_test)

Result:

2018-12-17 17:20:00
2018-12-17 17:30:53

0.9657
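
One caveat about the preprocessing above: a new StandardScaler is fitted on the test images, whereas strictly the scaler fitted on the training images should be reused to transform them. A minimal sketch of that variant (it may shift the score slightly from the 0.9657 reported above):

scaler = preprocessing.StandardScaler().fit(train_images)   # fit on the training data only
X_train = scaler.transform(train_images)
x_test = scaler.transform(test_images)                      # reuse the same scaler for the test data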

 

Summary:

As can be seen, with either data set the training takes more than 10 minutes, and with the default parameters the error is still noticeable; both can be improved by tuning the parameters.

Important parameters (a constructor example follows the list):

  • C: float, optional (default=1.0). Penalty parameter C of the error term.
  • kernel: string, optional (default='rbf'). Specifies the kernel type; one of 'linear', 'poly', 'rbf', 'sigmoid', etc.
  • degree: integer, optional (default=3). Degree of the polynomial kernel ('poly'); ignored by the other kernels.
  • gamma: float, optional (default=0.0). Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. If gamma is 0, the reciprocal of the number of features is used instead; that is, with 100 features the actual gamma is 1/100.
  • coef0: float, optional (default=0.0). Independent term of the kernel function; only meaningful for the 'poly' and 'sigmoid' kernels.
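
As an illustration only (the values below are arbitrary examples, not recommendations), these parameters map onto the SVC constructor as follows:

from sklearn import svm

## Example values only; degree and coef0 would only matter for the 'poly'/'sigmoid' kernels
clf = svm.SVC(C=1.0, kernel='rbf', gamma=0.01, degree=3, coef0=0.0)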

 

Machine learning expert Andrew Ng has said that for SVM classification he has essentially always used the Gaussian kernel and has hardly ever used other kernel functions; it is indeed the most widely used kernel.

As for the gamma parameter of the Gaussian kernel: if gamma is chosen very small, the weights on the higher-order features decay very quickly, so the mapping is (numerically) close to a low-dimensional subspace; conversely, if gamma is chosen very large, any data can be mapped so as to be linearly separable, which easily leads to very serious over-fitting.
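
For reference, the Gaussian (RBF) kernel that gamma parameterizes is

K(x, x') = \exp(-\gamma \, \lVert x - x' \rVert^2)

so a larger gamma means a narrower kernel and a more flexible decision boundary, while a smaller gamma smooths the boundary out.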

The C parameter sets the trade-off between finding the hyperplane with the largest margin and keeping the deviation of the data points small: the larger C is, the less deviation the model allows.

For the same C, the larger gamma is, the more closely the decision boundary hugs the samples; for the same gamma, the larger C is, the stricter the classification.

 

If gamma is too large, over-fitting cannot be prevented no matter how C is regularized.

When gamma is small, the classification boundary is nearly linear.

For intermediate values of gamma, the (gamma, C) pairs of good models are roughly distributed along a diagonal.

It is also worth noting that when gamma takes intermediate values, the value of C can be very large.
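
To find a good (C, gamma) pair in practice, a small grid search over both parameters is the usual approach. The following is only a sketch (the grid values, the 5,000-sample subset and cv=3 are arbitrary choices made for speed, not the settings used above):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

## Logarithmic grid over C and gamma for an RBF-kernel SVC
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=3)
## X_train, y_train as prepared in section 4; subsampled to keep the search fast
grid.fit(X_train[:5000], y_train[:5000])
print(grid.best_params_, grid.best_score_)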

Origin: https://blog.csdn.net/bailixuance/article/details/85051594