Contour Recognition Based on SVM and Complex Network - Handwritten Digit Recognition

Table of contents

Introduction

Technologies Used

1. OpenCV: Canny

2. Gray Level Co-occurrence Matrix (GLCM)

3. Harris Corner Detection

4. Support Vector Machine: Nonlinear SVM

Build Process

Data Preparation

Data Processing: Conversion to Contour Maps

Data Processing: Contour Map Binarization

Implementing Nonlinear SVM

Summary

Advantages and Disadvantages of SVM

Advantages

Disadvantages


Introduction

Handwritten digit recognition is a classic problem in computer vision and pattern recognition. Its goal is to convert images of handwritten digits into the corresponding digit labels. Realizing handwritten digit recognition typically requires steps such as image preprocessing, feature extraction, and classifier training.

Many methods have been applied to handwritten digit recognition. One common approach is contour recognition based on a Support Vector Machine (SVM) combined with a complex network.

The basic idea of this method is to first extract the contours of the handwritten digit images and then use these contours as features to train a classifier such as an SVM. The concept of a complex network is introduced to build a graph structure, where nodes represent the key points of the contour and edges represent the connections between key points.

Technologies Used

1. OpenCV: Canny

Edge detection is a commonly used image segmentation method that works by extracting the discontinuous parts of an image. Common edge detection operators include the difference, Roberts, Sobel, Prewitt, LoG, and Canny operators.

The Canny operator is an edge detection operator proposed by John F. Canny in 1986 and is considered one of the most complete edge detection algorithms available today. Many common image processing tools (such as MATLAB and OpenCV) provide a built-in Canny API.

Steps of the Canny edge detection algorithm

The goal of the Canny edge detection algorithm is to find the strong edges in an image while suppressing noise and weak edges as far as possible. The algorithm proceeds as follows (a minimal OpenCV sketch follows the list):

Noise suppression: First, the image is smoothed using a Gaussian filter to reduce the effect of noise.

Calculate Gradients: Then, calculate the gradient strength and direction for each pixel in the image. This can be achieved by applying a filter such as Sobel.

Non-Maximum Suppression: Next, non-maximum suppression is performed on the gradient magnitude image to thin the edges and remove spurious edge responses.

Dual Thresholding: Then, use dual thresholding to determine the strength of the edges. According to the set threshold, the edge pixels are divided into strong edge, weak edge and non-edge pixels.

Edge connection: Finally, the edge connection algorithm is used to connect strong edge pixels and adjacent weak edge pixels to form a complete edge.
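As a quick illustration of these steps, the minimal sketch below runs Canny with OpenCV; the file name "digit.jpg" and the 50/150 thresholds are placeholders, not the values used later in this post.

import cv2 as cv

img = cv.imread("digit.jpg", cv.IMREAD_GRAYSCALE)   # load as grayscale (placeholder file name)
blurred = cv.GaussianBlur(img, (3, 3), 0)           # step 1: noise suppression
edges = cv.Canny(blurred, 50, 150)                  # steps 2-5: gradients, NMS, dual thresholds, edge linking
cv.imwrite("digit_edges.jpg", edges)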
 

2. Gray Level Co-occurrence Matrix (GLCM)

The Gray Level Co-occurrence Matrix (GLCM), also known as the Spatial Gray Level Dependency Matrix (SGLDM), is a statistical tool for describing image texture. It captures the spatial relationship between gray levels in an image and provides statistics for analyzing texture features.

Definition: The gray level co-occurrence matrix is an N×N matrix, where N is the number of gray levels in the image. Each element GLCM(i, j) records how often a pixel pair with gray levels i and j occurs in the image.

Principle: The gray level co-occurrence matrix is computed from the frequency with which pixel pairs occur in the image, and can therefore capture the texture of the image. Information about the texture can be obtained by analyzing statistical properties of the matrix such as contrast, correlation, energy, and entropy.
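For reference, with P(i, j) denoting the (normalized) co-occurrence matrix entry, these statistics are commonly defined as:

Contrast = \sum_{i,j} (i-j)^2 P(i,j)

Energy (ASM) = \sum_{i,j} P(i,j)^2

Entropy = -\sum_{i,j} P(i,j) \log P(i,j)

Correlation = \sum_{i,j} \frac{(i-\mu_i)(j-\mu_j) P(i,j)}{\sigma_i \sigma_j}

where μ and σ are the means and standard deviations of the row and column marginals of P.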

Formula: Assume the gray levels of the image lie in [0, K-1], where K is the number of gray levels. Given a displacement vector (d, θ), where d is the distance between the pixel pair and θ is the direction of the displacement (for example, 0°, 45°, 90°, 135°), the elements of the gray level co-occurrence matrix are computed as follows:

GLCM(i, j) = \sum_{x} \sum_{y} \delta(g(x, y), i) \cdot \delta(g(x + \Delta x, y + \Delta y), j)

where (Δx, Δy) is the pixel offset determined by the displacement vector (d, θ), g(x, y) is the gray value of the pixel at coordinates (x, y), and δ(a, b) is an indicator function equal to 1 when a = b and 0 otherwise.

The general process of calculating the gray level co-occurrence matrix is as follows (a small scikit-image sketch follows the list):

  1. Convert the original image to a grayscale image so that each pixel has a single gray level (quantizing the gray levels if needed).

  2. Select the displacement vector (d, θ) as desired. Common choices include horizontal (1, 0), vertical (0, 1), and diagonal (1, 1).

  3. For each displacement vector, iterate through each pixel of the image.

  4. For the current pixel position (x, y), find the neighboring pixel position (x+Δx, y+Δy) given by the displacement vector (d, θ).

  5. According to the gray values of the two pixels, increment the corresponding element GLCM(g(x, y), g(x+Δx, y+Δy)) of the gray level co-occurrence matrix.

  6. Traverse the complete image and calculate the final gray level co-occurrence matrix.

  7. According to the gray level co-occurrence matrix, a series of texture features such as contrast, correlation, energy and entropy can be calculated.
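The same computation is available off the shelf. The small sketch below uses scikit-image's graycomatrix and graycoprops (the same functions used in the batch script later in this post); the random array stands in for a real grayscale image.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

gray = np.random.randint(0, 256, (64, 64), dtype=np.uint8)   # stand-in for a grayscale image
# distances and angles together define the displacement vectors (d, theta)
glcm = graycomatrix(gray, distances=[1], angles=[0, np.pi/2],
                    levels=256, symmetric=True, normed=True)
contrast = graycoprops(glcm, 'contrast')      # one value per (distance, angle) pair
energy = graycoprops(glcm, 'energy')
print(contrast.shape, energy.shape)           # (1, 2) each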

3. Harris Corner Detection

Harris corner detection is a classic computer vision algorithm for detecting corners in images. The algorithm was proposed by Chris Harris and Mike Stephens in 1988 and is named after them. The Harris corner detection algorithm is based on the local features of the corners and can robustly detect the corners in the image.

The basic idea of the algorithm is to slide a fixed window over the image in all directions and compare how much the grayscale of the pixels inside the window changes between the position before and after the shift. If the grayscale changes significantly for a shift in any direction, the window is considered to contain a corner.
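As a rough sketch of how this is used in practice, OpenCV's cornerHarris returns a response map whose large values mark corners. The parameters shown match the ones used in the batch script below (block_size=2, ksize=3, k=0.04); the input file name and the 0.01 threshold are placeholders.

import cv2
import numpy as np

gray = cv2.imread("digit.jpg", cv2.IMREAD_GRAYSCALE)        # placeholder file name
response = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)   # block_size=2, ksize=3, k=0.04
corners = response > 0.01 * response.max()                  # keep the strongest responses
print("corner pixels:", int(corners.sum()))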

4. Support Vector Machine: Nonlinear SVM

A nonlinear Support Vector Machine (nonlinear SVM) is a classification algorithm for problems that are not linearly separable. Compared with a linear SVM, a nonlinear SVM maps the low-dimensional input space to a high-dimensional feature space through a kernel function, so that a linearly separating hyperplane can be constructed in that feature space.

The kernel function is the key part of a nonlinear SVM: it defines the mapping from the input space to the feature space. Through the kernel function, input samples are mapped from the low-dimensional space to a high-dimensional space where they can be separated more easily by a linear hyperplane. Commonly used kernels include the Gaussian (RBF) kernel, the polynomial kernel, and the sigmoid kernel.
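For reference, the Gaussian (RBF) kernel mentioned here is usually written as

K(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right)

The hand-written SMO code later in this post uses the equivalent form exp(-||x_i - x_j||^2 / σ^2) with σ passed in as k1, while sklearn parameterizes it as exp(-γ||x_i - x_j||^2).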

The training process of a nonlinear SVM can be briefly summarized in the following steps (a minimal scikit-learn sketch follows the list):

  1. Data preprocessing: Preprocessing the original data, including feature scaling, feature selection, etc.

  2. Kernel function selection: Select an appropriate kernel function according to the characteristics of the problem and the distribution of the data, and the Gaussian kernel is commonly used.

  3. Feature Mapping: Through the selected kernel function, the input samples are mapped to a high-dimensional feature space.

  4. Solve the optimization problem: In the high-dimensional feature space, use the optimization algorithm of the support vector machine to solve the corresponding optimization problem and find the optimal classification hyperplane.

  5. Decision function construction: According to the support vector obtained from the solution and the corresponding Lagrangian multipliers, the decision function of the nonlinear SVM is constructed.

  6. Prediction and classification: use the trained nonlinear SVM model to predict and classify new samples.
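A minimal sketch of these steps with scikit-learn, on synthetic data and with illustrative (untuned) parameters, looks like this:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.randn(200, 2)                              # stand-in features
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)      # labels that are not linearly separable
X_scaled = StandardScaler().fit_transform(X)             # step 1: feature scaling
clf = SVC(kernel='rbf', C=1.0, gamma='scale')            # steps 2-3: choose the RBF kernel (implicit mapping)
clf.fit(X_scaled, y)                                     # steps 4-5: solve the dual problem, build the decision function
print(clf.predict(X_scaled[:5]))                         # step 6: predict new samples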

Nonlinear SVMs have strong fitting ability and can handle complex nonlinear classification problems. However, training is expensive: the optimization problem is solved in a high-dimensional space, which can be demanding in computing resources and time. When applying a nonlinear SVM, one therefore has to balance model complexity against computational efficiency and choose appropriate kernel functions and model parameters.

With the techniques introduced, the build process of the project is described in detail below.


Build Process

Data Preparation

Handwrite the ten digits 0-9; try to write them with fairly thick strokes to make later recognition easier.

Before batch-processing the data, a trackbar (slider) visualization is built:

The image is first converted to grayscale, and a slider is then created to adjust the thresholds of the Canny edge detector.

This script is standalone; remember to change the image file name when you use it:

import numpy as np
import cv2 as cv
from matplotlib import pyplot as plt

src = cv.imread("./p.jpg")

# Resize the image by the given scale factor
scale = 0.5
resized = cv.resize(src, None, fx=scale, fy=scale, interpolation=cv.INTER_LINEAR)

# Adjust image contrast and brightness
alpha = 1.2
beta = 30
adjusted = cv.convertScaleAbs(resized, alpha=alpha, beta=beta)

# Rotate the image (a 360-degree rotation leaves it unchanged)
angle = 360
(h, w) = adjusted.shape[:2]
center = (w // 2, h // 2)
M = cv.getRotationMatrix2D(center, angle, 1.0)
rotated = cv.warpAffine(adjusted, M, (w, h))

# Flip the image horizontally
flipped = cv.flip(rotated, 1)

# Normalize pixel values to [0, 255]
normalized = cv.normalize(flipped, None, 0, 255, cv.NORM_MINMAX)

# Create a window and a trackbar for adjusting the Canny low threshold
cv.namedWindow("bar", cv.WINDOW_AUTOSIZE)
low_threshold = 0
high_threshold = 0
def do(x):
    global high_threshold
    if x != 0:
        high_threshold = 3 * x
cv.createTrackbar("low_threshold", "bar", 10, 100, do)

# Denoise the image with a Gaussian blur
normalized = cv.GaussianBlur(normalized, (3, 3), 0)
gray = cv.cvtColor(normalized, cv.COLOR_BGR2GRAY)

# Compute a normalized gray-level histogram (a simple texture statistic; not a true GLCM)
glcm = cv.calcHist([gray], [0], None, [256], [0, 256])
glcm = cv.normalize(glcm, None, norm_type=cv.NORM_L1)

# Image gradients (Sobel)
xgrad = cv.Sobel(gray, cv.CV_16SC1, 1, 0)
ygrad = cv.Sobel(gray, cv.CV_16SC1, 0, 1)

# Plot the normalized gray-level histogram
plt.plot(glcm)
plt.show()



while True:
    low_threshold = cv.getTrackbarPos("low_threshold", "bar")
    canny = cv.Canny(gray, low_threshold, high_threshold)

    # Build a mask covering the bounding boxes of large external contours
    mask = np.zeros_like(canny)
    contours, _ = cv.findContours(canny, cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
    for contour in contours:
        if cv.contourArea(contour) > 100:
            x, y, w, h = cv.boundingRect(contour)
            mask[y:y + h, x:x + w] = 255

    cv.imshow("canny", canny)
    if cv.waitKey(1) & 0xFF == 27:
        break

cv.destroyAllWindows()

Data Processing: Conversion to Contour Maps

The data is processed in batches as follows:

import os
import csv
import cv2
import sys
import time
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn import svm
import networkx as nx
import matplotlib.pyplot as plt

# ***** Canny operator parameters (note: these names shadow Python's built-in max/min) *****
max = 200
min = 30

# Parameters for the Harris corner detector
block_size = 2
ksize = 3
k = 0.04

# Create a directed graph
G = nx.DiGraph()

with open('./path.csv', mode='r', encoding='utf-8') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        if row.__str__()[2:10] == 'readpath':
            data_dir = row.__str__()[11:-2]
        elif row.__str__()[2:10] == 'savepath':
            savepath = row.__str__()[11:-2]
        elif row.__str__()[2:4] == 'dx':
            dx = int(row.__str__()[5:-2])
        elif row.__str__()[2:4] == 'dy':
            dy = int(row.__str__()[5:-2])
        elif row.__str__()[2:5] == 'max':
            max = int(row.__str__()[6:-2])
        elif row.__str__()[2:5] == 'min':
            min = int(row.__str__()[6:-2])
        elif row.__str__()[2:7] == 'ksize':
            ksize1 = int(row.__str__()[8:-2])
        else:
            continue

classes = os.listdir(data_dir)
i=0
total = classes.__len__()

distances = [1, 2, 3]
angles = [0, np.pi/4, np.pi/2, 3*np.pi/4]
properties = ['contrast', 'dissimilarity', 'homogeneity', 'energy', 'correlation', 'ASM']

# List storing all feature vectors
X = []
# List storing the label of each sample
y = []

for cls in classes:

    # Read the image
    img_src = cv2.imread(data_dir + cls, 1)
    # Image preprocessing
    img_src = cv2.cvtColor(img_src, cv2.COLOR_BGR2RGB)  # convert colour channels to RGB
    img_src = cv2.resize(img_src, (224, 224))  # resize
    img_src = img_src.astype(np.float32) / 255.0  # normalize pixel values to [0, 1]

    if not os.path.exists(savepath + "processed/"):
        os.makedirs(savepath + "processed/")
    # Scale back to 0-255 before saving, otherwise the float image is written as black
    cv2.imwrite(savepath + "processed/" + cls, (img_src * 255).astype(np.uint8))

    # Convert the image to grayscale (scale back to 0-255 first so the uint8 image keeps its contrast)
    gray_img = cv2.cvtColor((img_src * 255).astype(np.uint8), cv2.COLOR_RGB2GRAY)

    # Save the preprocessed grayscale image
    if not os.path.exists(savepath + "gray/"):
        os.makedirs(savepath + "gray/")
    cv2.imwrite(savepath + "gray/" + cls, gray_img)

    # Detect corners in the image with the Harris corner detector
    corners = cv2.cornerHarris(gray_img, block_size, ksize, k)

    # Keep the n strongest corner responses and build a descriptor from their (row, col) positions
    n = 30
    sorted_corners = np.argsort(corners.ravel())[::-1][:n]
    rows, cols = np.unravel_index(sorted_corners, corners.shape)  # flat indices -> pixel coordinates
    descriptor = np.zeros((1, n, 2))
    descriptor[0, :, 0] = rows
    descriptor[0, :, 1] = cols
    descriptor = descriptor.astype(np.float32)

    # Compute the GLCM feature vector
    glcm = graycomatrix(gray_img, distances, angles, symmetric=True, normed=True)
    glcm_props = []
    for prop in properties:
        glcm_props.append(graycoprops(glcm, prop).ravel())
    glcm_features = np.concatenate(glcm_props)
    # Save the GLCM features
    with open(savepath + "glcm_features.csv", mode='a+', encoding='utf-8', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(glcm_features)

        # Concatenate the GLCM features and the Harris corner descriptor into one feature vector
        feature_vector = np.concatenate((glcm_features, descriptor.ravel()))

        # Append the feature vector to the feature list
        X.append(feature_vector)

        # Append the label to the label list
        y.append(i)

        # Add the Harris corners as nodes of the directed graph
        for j in range(n):
            node_name = cls + '_' + str(j)
            G.add_node(node_name)

        # Add the connections between Harris corners as edges of the directed graph
        # (the loop variable is named k2 so it does not overwrite the Harris parameter k)
        for j in range(n):
            for k2 in range(j+1, n):
                edge_weight = np.linalg.norm(descriptor[0, j, :] - descriptor[0, k2, :])
                node_1 = cls + '_' + str(j)
                node_2 = cls + '_' + str(k2)
                G.add_edge(node_1, node_2, weight=edge_weight)

    # Re-read the original image and build a plain 8-bit grayscale version for Canny
    img_src = cv2.imread(data_dir + cls, 1)
    gray_img = cv2.cvtColor(img_src, cv2.COLOR_BGR2GRAY)
    if not os.path.exists(savepath + "gray/"):  # create the output directory if it does not exist
        os.makedirs(savepath + "gray/")
    cv2.imwrite(savepath + "gray/" + cls, gray_img)

    # Canny edge detection
    canny = cv2.Canny(gray_img, min, max)     # apply Canny with the configured low/high thresholds (apertureSize defaults to 3)
    if not os.path.exists(savepath + "canny/"):  # create the output directory if it does not exist
        os.makedirs(savepath + "canny/")
    cv2.imwrite(savepath + "canny/" + cls, canny)

    # Show processing progress
    i = i + 1
    sys.stdout.write('\r%s%%' % (i/total*100))
    sys.stdout.flush()


sys.stdout.write("\n")
sys.stdout.write("finish!")

For more information about complex networks, please see my other blog post: Complex Networks and NetworkX - Drawing of Directed and Undirected Graphs

Data Processing: Contour Map Binarization


import cv2
import numpy as np
import os
import binascii

# # Set input/output paths
# input_path = "./num_re/canny/"
# output_path = "./num_two/"



# Convert an image to a binary (0/1) image and save it as a txt file
def convert_image_to_binary(image_path, output_path):
    # Read the image as grayscale
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Resize the image to 32x32
    img_resized = cv2.resize(img, (32, 32), interpolation=cv2.INTER_AREA)

    # Binarize the pixel values to 0 or 1
    img_binary = np.where(img_resized > 0, 1, 0)

    # Write the binary image to a text file, one row per line
    with open(output_path, "w") as f:
        for row in img_binary:
            for pixel in row:
                f.write(str(pixel))
            f.write("\n")

    return img_binary


# Walk through all jpg images in the folder and convert each one to a binary image
folder_path = "./num_re/canny/"
output_folder = "./num_two/"
for filename in os.listdir(folder_path):
    if filename.endswith(".jpg"):
        image_path = os.path.join(folder_path, filename)
        output_path = os.path.join(output_folder, os.path.splitext(filename)[0] + ".txt")
        convert_image_to_binary(image_path, output_path)

Then you can train the model!


Implementing Nonlinear SVM

Next, we will use testSetRBF.txt and testSetRBF2.txt, the former as the training set and the latter as the test set. Dataset download address: https://github.com/Jack-Cherish/Machine-Learning/tree/master/SVM

Visualizing the dataset

Let's write a small program to take a quick look at the dataset:

# -*-coding:utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet(fileName):
    """
    Load the dataset
    Parameters:
        fileName - file name
    Returns:
        dataMat - data matrix
        labelMat - data labels
    """
    dataMat = []; labelMat = []
    fr = open(fileName)
    for line in fr.readlines():                                 # read line by line, stripping whitespace
        lineArr = line.strip().split('\t')
        dataMat.append([float(lineArr[0]), float(lineArr[1])])  # append the data point
        labelMat.append(float(lineArr[2]))                      # append the label
    return dataMat, labelMat

def showDataSet(dataMat, labelMat):
    """
    Visualize the data
    Parameters:
        dataMat - data matrix
        labelMat - data labels
    Returns:
        None
    """
    data_plus = []                                  # positive samples
    data_minus = []                                 # negative samples
    for i in range(len(dataMat)):
        if labelMat[i] > 0:
            data_plus.append(dataMat[i])
        else:
            data_minus.append(dataMat[i])
    data_plus_np = np.array(data_plus)              # convert to numpy arrays
    data_minus_np = np.array(data_minus)            # convert to numpy arrays
    plt.scatter(np.transpose(data_plus_np)[0], np.transpose(data_plus_np)[1])   # scatter plot of positive samples
    plt.scatter(np.transpose(data_minus_np)[0], np.transpose(data_minus_np)[1]) # scatter plot of negative samples
    plt.show()

if __name__ == '__main__':
    dataArr, labelArr = loadDataSet('testSetRBF.txt')           # load the training set
    showDataSet(dataArr, labelArr)

Program running result:

As you can see, the data are clearly not linearly separable. Next, we write the kernel function according to the formula and add the initialization parameter kTup, which stores the information related to the kernel function. At the same time, we only need to replace the previous inner-product operations with kernel-function evaluations. Finally, we write the testRbf() function for testing. Create the file svmMLiA.py and write the code as follows:

# -*-coding:utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import random



class optStruct:
	"""
	数据结构,维护所有需要操作的值
	Parameters:
		dataMatIn - 数据矩阵
		classLabels - 数据标签
		C - 松弛变量
		toler - 容错率
		kTup - 包含核函数信息的元组,第一个参数存放核函数类别,第二个参数存放必要的核函数需要用到的参数
	"""
	def __init__(self, dataMatIn, classLabels, C, toler, kTup):
		self.X = dataMatIn								#数据矩阵
		self.labelMat = classLabels						#数据标签
		self.C = C 										#松弛变量
		self.tol = toler 								#容错率
		self.m = np.shape(dataMatIn)[0] 				#数据矩阵行数
		self.alphas = np.mat(np.zeros((self.m,1))) 		#根据矩阵行数初始化alpha参数为0	
		self.b = 0 										#初始化b参数为0
		self.eCache = np.mat(np.zeros((self.m,2))) 		#根据矩阵行数初始化虎误差缓存,第一列为是否有效的标志位,第二列为实际的误差E的值。
		self.K = np.mat(np.zeros((self.m,self.m)))		#初始化核K
		for i in range(self.m):							#计算所有数据的核K
			self.K[:,i] = kernelTrans(self.X, self.X[i,:], kTup)

def kernelTrans(X, A, kTup): 
	"""
	通过核函数将数据转换更高维的空间
	Parameters:
		X - 数据矩阵
		A - 单个数据的向量
		kTup - 包含核函数信息的元组
	Returns:
	    K - 计算的核K
	"""
	m,n = np.shape(X)
	K = np.mat(np.zeros((m,1)))
	if kTup[0] == 'lin': K = X * A.T   					#线性核函数,只进行内积。
	elif kTup[0] == 'rbf': 								#高斯核函数,根据高斯核函数公式进行计算
		for j in range(m):
			deltaRow = X[j,:] - A
			K[j] = deltaRow*deltaRow.T
		K = np.exp(K/(-1*kTup[1]**2)) 					#计算高斯核K
	else: raise NameError('核函数无法识别')
	return K 											#返回计算的核K

def loadDataSet(fileName):
	"""
	读取数据
	Parameters:
	    fileName - 文件名
	Returns:
	    dataMat - 数据矩阵
	    labelMat - 数据标签
	"""
	dataMat = []; labelMat = []
	fr = open(fileName)
	for line in fr.readlines():                                     #逐行读取,滤除空格等
		lineArr = line.strip().split('\t')
		dataMat.append([float(lineArr[0]), float(lineArr[1])])      #添加数据
		labelMat.append(float(lineArr[2]))                          #添加标签
	return dataMat,labelMat

def calcEk(oS, k):
	"""
	计算误差
	Parameters:
		oS - 数据结构
		k - 标号为k的数据
	Returns:
	    Ek - 标号为k的数据误差
	"""
	fXk = float(np.multiply(oS.alphas,oS.labelMat).T*oS.K[:,k] + oS.b)
	Ek = fXk - float(oS.labelMat[k])
	return Ek

def selectJrand(i, m):
	"""
	函数说明:随机选择alpha_j的索引值

	Parameters:
	    i - alpha_i的索引值
	    m - alpha参数个数
	Returns:
	    j - alpha_j的索引值
	"""
	j = i                                 #选择一个不等于i的j
	while (j == i):
		j = int(random.uniform(0, m))
	return j

def selectJ(i, oS, Ei):
	"""
	内循环启发方式2
	Parameters:
		i - 标号为i的数据的索引值
		oS - 数据结构
		Ei - 标号为i的数据误差
	Returns:
	    j, maxK - 标号为j或maxK的数据的索引值
	    Ej - 标号为j的数据误差
	"""
	maxK = -1; maxDeltaE = 0; Ej = 0 						#初始化
	oS.eCache[i] = [1,Ei]  									#根据Ei更新误差缓存
	validEcacheList = np.nonzero(oS.eCache[:,0].A)[0]		#返回误差不为0的数据的索引值
	if (len(validEcacheList)) > 1:							#有不为0的误差
		for k in validEcacheList:   						#遍历,找到最大的Ek
			if k == i: continue 							#不计算i,浪费时间
			Ek = calcEk(oS, k)								#计算Ek
			deltaE = abs(Ei - Ek)							#计算|Ei-Ek|
			if (deltaE > maxDeltaE):						#找到maxDeltaE
				maxK = k; maxDeltaE = deltaE; Ej = Ek
		return maxK, Ej										#返回maxK,Ej
	else:   												#没有不为0的误差
		j = selectJrand(i, oS.m)							#随机选择alpha_j的索引值
		Ej = calcEk(oS, j)									#计算Ej
	return j, Ej 											#j,Ej

def updateEk(oS, k):
	"""
	计算Ek,并更新误差缓存
	Parameters:
		oS - 数据结构
		k - 标号为k的数据的索引值
	Returns:
		无
	"""
	Ek = calcEk(oS, k)										#计算Ek
	oS.eCache[k] = [1,Ek]									#更新误差缓存


def clipAlpha(aj,H,L):
	"""
	修剪alpha_j
	Parameters:
	    aj - alpha_j的值
	    H - alpha上限
	    L - alpha下限
	Returns:
	    aj - 修剪后的alpah_j的值
	"""
	if aj > H: 
		aj = H
	if L > aj:
		aj = L
	return aj

def innerL(i, oS):
	"""
	优化的SMO算法
	Parameters:
		i - 标号为i的数据的索引值
		oS - 数据结构
	Returns:
		1 - 有任意一对alpha值发生变化
		0 - 没有任意一对alpha值发生变化或变化太小
	"""
	#步骤1:计算误差Ei
	Ei = calcEk(oS, i)
	#优化alpha,设定一定的容错率。
	if ((oS.labelMat[i] * Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or ((oS.labelMat[i] * Ei > oS.tol) and (oS.alphas[i] > 0)):
		#使用内循环启发方式2选择alpha_j,并计算Ej
		j,Ej = selectJ(i, oS, Ei)
		#保存更新前的aplpha值,使用深拷贝
		alphaIold = oS.alphas[i].copy(); alphaJold = oS.alphas[j].copy();
		#步骤2:计算上下界L和H
		if (oS.labelMat[i] != oS.labelMat[j]):
			L = max(0, oS.alphas[j] - oS.alphas[i])
			H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
		else:
			L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
			H = min(oS.C, oS.alphas[j] + oS.alphas[i])
		if L == H: 
			print("L==H")
			return 0
		#步骤3:计算eta
		eta = 2.0 * oS.K[i,j] - oS.K[i,i] - oS.K[j,j]
		if eta >= 0: 
			print("eta>=0")
			return 0
		#步骤4:更新alpha_j
		oS.alphas[j] -= oS.labelMat[j] * (Ei - Ej)/eta
		#步骤5:修剪alpha_j
		oS.alphas[j] = clipAlpha(oS.alphas[j],H,L)
		#更新Ej至误差缓存
		updateEk(oS, j)
		if (abs(oS.alphas[j] - alphaJold) < 0.00001): 
			print("alpha_j变化太小")
			return 0
		#步骤6:更新alpha_i
		oS.alphas[i] += oS.labelMat[j]*oS.labelMat[i]*(alphaJold - oS.alphas[j])
		#更新Ei至误差缓存
		updateEk(oS, i)
		#步骤7:更新b_1和b_2
		b1 = oS.b - Ei- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i,i] - oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[i,j]
		b2 = oS.b - Ej- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i,j]- oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[j,j]
		#步骤8:根据b_1和b_2更新b
		if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]): oS.b = b1
		elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]): oS.b = b2
		else: oS.b = (b1 + b2)/2.0
		return 1
	else: 
		return 0

def smoP(dataMatIn, classLabels, C, toler, maxIter, kTup = ('lin',0)):
	"""
	完整的线性SMO算法
	Parameters:
		dataMatIn - 数据矩阵
		classLabels - 数据标签
		C - 松弛变量
		toler - 容错率
		maxIter - 最大迭代次数
		kTup - 包含核函数信息的元组
	Returns:
		oS.b - SMO算法计算的b
		oS.alphas - SMO算法计算的alphas
	"""
	oS = optStruct(np.mat(dataMatIn), np.mat(classLabels).transpose(), C, toler, kTup)				#初始化数据结构
	iter = 0 																						#初始化当前迭代次数
	entireSet = True; alphaPairsChanged = 0
	while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):							#遍历整个数据集都alpha也没有更新或者超过最大迭代次数,则退出循环
		alphaPairsChanged = 0
		if entireSet:																				#遍历整个数据集   						
			for i in range(oS.m):        
				alphaPairsChanged += innerL(i,oS)													#使用优化的SMO算法
				print("全样本遍历:第%d次迭代 样本:%d, alpha优化次数:%d" % (iter,i,alphaPairsChanged))
			iter += 1
		else: 																						#遍历非边界值
			nonBoundIs = np.nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]						#遍历不在边界0和C的alpha
			for i in nonBoundIs:
				alphaPairsChanged += innerL(i,oS)
				print("非边界遍历:第%d次迭代 样本:%d, alpha优化次数:%d" % (iter,i,alphaPairsChanged))
			iter += 1
		if entireSet:																				#遍历一次后改为非边界遍历
			entireSet = False
		elif (alphaPairsChanged == 0):																#如果alpha没有更新,计算全样本遍历 
			entireSet = True  
		print("迭代次数: %d" % iter)
	return oS.b,oS.alphas 																			#返回SMO算法计算的b和alphas


def testRbf(k1 = 1.3):
	"""
	测试函数
	Parameters:
		k1 - 使用高斯核函数的时候表示到达率
	Returns:
	    无
	"""
	dataArr,labelArr = loadDataSet('testSetRBF.txt')						#加载训练集
	b,alphas = smoP(dataArr, labelArr, 200, 0.0001, 100, ('rbf', k1))		#根据训练集计算b和alphas
	datMat = np.mat(dataArr); labelMat = np.mat(labelArr).transpose()
	svInd = np.nonzero(alphas.A > 0)[0]										#获得支持向量
	sVs = datMat[svInd] 													
	labelSV = labelMat[svInd];
	print("支持向量个数:%d" % np.shape(sVs)[0])
	m,n = np.shape(datMat)
	errorCount = 0
	for i in range(m):
		kernelEval = kernelTrans(sVs,datMat[i,:],('rbf', k1))				#计算各个点的核
		predict = kernelEval.T * np.multiply(labelSV,alphas[svInd]) + b 	#根据支持向量的点,计算超平面,返回预测结果
		if np.sign(predict) != np.sign(labelArr[i]): errorCount += 1		#返回数组中各元素的正负符号,用1和-1表示,并统计错误个数
	print("训练集错误率: %.2f%%" % ((float(errorCount)/m)*100)) 			#打印错误率
	dataArr,labelArr = loadDataSet('testSetRBF2.txt') 						#加载测试集
	errorCount = 0
	datMat = np.mat(dataArr); labelMat = np.mat(labelArr).transpose() 		
	m,n = np.shape(datMat)
	for i in range(m):
		kernelEval = kernelTrans(sVs,datMat[i,:],('rbf', k1)) 				#计算各个点的核			
		predict=kernelEval.T * np.multiply(labelSV,alphas[svInd]) + b 		#根据支持向量的点,计算超平面,返回预测结果
		if np.sign(predict) != np.sign(labelArr[i]): errorCount += 1    	#返回数组中各元素的正负符号,用1和-1表示,并统计错误个数
	print("测试集错误率: %.2f%%" % ((float(errorCount)/m)*100)) 			#打印错误率


def showDataSet(dataMat, labelMat):
	"""
	数据可视化
	Parameters:
	    dataMat - 数据矩阵
	    labelMat - 数据标签
	Returns:
	    无
	"""
	data_plus = []                                  #正样本
	data_minus = []                                 #负样本
	for i in range(len(dataMat)):
		if labelMat[i] > 0:
			data_plus.append(dataMat[i])
		else:
			data_minus.append(dataMat[i])
	data_plus_np = np.array(data_plus)              #转换为numpy矩阵
	data_minus_np = np.array(data_minus)            #转换为numpy矩阵
	plt.scatter(np.transpose(data_plus_np)[0], np.transpose(data_plus_np)[1])   #正样本散点图
	plt.scatter(np.transpose(data_minus_np)[0], np.transpose(data_minus_np)[1]) #负样本散点图
	plt.show()

if __name__ == '__main__':
	testRbf()

The running result is shown in the figure below:

As you can see, the error rate on the training set is 3% and the error rate on the test set is 4%. You can try different k1 values and observe how the test error rate, the training error rate, and the number of support vectors change with k1. You will find that when k1 is too large, overfitting occurs: the training error rate is low, but the test error rate is high.
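A quick way to run that experiment, assuming the testRbf() function defined above, is simply to sweep k1:

# Hypothetical sweep over k1 to compare error rates and support vector counts.
for k1 in [0.1, 0.5, 1.3, 5.0, 50.0]:
    print("k1 =", k1)
    testRbf(k1)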

Building the SVM classifier with sklearn

First, let's train with the SMO code we wrote ourselves. Create the file svm-digits.py and write the code as follows:

# -*-coding:utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import random



class optStruct:
	"""
	数据结构,维护所有需要操作的值
	Parameters:
		dataMatIn - 数据矩阵
		classLabels - 数据标签
		C - 松弛变量
		toler - 容错率
		kTup - 包含核函数信息的元组,第一个参数存放核函数类别,第二个参数存放必要的核函数需要用到的参数
	"""
	def __init__(self, dataMatIn, classLabels, C, toler, kTup):
		self.X = dataMatIn								#数据矩阵
		self.labelMat = classLabels						#数据标签
		self.C = C 										#松弛变量
		self.tol = toler 								#容错率
		self.m = np.shape(dataMatIn)[0] 				#数据矩阵行数
		self.alphas = np.mat(np.zeros((self.m,1))) 		#根据矩阵行数初始化alpha参数为0	
		self.b = 0 										#初始化b参数为0
		self.eCache = np.mat(np.zeros((self.m,2))) 		#根据矩阵行数初始化虎误差缓存,第一列为是否有效的标志位,第二列为实际的误差E的值。
		self.K = np.mat(np.zeros((self.m,self.m)))		#初始化核K
		for i in range(self.m):							#计算所有数据的核K
			self.K[:,i] = kernelTrans(self.X, self.X[i,:], kTup)

def kernelTrans(X, A, kTup): 
	"""
	通过核函数将数据转换更高维的空间
	Parameters:
		X - 数据矩阵
		A - 单个数据的向量
		kTup - 包含核函数信息的元组
	Returns:
	    K - 计算的核K
	"""
	m,n = np.shape(X)
	K = np.mat(np.zeros((m,1)))
	if kTup[0] == 'lin': K = X * A.T   					#线性核函数,只进行内积。
	elif kTup[0] == 'rbf': 								#高斯核函数,根据高斯核函数公式进行计算
		for j in range(m):
			deltaRow = X[j,:] - A
			K[j] = deltaRow*deltaRow.T
		K = np.exp(K/(-1*kTup[1]**2)) 					#计算高斯核K
	else: raise NameError('核函数无法识别')
	return K 											#返回计算的核K

def loadDataSet(fileName):
	"""
	读取数据
	Parameters:
	    fileName - 文件名
	Returns:
	    dataMat - 数据矩阵
	    labelMat - 数据标签
	"""
	dataMat = []; labelMat = []
	fr = open(fileName)
	for line in fr.readlines():                                     #逐行读取,滤除空格等
		lineArr = line.strip().split('\t')
		dataMat.append([float(lineArr[0]), float(lineArr[1])])      #添加数据
		labelMat.append(float(lineArr[2]))                          #添加标签
	return dataMat,labelMat

def calcEk(oS, k):
	"""
	计算误差
	Parameters:
		oS - 数据结构
		k - 标号为k的数据
	Returns:
	    Ek - 标号为k的数据误差
	"""
	fXk = float(np.multiply(oS.alphas,oS.labelMat).T*oS.K[:,k] + oS.b)
	Ek = fXk - float(oS.labelMat[k])
	return Ek

def selectJrand(i, m):
	"""
	函数说明:随机选择alpha_j的索引值

	Parameters:
	    i - alpha_i的索引值
	    m - alpha参数个数
	Returns:
	    j - alpha_j的索引值
	"""
	j = i                                 #选择一个不等于i的j
	while (j == i):
		j = int(random.uniform(0, m))
	return j

def selectJ(i, oS, Ei):
	"""
	内循环启发方式2
	Parameters:
		i - 标号为i的数据的索引值
		oS - 数据结构
		Ei - 标号为i的数据误差
	Returns:
	    j, maxK - 标号为j或maxK的数据的索引值
	    Ej - 标号为j的数据误差
	"""
	maxK = -1; maxDeltaE = 0; Ej = 0 						#初始化
	oS.eCache[i] = [1,Ei]  									#根据Ei更新误差缓存
	validEcacheList = np.nonzero(oS.eCache[:,0].A)[0]		#返回误差不为0的数据的索引值
	if (len(validEcacheList)) > 1:							#有不为0的误差
		for k in validEcacheList:   						#遍历,找到最大的Ek
			if k == i: continue 							#不计算i,浪费时间
			Ek = calcEk(oS, k)								#计算Ek
			deltaE = abs(Ei - Ek)							#计算|Ei-Ek|
			if (deltaE > maxDeltaE):						#找到maxDeltaE
				maxK = k; maxDeltaE = deltaE; Ej = Ek
		return maxK, Ej										#返回maxK,Ej
	else:   												#没有不为0的误差
		j = selectJrand(i, oS.m)							#随机选择alpha_j的索引值
		Ej = calcEk(oS, j)									#计算Ej
	return j, Ej 											#j,Ej

def updateEk(oS, k):
	"""
	计算Ek,并更新误差缓存
	Parameters:
		oS - 数据结构
		k - 标号为k的数据的索引值
	Returns:
		无
	"""
	Ek = calcEk(oS, k)										#计算Ek
	oS.eCache[k] = [1,Ek]									#更新误差缓存


def clipAlpha(aj,H,L):
	"""
	修剪alpha_j
	Parameters:
	    aj - alpha_j的值
	    H - alpha上限
	    L - alpha下限
	Returns:
	    aj - 修剪后的alpah_j的值
	"""
	if aj > H: 
		aj = H
	if L > aj:
		aj = L
	return aj

def innerL(i, oS):
	"""
	优化的SMO算法
	Parameters:
		i - 标号为i的数据的索引值
		oS - 数据结构
	Returns:
		1 - 有任意一对alpha值发生变化
		0 - 没有任意一对alpha值发生变化或变化太小
	"""
	#步骤1:计算误差Ei
	Ei = calcEk(oS, i)
	#优化alpha,设定一定的容错率。
	if ((oS.labelMat[i] * Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or ((oS.labelMat[i] * Ei > oS.tol) and (oS.alphas[i] > 0)):
		#使用内循环启发方式2选择alpha_j,并计算Ej
		j,Ej = selectJ(i, oS, Ei)
		#保存更新前的aplpha值,使用深拷贝
		alphaIold = oS.alphas[i].copy(); alphaJold = oS.alphas[j].copy();
		#步骤2:计算上下界L和H
		if (oS.labelMat[i] != oS.labelMat[j]):
			L = max(0, oS.alphas[j] - oS.alphas[i])
			H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
		else:
			L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
			H = min(oS.C, oS.alphas[j] + oS.alphas[i])
		if L == H: 
			print("L==H")
			return 0
		#步骤3:计算eta
		eta = 2.0 * oS.K[i,j] - oS.K[i,i] - oS.K[j,j]
		if eta >= 0: 
			print("eta>=0")
			return 0
		#步骤4:更新alpha_j
		oS.alphas[j] -= oS.labelMat[j] * (Ei - Ej)/eta
		#步骤5:修剪alpha_j
		oS.alphas[j] = clipAlpha(oS.alphas[j],H,L)
		#更新Ej至误差缓存
		updateEk(oS, j)
		if (abs(oS.alphas[j] - alphaJold) < 0.00001): 
			print("alpha_j变化太小")
			return 0
		#步骤6:更新alpha_i
		oS.alphas[i] += oS.labelMat[j]*oS.labelMat[i]*(alphaJold - oS.alphas[j])
		#更新Ei至误差缓存
		updateEk(oS, i)
		#步骤7:更新b_1和b_2
		b1 = oS.b - Ei- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i,i] - oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[i,j]
		b2 = oS.b - Ej- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.K[i,j]- oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.K[j,j]
		#步骤8:根据b_1和b_2更新b
		if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]): oS.b = b1
		elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]): oS.b = b2
		else: oS.b = (b1 + b2)/2.0
		return 1
	else: 
		return 0

def smoP(dataMatIn, classLabels, C, toler, maxIter, kTup = ('lin',0)):
	"""
	完整的线性SMO算法
	Parameters:
		dataMatIn - 数据矩阵
		classLabels - 数据标签
		C - 松弛变量
		toler - 容错率
		maxIter - 最大迭代次数
		kTup - 包含核函数信息的元组
	Returns:
		oS.b - SMO算法计算的b
		oS.alphas - SMO算法计算的alphas
	"""
	oS = optStruct(np.mat(dataMatIn), np.mat(classLabels).transpose(), C, toler, kTup)				#初始化数据结构
	iter = 0 																						#初始化当前迭代次数
	entireSet = True; alphaPairsChanged = 0
	while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):							#遍历整个数据集都alpha也没有更新或者超过最大迭代次数,则退出循环
		alphaPairsChanged = 0
		if entireSet:																				#遍历整个数据集   						
			for i in range(oS.m):        
				alphaPairsChanged += innerL(i,oS)													#使用优化的SMO算法
				print("全样本遍历:第%d次迭代 样本:%d, alpha优化次数:%d" % (iter,i,alphaPairsChanged))
			iter += 1
		else: 																						#遍历非边界值
			nonBoundIs = np.nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]						#遍历不在边界0和C的alpha
			for i in nonBoundIs:
				alphaPairsChanged += innerL(i,oS)
				print("非边界遍历:第%d次迭代 样本:%d, alpha优化次数:%d" % (iter,i,alphaPairsChanged))
			iter += 1
		if entireSet:																				#遍历一次后改为非边界遍历
			entireSet = False
		elif (alphaPairsChanged == 0):																#如果alpha没有更新,计算全样本遍历 
			entireSet = True  
		print("迭代次数: %d" % iter)
	return oS.b,oS.alphas 																			#返回SMO算法计算的b和alphas


def img2vector(filename):
	"""
	将32x32的二进制图像转换为1x1024向量。
	Parameters:
		filename - 文件名
	Returns:
		returnVect - 返回的二进制图像的1x1024向量
	"""
	returnVect = np.zeros((1,1024))
	fr = open(filename)
	for i in range(32):
		lineStr = fr.readline()
		for j in range(32):
			returnVect[0,32*i+j] = int(lineStr[j])
	return returnVect

def loadImages(dirName):
	"""
	加载图片
	Parameters:
		dirName - 文件夹的名字
	Returns:
	    trainingMat - 数据矩阵
	    hwLabels - 数据标签
	"""
	from os import listdir
	hwLabels = []
	trainingFileList = listdir(dirName)           
	m = len(trainingFileList)
	trainingMat = np.zeros((m,1024))
	for i in range(m):
		fileNameStr = trainingFileList[i]
		fileStr = fileNameStr.split('.')[0]     
		classNumStr = int(fileStr.split('_')[0])
		if classNumStr == 9: hwLabels.append(-1)
		else: hwLabels.append(1)
		trainingMat[i,:] = img2vector('%s/%s' % (dirName, fileNameStr))
	return trainingMat, hwLabels    

def testDigits(kTup=('rbf', 10)):
	"""
	测试函数
	Parameters:
		kTup - 包含核函数信息的元组
	Returns:
	    无
	"""
	dataArr,labelArr = loadImages('trainingDigits')
	b,alphas = smoP(dataArr, labelArr, 200, 0.0001, 10, kTup)
	datMat = np.mat(dataArr); labelMat = np.mat(labelArr).transpose()
	svInd = np.nonzero(alphas.A>0)[0]
	sVs=datMat[svInd] 
	labelSV = labelMat[svInd];
	print("支持向量个数:%d" % np.shape(sVs)[0])
	m,n = np.shape(datMat)
	errorCount = 0
	for i in range(m):
		kernelEval = kernelTrans(sVs,datMat[i,:],kTup)
		predict=kernelEval.T * np.multiply(labelSV,alphas[svInd]) + b
		if np.sign(predict) != np.sign(labelArr[i]): errorCount += 1
	print("训练集错误率: %.2f%%" % (float(errorCount)/m))
	dataArr,labelArr = loadImages('testDigits')
	errorCount = 0
	datMat = np.mat(dataArr); labelMat = np.mat(labelArr).transpose()
	m,n = np.shape(datMat)
	for i in range(m):
		kernelEval = kernelTrans(sVs,datMat[i,:],kTup)
		predict=kernelEval.T * np.multiply(labelSV,alphas[svInd]) + b
		if np.sign(predict) != np.sign(labelArr[i]): errorCount += 1    
	print("测试集错误率: %.2f%%" % (float(errorCount)/m))

if __name__ == '__main__':
	testDigits()

The implementation of the SMO algorithm is the same as above. We added the new functions img2vector(), loadImages(), and testDigits(), which handle binary-image conversion, image loading, and training of the SVM classifier respectively. Our hand-written SVM classifier is a binary classifier, so when setting the labels, 9 is used as the negative class and the remaining digits 0-8 as the positive class. This is the 'ovr' (one vs rest) idea: one category is classified against all remaining categories. To recognize all 10 digits, a simple method is to train 10 such classifiers, one per digit. For simplicity, only the classifier separating 9 from all other digits is trained here.
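For illustration, here is a hypothetical sketch of that one-vs-rest idea with scikit-learn, training one binary RBF-SVM per digit (the next section instead lets SVC handle the multi-class case internally):

import numpy as np
from sklearn.svm import SVC

def train_ovr(X, y, num_classes=10):
    """Train one binary classifier per digit: +1 for that digit, -1 for all the rest."""
    models = []
    for c in range(num_classes):
        binary_labels = np.where(np.asarray(y) == c, 1, -1)
        clf = SVC(kernel='rbf', C=200)
        clf.fit(X, binary_labels)
        models.append(clf)
    return models

def predict_ovr(models, X):
    # Pick the digit whose classifier returns the largest decision value.
    scores = np.column_stack([m.decision_function(X) for m in models])
    return scores.argmax(axis=1)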
 

The result of the operation is as follows: 

The svm-smo.py code is as follows

# -*-coding:utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
import random


class optStruct:
	"""
	数据结构,维护所有需要操作的值
	Parameters:
		dataMatIn - 数据矩阵
		classLabels - 数据标签
		C - 松弛变量
		toler - 容错率
	"""
	def __init__(self, dataMatIn, classLabels, C, toler):
		self.X = dataMatIn								#数据矩阵
		self.labelMat = classLabels						#数据标签
		self.C = C 										#松弛变量
		self.tol = toler 								#容错率
		self.m = np.shape(dataMatIn)[0] 				#数据矩阵行数
		self.alphas = np.mat(np.zeros((self.m,1))) 		#根据矩阵行数初始化alpha参数为0	
		self.b = 0 										#初始化b参数为0
		self.eCache = np.mat(np.zeros((self.m,2))) 		#根据矩阵行数初始化虎误差缓存,第一列为是否有效的标志位,第二列为实际的误差E的值。

def loadDataSet(fileName):
	"""
	读取数据
	Parameters:
	    fileName - 文件名
	Returns:
	    dataMat - 数据矩阵
	    labelMat - 数据标签
	"""
	dataMat = []; labelMat = []
	fr = open(fileName)
	for line in fr.readlines():                                     #逐行读取,滤除空格等
		lineArr = line.strip().split('\t')
		dataMat.append([float(lineArr[0]), float(lineArr[1])])      #添加数据
		labelMat.append(float(lineArr[2]))                          #添加标签
	return dataMat,labelMat

def calcEk(oS, k):
	"""
	计算误差
	Parameters:
		oS - 数据结构
		k - 标号为k的数据
	Returns:
	    Ek - 标号为k的数据误差
	"""
	fXk = float(np.multiply(oS.alphas,oS.labelMat).T*(oS.X*oS.X[k,:].T) + oS.b)
	Ek = fXk - float(oS.labelMat[k])
	return Ek

def selectJrand(i, m):
	"""
	函数说明:随机选择alpha_j的索引值

	Parameters:
	    i - alpha_i的索引值
	    m - alpha参数个数
	Returns:
	    j - alpha_j的索引值
	"""
	j = i                                 #选择一个不等于i的j
	while (j == i):
		j = int(random.uniform(0, m))
	return j

def selectJ(i, oS, Ei):
	"""
	内循环启发方式2
	Parameters:
		i - 标号为i的数据的索引值
		oS - 数据结构
		Ei - 标号为i的数据误差
	Returns:
	    j, maxK - 标号为j或maxK的数据的索引值
	    Ej - 标号为j的数据误差
	"""
	maxK = -1; maxDeltaE = 0; Ej = 0 						#初始化
	oS.eCache[i] = [1,Ei]  									#根据Ei更新误差缓存
	validEcacheList = np.nonzero(oS.eCache[:,0].A)[0]		#返回误差不为0的数据的索引值
	if (len(validEcacheList)) > 1:							#有不为0的误差
		for k in validEcacheList:   						#遍历,找到最大的Ek
			if k == i: continue 							#不计算i,浪费时间
			Ek = calcEk(oS, k)								#计算Ek
			deltaE = abs(Ei - Ek)							#计算|Ei-Ek|
			if (deltaE > maxDeltaE):						#找到maxDeltaE
				maxK = k; maxDeltaE = deltaE; Ej = Ek
		return maxK, Ej										#返回maxK,Ej
	else:   												#没有不为0的误差
		j = selectJrand(i, oS.m)							#随机选择alpha_j的索引值
		Ej = calcEk(oS, j)									#计算Ej
	return j, Ej 											#j,Ej

def updateEk(oS, k):
	"""
	计算Ek,并更新误差缓存
	Parameters:
		oS - 数据结构
		k - 标号为k的数据的索引值
	Returns:
		无
	"""
	Ek = calcEk(oS, k)										#计算Ek
	oS.eCache[k] = [1,Ek]									#更新误差缓存


def clipAlpha(aj,H,L):
	"""
	修剪alpha_j
	Parameters:
	    aj - alpha_j的值
	    H - alpha上限
	    L - alpha下限
	Returns:
	    aj - 修剪后的alpah_j的值
	"""
	if aj > H: 
		aj = H
	if L > aj:
		aj = L
	return aj

def innerL(i, oS):
	"""
	优化的SMO算法
	Parameters:
		i - 标号为i的数据的索引值
		oS - 数据结构
	Returns:
		1 - 有任意一对alpha值发生变化
		0 - 没有任意一对alpha值发生变化或变化太小
	"""
	#步骤1:计算误差Ei
	Ei = calcEk(oS, i)
	#优化alpha,设定一定的容错率。
	if ((oS.labelMat[i] * Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or ((oS.labelMat[i] * Ei > oS.tol) and (oS.alphas[i] > 0)):
		#使用内循环启发方式2选择alpha_j,并计算Ej
		j,Ej = selectJ(i, oS, Ei)
		#保存更新前的aplpha值,使用深拷贝
		alphaIold = oS.alphas[i].copy(); alphaJold = oS.alphas[j].copy();
		#步骤2:计算上下界L和H
		if (oS.labelMat[i] != oS.labelMat[j]):
			L = max(0, oS.alphas[j] - oS.alphas[i])
			H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
		else:
			L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
			H = min(oS.C, oS.alphas[j] + oS.alphas[i])
		if L == H: 
			print("L==H")
			return 0
		#步骤3:计算eta
		eta = 2.0 * oS.X[i,:] * oS.X[j,:].T - oS.X[i,:] * oS.X[i,:].T - oS.X[j,:] * oS.X[j,:].T
		if eta >= 0: 
			print("eta>=0")
			return 0
		#步骤4:更新alpha_j
		oS.alphas[j] -= oS.labelMat[j] * (Ei - Ej)/eta
		#步骤5:修剪alpha_j
		oS.alphas[j] = clipAlpha(oS.alphas[j],H,L)
		#更新Ej至误差缓存
		updateEk(oS, j)
		if (abs(oS.alphas[j] - alphaJold) < 0.00001): 
			print("alpha_j变化太小")
			return 0
		#步骤6:更新alpha_i
		oS.alphas[i] += oS.labelMat[j]*oS.labelMat[i]*(alphaJold - oS.alphas[j])
		#更新Ei至误差缓存
		updateEk(oS, i)
		#步骤7:更新b_1和b_2
		b1 = oS.b - Ei- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.X[i,:]*oS.X[i,:].T - oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.X[i,:]*oS.X[j,:].T
		b2 = oS.b - Ej- oS.labelMat[i]*(oS.alphas[i]-alphaIold)*oS.X[i,:]*oS.X[j,:].T - oS.labelMat[j]*(oS.alphas[j]-alphaJold)*oS.X[j,:]*oS.X[j,:].T
		#步骤8:根据b_1和b_2更新b
		if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]): oS.b = b1
		elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]): oS.b = b2
		else: oS.b = (b1 + b2)/2.0
		return 1
	else: 
		return 0

def smoP(dataMatIn, classLabels, C, toler, maxIter):
	"""
	完整的线性SMO算法
	Parameters:
		dataMatIn - 数据矩阵
		classLabels - 数据标签
		C - 松弛变量
		toler - 容错率
		maxIter - 最大迭代次数
	Returns:
		oS.b - SMO算法计算的b
		oS.alphas - SMO算法计算的alphas
	"""
	oS = optStruct(np.mat(dataMatIn), np.mat(classLabels).transpose(), C, toler)					#初始化数据结构
	iter = 0 																						#初始化当前迭代次数
	entireSet = True; alphaPairsChanged = 0
	while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):							#遍历整个数据集都alpha也没有更新或者超过最大迭代次数,则退出循环
		alphaPairsChanged = 0
		if entireSet:																				#遍历整个数据集   						
			for i in range(oS.m):        
				alphaPairsChanged += innerL(i,oS)													#使用优化的SMO算法
				print("全样本遍历:第%d次迭代 样本:%d, alpha优化次数:%d" % (iter,i,alphaPairsChanged))
			iter += 1
		else: 																						#遍历非边界值
			nonBoundIs = np.nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]						#遍历不在边界0和C的alpha
			for i in nonBoundIs:
				alphaPairsChanged += innerL(i,oS)
				print("非边界遍历:第%d次迭代 样本:%d, alpha优化次数:%d" % (iter,i,alphaPairsChanged))
			iter += 1
		if entireSet:																				#遍历一次后改为非边界遍历
			entireSet = False
		elif (alphaPairsChanged == 0):																#如果alpha没有更新,计算全样本遍历 
			entireSet = True  
		print("迭代次数: %d" % iter)
	return oS.b,oS.alphas 																			#返回SMO算法计算的b和alphas


def showClassifer(dataMat, classLabels, w, b):
	"""
	分类结果可视化
	Parameters:
		dataMat - 数据矩阵
	    w - 直线法向量
	    b - 直线解决
	Returns:
	    无
	"""
	#绘制样本点
	data_plus = []                                  #正样本
	data_minus = []                                 #负样本
	for i in range(len(dataMat)):
		if classLabels[i] > 0:
			data_plus.append(dataMat[i])
		else:
			data_minus.append(dataMat[i])
	data_plus_np = np.array(data_plus)              #转换为numpy矩阵
	data_minus_np = np.array(data_minus)            #转换为numpy矩阵
	plt.scatter(np.transpose(data_plus_np)[0], np.transpose(data_plus_np)[1], s=30, alpha=0.7)   #正样本散点图
	plt.scatter(np.transpose(data_minus_np)[0], np.transpose(data_minus_np)[1], s=30, alpha=0.7) #负样本散点图
	#绘制直线
	x1 = max(dataMat)[0]
	x2 = min(dataMat)[0]
	a1, a2 = w
	b = float(b)
	a1 = float(a1[0])
	a2 = float(a2[0])
	y1, y2 = (-b- a1*x1)/a2, (-b - a1*x2)/a2
	plt.plot([x1, x2], [y1, y2])
	#找出支持向量点
	for i, alpha in enumerate(alphas):
		if alpha > 0:
			x, y = dataMat[i]
			plt.scatter([x], [y], s=150, c='none', alpha=0.7, linewidth=1.5, edgecolor='red')
	plt.show()


def calcWs(alphas,dataArr,classLabels):
	"""
	计算w
	Parameters:
		dataArr - 数据矩阵
	    classLabels - 数据标签
	    alphas - alphas值
	Returns:
	    w - 计算得到的w
	"""
	X = np.mat(dataArr); labelMat = np.mat(classLabels).transpose()
	m,n = np.shape(X)
	w = np.zeros((n,1))
	for i in range(m):
		w += np.multiply(alphas[i]*labelMat[i],X[i,:].T)
	return w

if __name__ == '__main__':
	dataArr, classLabels = loadDataSet('testSetRBF.txt')
	b, alphas = smoP(dataArr, classLabels, 0.6, 0.001, 40)
	w = calcWs(alphas,dataArr, classLabels)
	showClassifer(dataArr, classLabels, w, b)

Sklearn.svm.SVC

Official English documentation manual: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

The sklearn.svm module provides many models for us to use. This article uses svm.SVC, which is implemented based on libsvm.

 Let us first look at the SVC function, which has 14 parameters:

The parameters are described as follows (a short example of instantiating SVC follows the list):

  • C: penalty parameter, float, optional, default 1.0. The larger C is, the heavier the penalty for misclassified samples, so accuracy on the training samples is higher but generalization is weaker, that is, classification accuracy on test data drops. Conversely, a smaller C tolerates some misclassified samples in the training set and generalizes better. For noisy training data the latter is generally preferred, treating the misclassified training samples as noise.
  • kernel: kernel type, str, default 'rbf'. The options are:
    'linear': linear kernel
    'poly': polynomial kernel
    'rbf': radial basis function (Gaussian) kernel
    'sigmoid': sigmoid kernel
    'precomputed': kernel matrix
    'precomputed' means the kernel matrix is computed in advance; the algorithm then no longer evaluates a kernel function but directly uses the matrix you supply, which must be n×n.
  • degree: order of the polynomial kernel, int, optional, default 3. This parameter is only used by the polynomial kernel, where it is the polynomial order n; it is ignored for other kernels.
  • gamma: kernel coefficient, float, optional, default 'auto'. Only valid for 'rbf', 'poly' and 'sigmoid'. When gamma is 'auto', its value is the reciprocal of the number of features, i.e. 1/n_features.
  • coef0: independent term of the kernel function, float, optional, default 0.0. Only valid for the 'poly' and 'sigmoid' kernels, where it corresponds to the constant c.
  • probability: whether to enable probability estimates, bool, optional, default False. It must be enabled before calling fit() and slows that method down.
  • shrinking: whether to use the shrinking heuristic, bool, optional, default True.
  • tol: error tolerance for stopping the training, float, optional, default 1e-3.
  • cache_size: kernel cache size, float, optional, default 200. Specifies the memory used during training, in MB.
  • class_weight: class weights, dict or str, optional, default None. Sets a different penalty parameter C for each class; if not given, all classes get weight 1, i.e. the C given by the parameter above. If 'balanced' is given, the values of y are used to automatically adjust the weights inversely proportional to the class frequencies in the input data.
  • verbose: whether to enable verbose output, bool, default False. This setting relies on the per-process runtime settings in libsvm and may not work properly in a multi-threaded context. It is usually left as False.
  • max_iter: maximum number of iterations, int, default -1, meaning unlimited.
  • decision_function_shape: decision function type, either 'ovo' or 'ovr', default 'ovr'. 'ovo' means one vs one, 'ovr' means one vs rest.
  • random_state: seed used when shuffling the data, int, optional, default None. It seeds the pseudo-random number generator used for probability estimation.

In fact, once you have written the SMO algorithm yourself, the meaning of each parameter is easy to understand.
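As a small illustration (the values here are arbitrary examples, not tuned settings), an SVC might be instantiated like this:

from sklearn.svm import SVC

clf = SVC(C=200, kernel='rbf', gamma='auto', tol=1e-3,
          cache_size=200, class_weight=None,
          decision_function_shape='ovr', random_state=None)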

Writing the code

SVC is very powerful: we do not need to know the details of the algorithm's implementation or its optimization method, and it also covers our multi-class needs. Create the file svm-svc.py, write the code as follows, and point the test directory to the binarized data produced above:

# -*- coding: UTF-8 -*-
import numpy as np
import operator
from os import listdir
from sklearn.svm import SVC


def img2vector(filename):
	"""
	Convert a 32x32 binary image into a 1x1024 vector.
	Parameters:
		filename - file name
	Returns:
		returnVect - the 1x1024 vector of the binary image
	"""
	#create a 1x1024 zero vector
	returnVect = np.zeros((1, 1024))
	#open the file
	fr = open(filename)
	#read line by line
	for i in range(32):
		#read one line
		lineStr = fr.readline()
		#append the first 32 elements of the line to returnVect
		for j in range(32):
			returnVect[0, 32*i+j] = int(lineStr[j])
	#return the converted 1x1024 vector
	return returnVect

def handwritingClassTest():

	#labels of the training set
	hwLabels = []
	#list the files in the trainingDigits directory
	trainingFileList = listdir('trainingDigits')
	#number of files in the folder
	m = len(trainingFileList)
	#initialize the training matrix
	trainingMat = np.zeros((m, 1024))
	#parse the class of each training sample from its file name
	for i in range(m):
		#get the file name
		fileNameStr = trainingFileList[i]
		#get the digit class
		classNumber = int(fileNameStr.split('_')[0])
		#append the class to hwLabels
		hwLabels.append(classNumber)
		#store the 1x1024 vector of each file in the trainingMat matrix
		trainingMat[i,:] = img2vector('trainingDigits/%s' % (fileNameStr))
	clf = SVC(C=200, kernel='rbf')
	clf.fit(trainingMat, hwLabels)
	#list the files in the num_two directory (the binarized contour images)
	testFileList = listdir('num_two')
	#error counter
	errorCount = 0.0
	#number of test samples
	mTest = len(testFileList)
	#parse the class of each test sample from its file name and classify it
	for i in range(mTest):
		#get the file name
		fileNameStr = testFileList[i]
		#get the digit class
		classNumber = int(fileNameStr.split('_')[0])
		#get the 1x1024 vector of the test sample
		vectorUnderTest = img2vector('num_two/%s' % (fileNameStr))
		#get the prediction
		# classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
		classifierResult = clf.predict(vectorUnderTest)
		print("predicted result: %d\ttrue result: %d" % (classifierResult, classNumber))
		if(classifierResult != classNumber):
			errorCount += 1.0
	print("misclassified %d samples in total\nerror rate: %f%%" % (errorCount, errorCount/mTest * 100))
if __name__ == '__main__':
	handwritingClassTest()

The running results are as follows: only 4 digits were predicted correctly here. I suspect that the digits I drew were too thin, which is why the result is not very good.


Summary

Advantages and Disadvantages of SVM


Advantages

  • It can be used for linear and nonlinear classification, as well as for regression. The generalization error rate is low; in other words, it learns well and the learned model generalizes well.
  • It can solve machine learning problems with small samples, handle high-dimensional problems, and avoid the structure-selection and local-minimum problems of neural networks.
  • SVM is the best off-the-shelf classifier; "off-the-shelf" means it can be used directly without modification. It achieves a low error rate and can make good classification decisions for data points outside the training set.

Disadvantages

  • Sensitive to parameter tuning and to the choice of kernel function.

Reference article:

Python3 "Machine Learning in Action" Study Notes (9): Support Vector Machine in Practice, Hand-Coding a Nonlinear SVM, Jack-Cui's Blog, CSDN Blog


Origin: blog.csdn.net/weixin_45897172/article/details/131021053