Feature extraction and classification experiment on the Iris flower data set based on SVM

Introduction to the experiment

  The data set used in this experiment is a classic one, and the experiment classifies the features of the Iris data set with a support vector machine (SVM). The SVM is implemented with the Sklearn (scikit-learn) library, and the extracted features are fed to it for classification. The results are presented with data visualization so that they are easy to inspect.

  First, the Iris data set is also known as Anderson's Iris flower data set. It contains 150 samples, one per row. Each row holds the four features of a sample plus its category label, so the data set is a two-dimensional table with 150 rows and 5 columns, commonly used in multi-class model experiments.

  The content of the data set deserves a closer look: each sample carries four features (the first 4 columns): sepal length, sepal width, petal length, and petal width. We need to build a classifier that can use these four features to decide whether a sample belongs to Iris setosa, Iris versicolor, or Iris virginica ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'). By default the category is stored as a string (str) in the fifth column. Since class labels in classification must be numeric, the strings in the fifth column of Iris.data have to be converted to numbers, which is a manual data cleaning step.

  After the data is cleaned, the data set is used to build the classifier. Sklearn makes it easy to split the original data set into a training set and a test set in a given proportion. In this experiment, 5 groups of split ratios (the default plus personal settings) are used as control experiments. Once the training and test sets are ready, the SVM model can be trained. The choice of kernel function and parameters directly affects the generalization error and prediction accuracy of the model; following my earlier articles, four groups of kernel/parameter settings are compared here. After the model is trained, the prediction regions and the samples are shown with visualization methods, so the prediction accuracy of the model can be observed more intuitively. The specific control groups and the experimental procedure are described in the text below.

Lab environment

Software Environment:

  Windows 10 operating system, Pycharm IDE environment.

Hardware environment:

  12-thread CPU, 16 GB memory, 6 GB video memory

Third-party library:

  1. Sklearn (scikit-learn) library: used to build the SVM model and split the data set
  2. Numpy library: used for basic data cleaning and data format handling, and also in the data visualization step
  3. Matplotlib library: used for data visualization, to make comparing the experimental results easier

Experiment goal

  We have obtained the Iris data set, which contains three different kinds of iris flowers. We plan to classify them with an SVM, and use this set of experiments to evaluate the classification performance of the SVM and the influence of the parameter settings on the model's generalization error.

Experiment procedure

  First of all, the data set is a file in .data format. On inspection it contains 4 feature columns and 1 label column, and the labels are given as strings, so the data set needs cleaning and feature extraction. To make the later visualization easier, only the first two feature dimensions are used in this experiment.

  Data cleaning can be done with a dictionary lookup, passed to the converters parameter of the loadtxt function. Separating the fifth (label) column uses the split function from the numpy library, which splits the first four feature columns off from the remaining part. A somewhat uncommon slice notation (4,) appears here, so it deserves a special note: it means the array is cut at column index 4, producing two arrays stored in two variables (note that the result is two outputs). With this function we can separate the data from the labels while keeping their one-to-one correspondence, with relative positions unchanged.
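
  As a minimal toy illustration of the (4,) notation (the small array below is made up for demonstration; the real loading code appears later):

import numpy as np

# A toy 2-row stand-in for the cleaned Iris data: 4 feature columns + 1 numeric label column
demo = np.array([[5.1, 3.5, 1.4, 0.2, 0.0],
                 [7.0, 3.2, 4.7, 1.4, 1.0]])
# indices_or_sections=(4,) cuts before column index 4: columns 0-3 go to x_demo, column 4 goes to y_demo
x_demo, y_demo = np.split(demo, indices_or_sections=(4,), axis=1)
print(x_demo.shape, y_demo.shape)  # (2, 4) (2, 1)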

  To make the results significant, and to ensure the training and test sets have consistent data distributions, the train_test_split function is used to divide the original data set into a training set and a test set. This method has a default allocation ratio (training set : test set = 7.5 : 2.5), but it can also be changed through its parameters. In this experiment, 5 groups of ratios were compared, as sketched below. Since the data set contains two parts (data and labels), 4 variables are needed to receive the return values (train_data, test_data, train_label, test_label).
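
  A minimal sketch of how the five ratio groups can be compared (the list of proportions here is illustrative, not the exact values of the report; x and y are the feature and label arrays produced by np.split in the code further below):

from sklearn.model_selection import train_test_split

# Illustrative training-set proportions for the five control groups
for ratio in (0.5, 0.6, 0.7, 0.75, 0.8):
    train_data, test_data, train_label, test_label = train_test_split(
        x, y, random_state=1, train_size=ratio)
    print("train_size =", ratio, ":", train_data.shape[0], "train /", test_data.shape[0], "test")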

  Once the data processing above is finished, we can start building the SVM classifier. Sklearn provides a ready-made classifier implementation; I will introduce its parameter configuration and the design of the control groups here. The classifier has 3 main parameters in this experiment (the multi-class strategy decision_function_shape, the penalty parameter C, and the kernel choice kernel). When the chosen kernel is the rbf Gaussian kernel, the gamma parameter (the radial basis kernel coefficient) also has to be configured. In this experiment C is set to 5 and the multi-class strategy is one-vs-rest ('ovr'); these two stay fixed as control variables. The control groups use Gaussian kernels (with different gamma values) and a linear kernel.
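
  A hedged sketch of how the kernel control groups can be run with C=5 and 'ovr' held fixed (the gamma values match the charts later; train_data, train_label, test_data, test_label come from the split described above):

from sklearn import svm

# Each entry is one control group; gamma is only meaningful for the rbf kernel
configs = [dict(kernel='rbf', gamma=1), dict(kernel='rbf', gamma=5),
           dict(kernel='rbf', gamma=10), dict(kernel='rbf', gamma=20),
           dict(kernel='linear')]
for cfg in configs:
    clf = svm.SVC(C=5, decision_function_shape='ovr', **cfg)
    clf.fit(train_data, train_label.ravel())
    print(cfg,
          format(clf.score(train_data, train_label), '.3f'),
          format(clf.score(test_data, test_label), '.3f'))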

  After designing the control groups, run the experiments for each group and compare the resulting data. A visual display is also given.

Experiment code

  Imports of the third-party libraries:

# Author:JinyuZ1996
# Creation date:2020/8/10 10:24
# -*- coding:utf-8 -*-

from sklearn import svm
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split

  Function definition part (the lack of function encapsulation in this experiment is a shortcoming to be improved later):

# Function definition part

# Converter dictionary used to map the iris class name to a number
# (this serves as the dictionary for the converters parameter of the loadtxt function)
def Sort_dic(type):
    it = {b'Iris-setosa': 0, b'Iris-versicolor': 1, b'Iris-virginica': 2}
    return it[type]

  Specific implementation part (in fact, in this experiment I focused on the visualization part below and on setting up the control groups):

  About the classifier definition in the next part: the parameters deserve a more detailed explanation, since the overview above was not detailed enough.

  First, regarding the kernel choice: when kernel='linear', a linear kernel is used; the larger C is, the better the fit on the training data, but the model may overfit (default C=1). When kernel='rbf' (the default), a Gaussian kernel is used; the smaller the gamma value, the smoother and more continuous the classification boundary, while the larger the gamma value, the more "scattered" the boundary becomes and the better it fits the training data, but it may overfit (this can be seen in the figures below).
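
  To make the effect of gamma concrete, here is a small illustrative snippet (not part of the original experiment) that evaluates the RBF kernel value k(a, b) = exp(-gamma * ||a - b||^2) for one fixed pair of points; the larger gamma is, the faster the similarity decays, so each support vector only influences a narrow neighborhood:

import numpy as np

def rbf_kernel_value(a, b, gamma):
    # k(a, b) = exp(-gamma * ||a - b||^2), the kernel used by svm.SVC(kernel='rbf')
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.exp(-gamma * np.sum((a - b) ** 2))

for gamma in (1, 5, 10, 20):
    print(gamma, rbf_kernel_value([5.1, 3.5], [6.0, 3.0], gamma))  # similarity shrinks as gamma grows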

  Second, regarding the decision function strategy: decision_function_shape='ovr' means one-vs-rest, that is, each category is separated from all the other categories, while decision_function_shape='ovo' means one-vs-one, that is, categories are separated pairwise and the multi-class result is assembled from binary classifiers.
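
  As a small side check (a sketch assuming the train_data/test_data variables from the code below): decision_function_shape only changes the shape of decision_function's output, not the predictions. With 'ovr' there is one score column per class; with 'ovo' there is one column per class pair. For the 3-class Iris data the two happen to coincide at 3 columns, since 3 classes give C(3,2) = 3 pairs.

from sklearn import svm

clf_ovr = svm.SVC(C=5, kernel='rbf', gamma=20, decision_function_shape='ovr').fit(train_data, train_label.ravel())
clf_ovo = svm.SVC(C=5, kernel='rbf', gamma=20, decision_function_shape='ovo').fit(train_data, train_label.ravel())
print(clf_ovr.decision_function(test_data).shape)  # (n_test_samples, 3): one score per class
print(clf_ovo.decision_function(test_data).shape)  # (n_test_samples, 3): one score per class pair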

# Specific implementation part

# Read the data set and do some simple cleaning

path = 'Iris.data'
# converters defines the data conversion: the flower-name string in column 5 is mapped to 0, 1 or 2,
# standing for the three iris categories (this is the data cleaning step)
data = np.loadtxt(path, dtype=float, delimiter=',', converters={4:Sort_dic})

# Separate the data columns from the label column

# Meaning of the np.split arguments: (array, split position ((4,) is an uncommon notation meaning the first
# four columns form one group, recorded as x, and the remaining part forms another group, recorded as y),
# axis=1 (split column-wise, cutting each row record into pieces) or 0 (split row-wise, by sample)).
x, y = np.split(data, indices_or_sections=(4,), axis=1)  # x holds the data, y holds the labels
# To make the later plotting easier, only the first two dimensions are kept.
# Without plotting, x[:, 0:4] would select all features.
x = x[:,0:2]   # Note: this slice keeps every row and only columns 0 and 1
# The Sklearn function train_test_split splits the data set into a training set and a test set by proportion
train_data, test_data, train_label, test_label = train_test_split(x, y, random_state=1, train_size=0.8,test_size=0.2)
# At this point x holds the data and y the labels (i.e. which class each sample belongs to)

# Define the SVM classifier; hopefully you still remember what the rbf kernel we covered earlier is

# The larger C is, the better the fit, but the model may overfit; gamma is the Gaussian kernel parameter,
# and decision_function_shape sets the multi-class strategy, 'ovr' being one-vs-rest.
classifier = svm.SVC(C=5, kernel='rbf', gamma=20, decision_function_shape='ovr')
# The relationships between the classifier parameters are explained in the report; they are too involved for code comments
classifier.fit(train_data, train_label.ravel())  # Train the model on the training set (ravel flattens in row-major order by default)

# Compute the accuracy of the SVC classifier

print("Training_set_score:", format(classifier.score(train_data, train_label),'.3f'))
print("Testing_set_score:", format(classifier.score(test_data, test_label),'.3f'))

# Plot the experimental results (note that only the first two features are used, which is easy to draw;
# the data set actually has four features, which would not be)

# First determine the axis ranges from the min and max of the two feature dimensions
# Range of the 1st feature (sepal length)
x1_min = x[:, 0].min()
x1_max = x[:, 0].max()
# Range of the 2nd feature (sepal width)
x2_min = x[:, 1].min()
x2_max = x[:, 1].max()
# mgrid generates the grid matrices that frame the plot
x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j]                   # Generate grid sampling points (the colored region), first expanding along x1, then along x2
grid_test = np.stack((x1.flat, x2.flat), axis=1)                            # stack() with axis=1 pairs the horizontal and vertical coordinates into test points (x1 combined with x2)
grid_value = classifier.predict(grid_test)                                    # Predict every point in this area with the trained classifier, so the class regions can be drawn
grid_value = grid_value.reshape(x1.shape)                                       # (Pitfall) Reshape the predictions to the same shape as the grid; without this the plotting below fails with an out-of-bounds error that is hard to debug
# Define two color maps (light colors for the predicted regions, dark colors for the sample points)
light_camp = matplotlib.colors.ListedColormap(['#FFA0A0', '#A0FFA0', '#A0A0FF'])
dark_camp = matplotlib.colors.ListedColormap(['r', 'g', 'b'])
fig = plt.figure(figsize=(10, 5))                                           # Set the window size
fig.canvas.set_window_title('SVM -2 feature classification of Iris')        # Set the window title
# Use pcolormesh() to display the predicted values (regions)
plt.pcolormesh(x1, x2, grid_value, cmap=light_camp)
plt.scatter(x[:, 0], x[:, 1], c=y[:, 0], s=30, cmap=dark_camp)              # Add all sample points, shown in dark colors
plt.scatter(test_data[:, 0], test_data[:, 1], c=test_label[:, 0], s=30, edgecolors='white', zorder=2,cmap=dark_camp)
# Circle the test set samples with a white edge so the hits are easier to see
# Set the chart title and the meaning of the x1 and x2 axes
plt.title('SVM -2 feature classification of Iris')
plt.xlabel('sepal length')
plt.ylabel('sepal width')
# Set the axis limits
plt.xlim(x1_min, x1_max)
plt.ylim(x2_min, x2_max)
plt.show()

  A few more words about the data visualization part. You will notice the use of a grid here, built with a numpy library function, so let me explain it in detail:

  To understand the parameters of mgrid, assume an objective function f(x, y) = x + y is to be evaluated, with x ranging over 1~3 and y over 4~6. Drawing the image takes four main steps (the explanation follows https://blog.csdn.net/u012679707/article/details/80501358; for the same data set, I borrowed the visualization approach from this predecessor and stood on the shoulders of giants):

step1: Expansion in the x direction (that is, expansion along the x axis to the right):

      [1 1 1]

      [2 2 2]

      [3 3 3]

step2: expansion in the y direction (that is, expansion down along the y axis):

      [4 5 6]

      [4 5 6]

      [4 5 6]

step3: positioning (xi, yi) (in fact, it is to merge two matrices):

    [(1,4) (1,5) (1,6)]

    [(2,4) (2,5) (2,6)]

    [(3,4) (3,5) (3,6)]

step4: Substitute each (xi, yi) into f(x, y) = x + y to get the value to plot

  Therefore, the result of x1, x2 = np.mgrid[x1_min:x1_max:200j, x2_min:x2_max:200j] can be seen as two matrices that, once combined, cover the whole area with sampling points.
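
  A minimal reproduction of the four steps above with 3 points per axis instead of 200, so the arrays stay readable:

import numpy as np

# step1/step2: mgrid expands x down the rows and y across the columns
gx, gy = np.mgrid[1:3:3j, 4:6:3j]
print(gx)  # [[1. 1. 1.] [2. 2. 2.] [3. 3. 3.]]
print(gy)  # [[4. 5. 6.] [4. 5. 6.] [4. 5. 6.]]
# step3: pair the coordinates, exactly how grid_test is built in the experiment code
points = np.stack((gx.flat, gy.flat), axis=1)
print(points[:3])  # [[1. 4.] [1. 5.] [1. 6.]]
# step4: evaluate f(x, y) = x + y on the grid
print(gx + gy)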

Analysis and conclusion of experimental results

  First, let's look at the results of the data set split comparison. Here we fix C=5, the rbf kernel, ovr mode, and gamma=10.

Chart1: Data set division comparison result

  According to the split results, when the ratio of training set to test set approaches 75~80% : 25~20%, the results start to stabilize. At the same time, to avoid a test set that is too small to be representative, the default 7.5 : 2.5 ratio can be considered to give a reasonably good level of observation (the following control groups all use 7.5 : 2.5).

  Next, we compare and observe the classifier models with different parameters:

Chart2: Classifier comparison results

  As in the earlier handwriting recognition experiment, the data processing is the same, and we focus mainly on the radial basis (RBF) kernel because of its speed; a group with a linear kernel is also added for comparison. From the numbers alone, gamma=5 seems to give a good prediction on the test set, but the test score exceeds the training set score, which we consider unreliable. This is easy to understand: if gamma is too low, the radial basis kernel underfits (and if it is too high, it overfits), so this result cannot be taken to mean that good prediction results and model parameters have been found. The same can be felt intuitively from the figure below: we have drawn the prediction regions, and they overlap poorly with our final predictions, which is underfitting.

  Therefore, what we should pay attention to is that when gamma reaches 10-20, the test set score becomes stable, and we consider gamma values in this interval reliable; this is also confirmed by the figures below. Interestingly, the added linear kernel group does not perform badly at all (either on the training set or on the test set). When a practical application requires it, I might well accept the linear kernel's slightly higher error rate in exchange for faster classification, which is often acceptable in real applications. I also mentioned this in my previous article, the handwriting recognition case report.

  Next, the prediction images of the different parameter groups are shown. On the one hand, this helps us observe the results more intuitively; on the other hand, it confirms the conclusion just reached. (The three flower categories are shown in three colors; the highlighted area is the prediction region built on the grid, that is, our model; the color of each sample point is its label color in the data set; the white-ringed points are the test set samples.)

Chart3: RBF kernel, gamma = 1 experimental result

Chart4: RBF kernel, gamma = 5 experimental result

Chart5: RBF kernel, gamma = 10 experimental result

Chart6: RBF kernel, gamma = 20 experimental result

Chart7: Linear kernel experimental result

Data set used in this experiment 

  https://download.csdn.net/download/qq_39381654/12710878
