Python wine classification based on the SVM and KNN algorithms

The wine classification problem in Python machine learning


Preface

Here comes another big assignment after the exams: data mining. The teacher offered two problems to choose from, wine classification and breast cancer diagnosis, both very classic machine learning cases, and suggested that everyone take the harder red wine classification problem. That is the one I wrote up as well, since this kind of problem is a favorite in mathematical modeling competitions.
Without further ado, let's get started!

1. Problem and goals

1. Original title

Red wine classification problem
A study obtained data on several categories of red wine and stored them in winedata.txt.
The first attribute of each sample is the category (1, 2, or 3), followed by 13 further attributes in a fixed order (list omitted here ... these are the sample attributes).
Select 100 samples from the data set as training samples, and use the rest as test samples.
Choose two classifiers for the classification (candidates include, but are not limited to, decision trees, naive Bayes classifiers, artificial neural networks, and support vector machines). Requirements:

  1. For both classifiers, try (A) classifying the data after dimensionality reduction and (B) classifying it directly.
  2. Compare the accuracy on the test samples of the four combinations (classifier 1 with dimensionality reduction, classifier 1 direct, classifier 2 with dimensionality reduction, classifier 2 direct).
  3. The experiment report should explain how the training and test samples were chosen.
  4. Apply appropriate data preprocessing so that the data suits the selected classifiers.

2. Topic analysis

Clearly this is a three-class classification problem, so classification algorithms such as support vector machines and neural networks can be used. Note, however, that the attributes provided are all numerical, with no nominal attributes, which makes algorithms such as naive Bayes and decision trees less suitable.
The requirements lead to the following analysis:

  1. How many dimensions should the data be reduced to? Generally 2 or 3; I choose 2, which makes it easy to plot and present the results.
  2. As for preprocessing, whether to standardize depends on the characteristics of the classifier.
  3. For the data split, the 100-sample training set should cover every class. Here I split proportionally; readers could also sample randomly within each class range, which is a little more complicated to implement (a sketch follows this list).
    Loading the data set shows that the first 59 samples all belong to class 1, the middle 71 to class 2, and the last 48 to class 3. Splitting in the ratio 100:178 therefore puts 33 samples of class 1, 40 of class 2, and 27 of class 3 into the training set; the remaining samples form the test set.
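For the random division mentioned in point 3, here is a minimal sketch using sklearn's train_test_split with stratification (an alternative I am only sketching; the code in this post uses the fixed proportional split instead):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()
train_X, test_X, train_Y, test_Y = train_test_split(
    wine.data, wine.target,
    train_size=100,        # exactly 100 training samples, as the problem requires
    stratify=wine.target,  # preserve the 59:71:48 class proportions in both splits
    random_state=0)        # fixed seed so the split is reproducible
print(train_X.shape, test_X.shape)  # (100, 13) (78, 13)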

2. Introduction to the algorithm

The selected SVM and KNN models represent two extremes. In machine learning, the support vector machine is known for its robustness and powerful core algorithm, while KNN is recognized as the simplest machine learning algorithm, with a very simple underlying principle. The Support Vector Machine (SVM) is a machine learning method developed from the VC-dimension theory of statistical learning theory and the principle of structural risk minimization; its kernel functions and maximum-margin separating hyperplane are what make SVM so powerful.
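As a quick illustration of these two ideas (a sketch only, separate from the assignment code later in this post), sklearn's SVC exposes the kernel choice directly:

from sklearn.svm import SVC

clf_linear = SVC(kernel='linear')  # maximum-margin separating hyperplane in the original feature space
clf_rbf = SVC(kernel='rbf')        # kernel trick: implicit mapping into a higher-dimensional space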
The principle of KNN is simple: to predict a new point x, look at the K nearest points and assign x to the category most of them belong to. Reading the source code of the relevant Python modules confirms that KNN does almost nothing while building the model, because KNN is a non-parametric, lazy learning algorithm.
Focusing on KNN, here is the introduction from Baidu Encyclopedia:

Core idea
The core idea of the KNN algorithm is that if most of the K nearest samples to a given sample in feature space belong to a certain category, then that sample also belongs to this category and shares the characteristics of the samples in it. The method determines the category of the sample to be classified using only the categories of the nearest one or few samples, so the decision depends on a very small number of neighboring samples. Because KNN relies mainly on the limited surrounding samples rather than on discriminating class regions, it is more suitable than other methods for sample sets whose class regions cross or overlap considerably [3].
Algorithm flow
In general, the KNN classification algorithm involves the following 4 steps:
① Prepare the data and preprocess it.
② Compute the distance from the test sample point (i.e. the point to be classified) to every other sample point.
③ Sort the distances and select the K points with the smallest distances.
④ Compare the categories of these K points and, by majority vote, assign the test sample point to the category with the highest proportion among them.
Advantages
The KNN method is conceptually simple, easy to understand and implement, and requires no parameter estimation and no training.
Disadvantages
The main disadvantage in classification is that when the samples are imbalanced, for example when one class has a very large sample size and the others are small, the large-capacity class may dominate the K neighbors of a newly input sample.
Another shortcoming is the large amount of computation: for every sample to be classified, the distance to all known samples must be computed in order to find its K nearest neighbors.
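To make the four steps concrete, here is a minimal NumPy sketch of the idea (an illustration of my own, not the sklearn KNeighborsClassifier used in the assignment code below):

import numpy as np

def knn_predict(train_X, train_Y, x, k=3):
    dists = np.linalg.norm(train_X - x, axis=1)  # step ②: distance from x to every training point
    nearest = np.argsort(dists)[:k]              # step ③: indices of the k closest points
    labels, counts = np.unique(train_Y[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # step ④: majority vote among the k neighbors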

3. Code implementation

1. Algorithm flow framework

(Flow chart omitted: load the data, standardize it, reduce it with PCA, split it into training and test sets, then train and evaluate the SVM and KNN classifiers.)

2. Third-party library calls

import numpy as np
from sklearn import svm
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

Introduction to the third-party libraries:
1. SVC in svm implements the multi-class classifier;
2. load_wine imports the data set. Unexpectedly, the data set the teacher provided is the same as the one in this library, except that the teacher renumbered the labels from 0-2 to 1-3;
3. PCA implements the dimensionality reduction (a quick variance check is sketched after this list);
4. matplotlib.pyplot draws the scatter plot;
5. accuracy_score outputs the accuracy of the predictions, and classification_report generates a classification report (including F-score, recall, etc.);
6. StandardScaler standardizes the data;
7. KNeighborsClassifier implements the K-nearest-neighbors algorithm.
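One detail worth checking when reducing to 2 dimensions is how much of the total variance the two principal components actually retain. A small sketch, assuming the pca object fitted in the main program below:

print(pca.explained_variance_ratio_)        # variance explained by PC1 and PC2
print(pca.explained_variance_ratio_.sum())  # total fraction of variance retained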

3. Source code

def pca_show(pca_data, target):  # plot the data after PCA dimensionality reduction
    color = ['r', 'g', 'b']  # point colors per class
    marker = ['s', 'x', 'o']  # point markers per class
    for lb, c, m in zip(np.unique(target), color, marker):  # plot the points of each class
        plt.scatter(pca_data[target == lb, 0],
                    pca_data[target == lb, 1],
                    c=c, label=lb, marker=m)
    plt.title('Result')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.legend(loc='upper right')
    plt.show()


def data_split(data, target):
    """
    The first 59 samples are all class 1, the middle 71 are class 2, and the last 48 are class 3.
    The index ranges are 0:58, 59:129, 130:177 (both ends inclusive).
    Splitting in the ratio 100:178, class 1 gets 59*100/178 = 33 training samples;
    likewise class 2 gets 40 and class 3 gets 27.
    So the training index ranges are 0:33, 59:99, 130:157 (half-open)
    and the test index ranges are 33:59, 99:130, 157:178 (half-open).
    """

    def cell_concatenate(data_tuple):  # concatenate the arrays in data_tuple
        return np.concatenate(data_tuple, axis=0)

    # split the data
    train_data = cell_concatenate((data[0:33, :], data[59:99, :], data[130:157, :]))  # training set
    train_target = cell_concatenate((target[0:33], target[59:99], target[130:157]))  # training labels
    test_data = cell_concatenate((data[33:59, :], data[99:130, :], data[157:, :]))  # test set
    test_target = cell_concatenate((target[33:59], target[99:130], target[157:]))  # test labels
    # print(train_data.shape)
    # print(test_data.shape)
    return train_data, train_target, test_data, test_target
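
# Usage sketch: a quick sanity check of the split sizes, assuming the wine data
# is loaded as in the main program below.
# wine = load_wine()
# tr_X, tr_Y, te_X, te_Y = data_split(wine['data'], wine['target'])
# print(tr_X.shape, te_X.shape)  # expected: (100, 13) (78, 13)
# print(np.bincount(tr_Y))       # expected per-class counts: [33 40 27]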


def svm_classifier(train_X, train_Y, test_X, test_Y, title):
    print("SVM classifier,", title)
    svm_clf = svm.SVC(kernel='linear', C=1000.)  # linear kernel with a large penalty C
    svm_clf.fit(train_X, train_Y)  # train on the training set
    predict_Y = svm_clf.predict(test_X)  # predict on the test set
    print("Test accuracy: {:.3f}%".format(accuracy_score(test_Y, predict_Y) * 100))
    print(classification_report(test_Y, predict_Y))
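
# Optional sketch: rather than fixing C=1000, C could be tuned by cross-validation
# on the training set, e.g.:
# from sklearn.model_selection import GridSearchCV
# grid = GridSearchCV(svm.SVC(kernel='linear'), {'C': [0.1, 1, 10, 100, 1000]}, cv=5)
# grid.fit(train_X, train_Y)
# print(grid.best_params_)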


def knn_classifier(train_X, train_Y, test_X, test_Y, title):
    print("KNN classifier,", title)
    knn = KNeighborsClassifier(algorithm='auto', leaf_size=10, metric='minkowski',
                               metric_params=None, n_jobs=1, n_neighbors=2, p=2,
                               weights='uniform')  # Minkowski metric with p=2 is Euclidean distance
    knn.fit(train_X, train_Y)  # store the training set (lazy learning)
    predict_Y = knn.predict(test_X)  # predict on the test set
    print("Test accuracy: {:.3f}%".format(accuracy_score(test_Y, predict_Y) * 100))
    print(classification_report(test_Y, predict_Y))
    # if "PCA" in title:
    #     pca_show(test_X, predict_Y)
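
# Optional sketch: n_neighbors=2 is fixed above; k could instead be chosen by
# cross-validation on the training set, e.g.:
# from sklearn.model_selection import cross_val_score
# for k in range(1, 11):
#     s = cross_val_score(KNeighborsClassifier(n_neighbors=k), train_X, train_Y, cv=5).mean()
#     print(k, round(s, 3))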


if __name__ == '__main__':
    wine_dataset = load_wine()  # load the wine data set: a dict-like object with the samples under 'data' and the labels under 'target'
    print("Initialization finished")
    sc = StandardScaler()  # standardize the data
    wine_data_std = sc.fit_transform(wine_dataset['data'])
    pca = PCA(n_components=2)  # PCA reduction to 2 dimensions
    wine_data_pca = pca.fit_transform(wine_data_std)  # fit the PCA and transform the data in one step
    pca_show(wine_data_pca, wine_dataset['target'])  # visualize the reduced data

    # training/test split of the raw data
    train_X, train_Y, test_X, test_Y = data_split(wine_dataset['data'], wine_dataset['target'])
    # training/test split of the standardized data
    train_X_std, train_Y_std, test_X_std, test_Y_std = data_split(wine_data_std, wine_dataset['target'])
    # training/test split of the PCA-reduced data
    pca_train_X, pca_train_Y, pca_test_X, pca_test_Y = data_split(wine_data_pca, wine_dataset['target'])

    svm_classifier(train_X, train_Y, test_X, test_Y, title="raw data")
    svm_classifier(train_X_std, train_Y_std, test_X_std, test_Y_std, title="standardized data")
    svm_classifier(pca_train_X, pca_train_Y, pca_test_X, pca_test_Y, title="PCA-reduced data")
    knn_classifier(train_X, train_Y, test_X, test_Y, title="raw data")
    knn_classifier(train_X_std, train_Y_std, test_X_std, test_Y_std, title="standardized data")
    knn_classifier(pca_train_X, pca_train_Y, pca_test_X, pca_test_Y, title="PCA-reduced data")

Summary

Overall, this case is not difficult. As long as you have mastered the basic principles of the algorithms and can call the third-party libraries that implement them, learning the APIs from the official documentation or the libraries' source code, the main trouble lies in the data processing and the comparative analysis of the results. Those parts readers need to complete on their own; I only provide most of the usable functions and the overall framework, and the details are left for you to fill in.
You are welcome to discuss and exchange ideas with me, and I hope you will point out and correct any mistakes. If you have read this far, please give it a like; it means a lot to me!

