PyTorch Deep Learning Practice | Iris Classification Based on Logistic Regression, Decision Tree and SVM

The iris data set (Iris Data Set) is a classic classification data set in machine learning, and it can be loaded directly with the sklearn library. It contains 150 rows, each consisting of 4 feature values and a label. The labels are three species of iris: Iris Setosa, Iris Versicolour, and Iris Virginica.
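
As a quick check of the description above, the following minimal sketch loads the data with sklearn's load_iris and prints its shape, feature names, class names, and the number of samples per class. The variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)                   # (150, 4): 150 rows, 4 features
print(iris.feature_names)        # names of the 4 features
print(list(iris.target_names))   # names of the 3 classes
print(np.bincount(y))            # number of samples in each class
```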

Many machine learning algorithms support multi-class classification tasks. This article uses several of them (logistic regression, decision trees, and SVM) to complete this task and compares the results.

01. Using logistic regression to classify irises

As mentioned earlier, logistic regression is used for binary classification, and its extension handles multi-class tasks as well. The following uses the sklearn library to build a logistic-regression-based iris classifier. As shown in Code Listing 1, the first step is to import the required packages, load the dataset with load_iris from sklearn.datasets, and randomly split it into a training set and a test set, with 20% of the data held out for testing (test_size=0.2).

Code Listing 1 Import packages and load the dataset

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import label_binarize
from sklearn.metrics import confusion_matrix, precision_score, accuracy_score,recall_score, f1_score, roc_auc_score, \
    roc_curve
import matplotlib.pyplot as plt
 
# Load the dataset
def loadDataSet():
    iris_dataset = load_iris()
    X = iris_dataset.data
    y = iris_dataset.target
    # Split the data into a training set and a test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, X_test, y_train, y_test
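
Because the split in Listing 1 is random, every run produces a different training and test set. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. The function name load_split and the seed value are illustrative choices, not part of the original code. The stratify argument keeps the three classes equally represented in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def load_split(test_size=0.2, seed=0):
    """Hypothetical variant of loadDataSet with a fixed seed and a stratified split."""
    X, y = load_iris(return_X_y=True)
    # random_state makes the split repeatable; stratify=y preserves class proportions
    return train_test_split(X, y, test_size=test_size,
                            random_state=seed, stratify=y)

X_train, X_test, y_train, y_test = load_split()
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # [10 10 10]
```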

Listing 2 defines a function that trains the logistic regression model.

Code Listing 2 Training the logistic regression model

# Train the logistic regression model
def trainLS(x_train, y_train):
    # Create and fit the logistic regression model
    clf = LogisticRegression()
    clf.fit(x_train, y_train)
    return clf

The logistic regression model is relatively simple, and training can start without setting any additional hyperparameters. As shown in Listing 2, the function initializes the model, trains it on the training set, and returns the trained model.
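
Although the defaults are enough to get started, a few settings are often adjusted in practice. The sketch below assumes that standardizing the features and raising max_iter is desirable, which the original article does not do. The function name train_scaled_lr and the parameter values are illustrative only. By default, LogisticRegression applies L2 regularization with C=1.0 and max_iter=100.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_scaled_lr(x_train, y_train, C=1.0, max_iter=1000):
    """Hypothetical alternative to trainLS: scale the features, then fit."""
    # The scaler is fitted on the training data only and reused at prediction time
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(C=C, max_iter=max_iter))
    clf.fit(x_train, y_train)
    return clf

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)
model = train_scaled_lr(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out test set
```

The returned pipeline exposes predict and predict_proba like a plain classifier, so it could be passed to the test function of the article unchanged.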

Code Listing 3 Test the model and print the evaluation metrics

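# Test the model
def test(model, x_test, y_test):
    # Convert the labels to one-hot form
    # (classes must be passed by keyword in recent sklearn versions)
    y_one_hot = label_binarize(y_test, classes=np.arange(3))
    # Predicted labels
    y_pre = model.predict(x_test)
    # Predicted class probabilities
    y_pre_pro = model.predict_proba(x_test)

    # Confusion matrix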
    con_matrix = confusion_matrix(y_test, y_pre)
    print('confusion_matrix:\n', con_matrix)
    print('accuracy:{}'.format(accuracy_score(y_test, y_pre)))
    print('precision:{}'.format(precision_score(y_test, y_pre, average='micro')))
    print('recall:{}'.format(recall_score(y_test, y_pre, average='micro')))
    print('f1-score:{}'.format(f1_score(y_test, y_pre, average='micro')))
 
    # Draw the ROC curve
    drawROC(y_one_hot, y_pre_pro)

To make it easy to draw the ROC curve later, the test-set labels are first converted to one-hot form, and the predicted class probabilities on the test set (y_pre_pro) are obtained from predict_proba. Both are then passed to the drawROC function, which draws the ROC curve. The function also prints the confusion matrix and computes accuracy, precision, recall, and F1-score.
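
The metrics above are computed with average='micro', which pools the true positives, false positives, and false negatives of all classes before dividing. The following sketch works this out by hand in plain Python. The confusion matrix is a made-up example, not the article's result. It illustrates that for single-label multi-class problems, micro-averaged precision, recall, and F1-score all equal the accuracy.

```python
def micro_metrics(conf):
    """conf[i][j] = number of samples of true class i predicted as class j."""
    n = len(conf)
    total = sum(sum(row) for row in conf)
    tp = sum(conf[k][k] for k in range(n))
    # False positives of class k: predicted k, but true class differs
    fp = sum(conf[i][k] for k in range(n) for i in range(n) if i != k)
    # False negatives of class k: true class k, but predicted differently
    fn = sum(conf[k][j] for k in range(n) for j in range(n) if j != k)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = tp / total
    return accuracy, precision, recall, f1

# Hypothetical confusion matrix for a 30-sample test set
conf = [[10, 0, 0],
        [0, 9, 1],
        [0, 2, 8]]
accuracy, precision, recall, f1 = micro_metrics(conf)
print(accuracy, precision, recall, f1)  # all four are (approximately) 0.9
```

Per-class behaviour is hidden by micro-averaging. Passing average='macro' or average=None to the sklearn metric functions is one way to see it.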

Code Listing 4 Draw the ROC curve

def drawROC(y_one_hot, y_pre_pro):
    # AUC value
    auc = roc_auc_score(y_one_hot, y_pre_pro, average='micro')
    # Draw the ROC curve
    fpr, tpr, thresholds = roc_curve(y_one_hot.ravel(), y_pre_pro.ravel())
    plt.plot(fpr, tpr, linewidth=2, label='AUC=%.3f' % auc)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1.1, 0, 1.1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.show()
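
The micro-averaged ROC in drawROC flattens the one-hot labels and the probability matrix with ravel(), so each (sample, class) pair becomes one binary decision. The plain-Python sketch below illustrates this with made-up labels and probabilities. It computes the AUC as the fraction of positive/negative pairs in which the positive entry has the higher score, with ties counting as one half. The helper names are illustrative only.

```python
def one_hot(labels, n_classes):
    """Turn class indices into rows of 0/1 indicators."""
    return [[1 if c == y else 0 for c in range(n_classes)] for y in labels]

def flatten(rows):
    """Row-major flattening, like ndarray.ravel()."""
    return [v for row in rows for v in row]

def auc_from_pairs(targets, scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for t, s in zip(targets, scores) if t == 1]
    neg = [s for t, s in zip(targets, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test labels and predicted class probabilities
y_true = [0, 1, 2, 1]
proba = [[0.8, 0.1, 0.1],
         [0.2, 0.6, 0.2],
         [0.1, 0.3, 0.6],
         [0.3, 0.3, 0.4]]

targets = flatten(one_hot(y_true, 3))   # 12 binary labels
scores = flatten(proba)                 # 12 matching scores
micro_auc = auc_from_pairs(targets, scores)
print(micro_auc)  # 0.9375
```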

Code Listing 4 shows the code that draws the ROC curve. Finally, the main function connects the whole process of loading the dataset, training the model, and evaluating the model, as shown in Listing 5.

Code Listing 5 main function settings

if __name__ == '__main__':
    X_train, X_test, y_train, y_test = loadDataSet()
    model = trainLS(X_train, y_train)
    test(model, X_test, y_test)

Put all of the code above in the same .py script file and run it. The final output is shown in Figure 1.

Figure 1 Test results printed from the command line

The drawn ROC curve is shown in Figure 2.

Figure 2 ROC curve

Logistic regression is a relatively simple model with few parameters and is generally used for simple classification tasks. When the task is more complex, a more complex model may give better results. The following sections apply different models to the same task to compare their performance.

02. Using a decision tree to classify irises

Since only the model changes, the code for loading the dataset and evaluating the model stays the same. As shown in Listing 6, a new function is added to train the decision tree model.

Code Listing 6 Training with the decision tree model

from sklearn import tree
# Train the decision tree model
def trainDT(x_train, y_train):
    # Create and fit the decision tree
    clf = tree.DecisionTreeClassifier(criterion="entropy")
    clf.fit(x_train, y_train)
    return clf

At the same time, modify the training function called in the main function, as shown in Listing 7.

Code Listing 7 Modify the content of the main function

if __name__ == '__main__':
    X_train, X_test, y_train, y_test = loadDataSet()
    model = trainDT(X_train, y_train)
    test(model, X_test, y_test)

Finally, running the script prints the output shown in Figure 3.

Figure 3 Decision tree model prediction results

The ROC curve is shown in Figure 4.

Figure 4 ROC curve of the decision tree model

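In the results shown, the decision tree model scores higher than the logistic regression model on every metric. Note, however, that the scores change from run to run: train_test_split is called without a fixed random_state, so each run uses a different training and test set. For a given training set the logistic regression fit is deterministic, so the fluctuation seen over repeated runs comes from the split rather than from model initialization. DecisionTreeClassifier can itself vary slightly between runs unless random_state is set, because ties between equally good splits are broken at random. Fixing both seeds makes the comparison repeatable.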
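
One way to examine where run-to-run fluctuation comes from is to keep the model configuration fixed and vary only the seed of the split. The sketch below does this for both models. The function name accuracy_over_splits, the seed range, and max_iter=1000 are illustrative assumptions, not settings from the original code.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def accuracy_over_splits(make_model, seeds):
    """Test accuracy of a freshly built model for each split seed."""
    X, y = load_iris(return_X_y=True)
    scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        scores.append(make_model().fit(X_tr, y_tr).score(X_te, y_te))
    return scores

seeds = range(5)
lr_scores = accuracy_over_splits(
    lambda: LogisticRegression(max_iter=1000), seeds)
dt_scores = accuracy_over_splits(
    lambda: DecisionTreeClassifier(criterion="entropy", random_state=0), seeds)

print("logistic regression:", lr_scores)
print("decision tree:      ", dt_scores)
```

With the seeds fixed, rerunning the script reproduces the same lists, and any differences between the entries of one list reflect the split alone.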

03. Using SVM to classify irises

By now, it should be clear how to modify the code to make predictions with an SVM model. The training code for the SVM model is shown in Listing 8.

Code Listing 8 Training with the SVM model

# Train the SVM model
from sklearn import svm
def trainSVM(x_train, y_train):
    # Create and fit the SVM
    clf = svm.SVC(kernel='rbf', probability=True)
    clf.fit(x_train, y_train)
    return clf

At the same time, modify the main function, as shown in Listing 9.

Code Listing 9 Modify the content of the main function

if __name__ == '__main__':
    X_train, X_test, y_train, y_test = loadDataSet()
    model = trainSVM(X_train, y_train)
    test(model, X_test, y_test)

The output of the program is shown in Figure 5.

Figure 5 Prediction results using the SVM model

The drawn ROC curve is shown in Figure 6.

Figure 6 ROC curve drawn using the SVM model

As the models become more complex, the evaluation metrics improve further. Of the three models, the SVM with a Gaussian (RBF) kernel gives the best results on the test set in this experiment and shows no sign of overfitting, so the SVM model can be selected for the iris classification task.
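
A single 30-sample test set gives a fairly noisy estimate, so a comparison like this is often backed up with cross-validation. The following is a minimal sketch, assuming 5-fold stratified cross-validation with fixed seeds. The fold count, seeds, and max_iter value are illustrative choices rather than settings from the article, and the scores it prints are not the article's results.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The same folds are used for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "svm (rbf)": SVC(kernel="rbf"),
}

# cross_val_score returns one accuracy value per fold
results = {name: cross_val_score(model, X, y, cv=cv)
           for name, model in models.items()}

for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```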


Origin blog.csdn.net/qq_41640218/article/details/130053658