Machine Learning: Confusion Matrix Code

1. Import packages:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, f1_score

2. Import data: 

# Load the data
file = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
                 header=None)
df = file
# Select all rows and every column from column 2 onward, then convert them to a NumPy array
X = df.loc[:, 2:].values
# Column 1 holds the diagnosis label (M or B)
y = df.loc[:, 1].values

le = LabelEncoder()
y = le.fit_transform(y)  # encode the class labels as integers

df is a DataFrame object, and loc is an indexer used to select data by label. [:, 2:] selects all rows (the bare colon means all rows) and every column from column 2 through the last column. values converts the selected data to a NumPy array.
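
The selection described above can be tried on a small table. This is a minimal sketch with a hypothetical three-row DataFrame (the name toy and its values are made up for illustration) laid out like wdbc.data: an ID, a diagnosis letter, then numeric features. Because no header is given, pandas labels the columns 0, 1, 2, and so on, and loc slices by those labels.

```python
import pandas as pd

# Hypothetical 3-row table laid out like wdbc.data: ID, diagnosis, then features.
# Without column names, pandas labels the columns 0, 1, 2, 3.
toy = pd.DataFrame([
    [101, 'M', 17.9, 10.3],
    [102, 'B', 13.5, 14.3],
    [103, 'M', 20.5, 17.7],
])

X_toy = toy.loc[:, 2:].values   # all rows, columns labeled 2 and onward
y_toy = toy.loc[:, 1].values    # all rows, the column labeled 1

print(X_toy.shape)  # (3, 2)
print(y_toy)        # ['M' 'B' 'M']
```

Note that loc slices are label-based and include the end label, unlike ordinary Python slices.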

LabelEncoder() is a class in the scikit-learn library for label encoding. It converts a set of text or categorical labels into consecutive integers starting from 0. fit_transform() fits and transforms the original labels y: the fitting step builds the mapping from the labels that appear in the data, and the transformation step converts each original label into its encoded value, so that y ends up holding the encoded labels. fit_transform() performs both steps at once; if the encoder is already fitted and you only need to convert, use transform(). Note that LabelEncoder() encodes a single column and is intended for target labels. If you have many categorical feature columns, consider OneHotEncoder() or another encoder instead.

Example code for the LabelEncoder class:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Suppose we have a list of labels
labels = ['cat', 'dog', 'cat', 'bird', 'dog']

# Encode the labels with LabelEncoder
encoded_labels = le.fit_transform(labels)

print(encoded_labels)
# [1 2 1 0 2]  (classes are sorted: bird=0, cat=1, dog=2)
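
To illustrate the difference between fitting and transforming, here is a small sketch that fits the encoder once and then reuses the learned mapping. The variable names le_demo and new_codes are chosen only for this example. LabelEncoder stores the distinct labels in sorted order in classes_, and the position of a label in that array is its integer code.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
le_demo.fit(['cat', 'dog', 'cat', 'bird', 'dog'])  # learn the mapping only

print(le_demo.classes_)                       # ['bird' 'cat' 'dog']

# transform() reuses the learned mapping without refitting
new_codes = le_demo.transform(['dog', 'bird'])
print(new_codes)                              # [2 0]

# inverse_transform() maps integer codes back to the original labels
print(le_demo.inverse_transform(new_codes))   # ['dog' 'bird']
```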

3. Split the data into a training set and a test set

# Split into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
# Build the pipeline
pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
pipe_svc.fit(X_train, y_train)
y_pred = pipe_svc.predict(X_test)

train_test_split is a function in the scikit-learn library that divides a data set into a training set and a test set. It shuffles the data and splits it randomly according to a specified ratio or sample count. The random split does not by itself preserve the class proportions; pass stratify=y if both subsets should keep the class distribution of y.

In the code, X is the feature data set and y is the corresponding label data set. test_size=0.20 splits the data into an 80% training set and a 20% test set. random_state=1 sets the random seed so that the split is the same on every run.

The results are stored in the four variables X_train, X_test, y_train, and y_test, which hold the training set features, the test set features, the training set labels, and the test set labels, respectively.
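
The following sketch shows the split on hypothetical toy data (20 samples, 10 per class; the names X_demo and y_demo are made up for this example). It also passes stratify, which the article's code does not use, to show how the class proportions can be kept equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 2 features, 10 samples per class
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 10 + [1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo)

print(X_tr.shape, X_te.shape)  # (16, 2) (4, 2)
print(np.bincount(y_te))       # [2 2]  two test samples from each class

# The same random_state reproduces exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo)
print(np.array_equal(X_te, X_te2))  # True
```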

Pipeline is a class in the scikit-learn library that creates a pipeline object chaining multiple data processing steps together. A pipeline object organizes several data preprocessing and model training steps into a single unit, which simplifies writing and managing machine learning workflows. It offers the following features:

①Data pipeline: A pipeline combines multiple data processing steps in sequence. Each intermediate step is a data preprocessing operation, such as feature scaling, feature selection, or feature extraction, and the final step is typically the model.

②Parameter management: A pipeline exposes the parameters of every step through one interface. Each parameter is addressed as the step name, two underscores, and the parameter name (for example clf__C), so the scaling and model training steps can be configured or tuned together instead of adjusting each step by hand.

③Code simplification: A pipeline integrates multiple processing steps into one object, simplifying code writing and maintenance. By calling the methods of the pipeline object, you can run the entire process at once without calling each step separately.

④Model persistence: The entire pipeline can be persisted as a single model object and saved to disk, which is convenient for later deployment and use.
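
The features listed above can be sketched in a few lines. This example uses hypothetical random data (X_demo, y_demo) rather than the article's data set, and it uses the standard library's pickle for persistence, which is an assumption made for this sketch; the article does not say which persistence tool it has in mind.

```python
import pickle
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: 40 samples, 3 features, label depends on the first feature
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])

# Parameters of any step are addressed as <step name>__<parameter name>
pipe.set_params(clf__C=10.0)
print(pipe.get_params()['clf__C'])  # 10.0

# A single fit call runs every step in order
pipe.fit(X_demo, y_demo)

# The fitted pipeline can be serialized and restored as one object
restored = pickle.loads(pickle.dumps(pipe))
print((restored.predict(X_demo) == pipe.predict(X_demo)).all())  # True
```

To save to disk instead of to a bytes object, pickle.dump with an open file works the same way.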

In the code, the first step of the pipeline is StandardScaler(); "scl" is the name given to this preprocessing step, which scales the data. StandardScaler is a preprocessor that standardizes data: it rescales each feature by subtracting its mean and dividing by its standard deviation, so that every feature has mean 0 and variance 1. Standardization does not change the shape of a feature's distribution, so the result is not necessarily normally distributed. The purpose of this step is to improve model training by putting all features on a comparable scale.
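
As a worked example of standardization, the sketch below compares StandardScaler with the formula z = (x - mean) / std applied column by column. The array X_demo is hypothetical and chosen so that the two features have very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X_demo)

# Same result by hand, per column, using the population standard deviation
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_std, X_manual))  # True
print(X_std.mean(axis=0))            # approximately 0 for each feature
print(X_std.std(axis=0))             # approximately 1 for each feature
```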

The second step of the pipeline is SVC(random_state=1), the support vector machine classifier. The default parameters are used here; random_state=1 sets a random seed so that results are consistent between runs (in SVC the seed only affects the probability estimates, which are enabled with probability=True). "clf" stands for classifier and is the name of this step. The classifier's methods and attributes can be accessed through that name, for example pipe_svc.named_steps['clf'].
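
A short sketch of how the step names give access to the fitted objects inside a pipeline. The data here is hypothetical random data, and pipe is a stand-in for the article's pipe_svc.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: 40 samples, 3 features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
pipe.fit(X_demo, y_demo)

clf = pipe.named_steps['clf']   # the fitted SVC instance
print(type(clf).__name__)       # SVC
print(clf.kernel)               # rbf (the SVC default kernel)
print(clf.n_support_)           # number of support vectors for each class

scl = pipe.named_steps['scl']   # the fitted StandardScaler
print(scl.mean_.shape)          # (3,) one learned mean per feature
```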

pipe_svc.fit(X_train, y_train) calls the fit method of the pipeline object, which trains the model on the training set X_train and the corresponding labels y_train. Next, the trained model predicts on the test set X_test to obtain the predictions y_pred. The test set features go through the same standardization as the training set, using the mean and standard deviation learned from the training set.
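
To make the last point concrete, the sketch below checks that the pipeline gives the same predictions as doing the steps by hand: fit the scaler on the training set only, apply it to both sets, and train the SVC on the scaled training data. As an assumption to avoid the download, it uses load_breast_cancer, scikit-learn's bundled copy of the Wisconsin diagnostic breast cancer data. Its label encoding (0 = malignant, 1 = benign) differs from what LabelEncoder produces for the M and B letters in the article, so the numbers are not directly comparable.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, test_size=0.20, random_state=1)

# Pipeline version: scaling and classification in one object
pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
pipe.fit(X_tr, y_tr)
pred_pipe = pipe.predict(X_te)

# Manual version: the scaler is fitted on the training set only,
# then the same learned mean and standard deviation are applied to the test set
scaler = StandardScaler().fit(X_tr)
svc = SVC(random_state=1).fit(scaler.transform(X_tr), y_tr)
pred_manual = svc.predict(scaler.transform(X_te))

print(np.array_equal(pred_pipe, pred_manual))  # True
```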

4. Confusion matrix, recall, precision, and F1 score

# Compute and visualize the confusion matrix
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)  # compute the confusion matrix
print(confmat)
fig, ax = plt.subplots(figsize=(2.5, 2.5))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()
# Precision, recall, and F1
print('precision:%.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('recall:%.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1:%.3f' % f1_score(y_true=y_test, y_pred=y_pred))

confusion_matrix is a function in the scikit-learn library that computes the confusion matrix of a classification model, here on the test set. In the code, the y_true parameter holds the true label values and the y_pred parameter holds the label values predicted by the model. Typically, the true labels are those of the test set and the predicted labels are the model's predictions on the test set. In the returned matrix, rows correspond to true labels and columns to predicted labels.
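
The relation between the confusion matrix and the three scores can be checked on a small hand-made example. The label arrays below are hypothetical; for binary labels 0 and 1, the matrix entries are, in order, true negatives, false positives, false negatives, and true positives.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels for 10 samples
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

cm = confusion_matrix(y_true=y_true_demo, y_pred=y_pred_demo)
print(cm)
# [[4 1]
#  [2 3]]

tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # 3 / 4 = 0.75
recall = tp / (tp + fn)      # 3 / 5 = 0.60
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values match the scikit-learn functions
print(np.isclose(precision, precision_score(y_true_demo, y_pred_demo)))  # True
print(np.isclose(recall, recall_score(y_true_demo, y_pred_demo)))        # True
print(np.isclose(f1, f1_score(y_true_demo, y_pred_demo)))                # True
```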

plt.subplots(figsize=(2.5, 2.5)) creates a figure and an axes object with a figure size of 2.5 x 2.5 inches, and ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3) draws the confusion matrix on the axes. cmap=plt.cm.Blues selects a blue colormap, and alpha=0.3 sets the transparency of the drawing.

Two nested loops traverse each element of the confusion matrix, and ax.text() adds a text label at the corresponding position. x=j and y=i set the position of the label, and s=confmat[i, j] sets its content, that is, the corresponding element of the confusion matrix. va='center' and ha='center' center the label in its cell.
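
As an alternative to drawing the matrix by hand, scikit-learn provides ConfusionMatrixDisplay, which adds the cell values and axis labels itself. This is not the approach used in the article, only a sketch of a swapped-in helper. The matrix cm_demo is hypothetical, and the Agg backend is selected on the assumption that the code may run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay

# Hypothetical confusion matrix: rows are true labels, columns are predicted labels
cm_demo = np.array([[4, 1],
                    [2, 3]])

fig, ax = plt.subplots(figsize=(2.5, 2.5))
disp = ConfusionMatrixDisplay(confusion_matrix=cm_demo)
disp.plot(ax=ax, cmap="Blues")  # draws the cells and writes each count in its cell

print(disp.text_.shape)  # (2, 2) one text label per cell
```

In an interactive session, calling plt.show() afterwards displays the figure.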

5. Results

 

Origin blog.csdn.net/weixin_43961909/article/details/131969516