1. Import packages:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, f1_score
2. Import data:
# Load the data
file = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
                   header=None)
df = file
# Take every row of the columns labelled 2 and above and convert them to a NumPy array
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)  # encode the class labels as integers
df is a DataFrame object, and loc is an indexer that selects data by label. [:, 2:] selects all rows (the bare colon means every row) and all columns from the column labelled 2 through the last column. values converts the selected data to a NumPy array.
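As a minimal sketch of the indexing described above, the following uses a small made-up DataFrame (the values are hypothetical and do not come from the WDBC file) with integer column labels, like those pandas assigns when header=None is used.

```python
import pandas as pd

# A tiny stand-in for the real file: integer column labels 0..3,
# like those pandas assigns when header=None is used.
toy = pd.DataFrame([[101, 'M', 1.5, 2.5],
                    [102, 'B', 3.5, 4.5]])

X_toy = toy.loc[:, 2:].values   # all rows, column labels 2 and up
y_toy = toy.loc[:, 1].values    # all rows, column label 1

print(X_toy.shape)   # (2, 2)
print(y_toy)         # ['M' 'B']
```

Because loc works on labels, the slice 2: refers to the column label 2, which here happens to coincide with the third position.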
LabelEncoder is a class in the scikit-learn library for label encoding. It converts a set of text or categorical labels into consecutive integers starting from 0. The fit_transform() method fits and transforms the original labels y. The fitting step builds the encoding map automatically from the labels that appear in the data. The transform step converts each original label into its encoded value. In the end, the variable y contains the encoded label values. fit_transform() performs both fitting and transformation. If you only need to convert labels with an encoder that has already been fitted, transform() can be used. Additionally, LabelEncoder encodes a single column. If you have many categorical feature columns, you may want to consider OneHotEncoder() or other encoding methods.
Example code for the LabelEncoder class:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Suppose we have a list of labels
labels = ['cat', 'dog', 'cat', 'bird', 'dog']
# Encode the labels with LabelEncoder
encoded_labels = le.fit_transform(labels)
print(encoded_labels)
# [1 2 1 0 2]  (classes are sorted: bird=0, cat=1, dog=2)
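The difference between fitting and transforming can be sketched as follows. The label list is a hypothetical stand-in for the diagnosis column, which holds the strings M and B. The classes_ attribute shows the mapping that was learned, and inverse_transform() maps codes back to the original labels.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['M', 'B', 'B', 'M'])   # fit the mapping, then encode

print(le.classes_)                    # learned classes, in sorted order
print(codes)
print(le.transform(['B', 'M']))       # reuse the fitted mapping on new labels
print(le.inverse_transform([1, 0]))   # map codes back to the original labels
```

Since the classes are sorted before codes are assigned, B is encoded as 0 and M as 1.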
3. Split the data into a training set and a test set
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
# Build the pipeline
pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
pipe_svc.fit(X_train, y_train)
y_pred = pipe_svc.predict(X_test)
train_test_split is a function in the scikit-learn library that divides a data set into a training set and a test set. It splits the data randomly according to a specified ratio or sample count. The class distribution is preserved in both parts only if the stratify argument is passed, which this code does not do.
In the code, X is the feature data set and y is the corresponding label data set. test_size=0.20 splits the data into an 80% training set and a 20% test set. random_state=1 sets the random seed so that the split is the same on every run.
The results are stored in the four variables X_train, X_test, y_train and y_test, which hold the training set features, test set features, training set labels, and test set labels, respectively.
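A small sketch of the split on made-up data (the arrays below are hypothetical, not the WDBC data). The stratify argument is an optional extra that the code above does not use; it is shown here only to illustrate how the class ratio can be kept equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 3 features, 30% positives.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo is not part of the original code; it keeps the
# class ratio the same in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo)

print(X_tr.shape, X_te.shape)     # (80, 3) (20, 3)
print(y_tr.mean(), y_te.mean())   # 0.3 0.3
```

Without stratify, the shapes would be the same but the share of positives in each part could differ from 30%.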
Pipeline is a class in the scikit-learn library that creates a pipeline object chaining several processing steps together. A pipeline organizes data preprocessing and model training into a single unit, which simplifies writing and managing machine learning workflows. It offers the following:
①Data pipeline: A pipeline combines multiple processing steps in sequence. Every step before the last is a data preprocessing operation, such as feature scaling, feature selection, or feature extraction.
②Parameter management: The parameters of every step can be set through the pipeline object itself, using names of the form step__parameter, so the steps do not have to be configured one by one.
③Code simplification: A pipeline wraps multiple processing steps in one object, which simplifies writing and maintaining code. Calling a method of the pipeline object runs the whole process at once, without calling each step separately.
⑤Model persistence: The entire fitted pipeline can be saved to disk as a single model object, which is convenient for later deployment and use.
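The points in the list above can be sketched on toy data. The data, the parameter values, and the use of the standard-library pickle module are assumptions made for illustration; the pipeline is serialized in memory here to keep the example self-contained, and writing it to a file works the same way.

```python
import pickle
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups.
X_demo = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5],
                   [10.0, 11.0], [11.0, 10.0], [10.5, 10.5]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])

# Parameters of any step are set through the pipeline as <step>__<param>.
pipe.set_params(clf__C=10.0, clf__kernel='linear')
pipe.fit(X_demo, y_demo)   # one call runs scaling, then training

# The whole fitted pipeline can be serialized and restored as one object.
restored = pickle.loads(pickle.dumps(pipe))
print(restored.named_steps['clf'].C)
print(restored.predict([[0.2, 0.3], [10.2, 10.4]]))
```

The restored object carries both the fitted scaler and the fitted classifier, so new data can be passed to it directly.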
In the code, the first step of the pipeline is StandardScaler(), and "scl" is the name given to this scaling step. StandardScaler is a preprocessor that standardizes data: it rescales each feature to mean 0 and variance 1. This changes the scale of a feature, not the shape of its distribution. The purpose of this step is to improve model training, since SVMs are sensitive to feature scale.
The second step of the pipeline is SVC(random_state=1), the support vector machine classifier. The default parameters are used here. random_state=1 sets a random seed so that results are consistent on every run (in SVC the seed is only used when probability=True). "clf" stands for classifier and is the name of this step. The fitted classifier and its attributes can be reached through pipe_svc.named_steps['clf'].
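What standardization does to each feature can be checked on a small example. The matrix below is hypothetical, and the sketch shows only the scaling step, not the full pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X_demo)   # (x - mean) / std, column by column

print(scaler.mean_)         # per-feature means learned from the data
print(X_std.mean(axis=0))   # approximately 0 for every column
print(X_std.std(axis=0))    # approximately 1 for every column
```

After scaling, both columns contribute on a comparable scale, which matters for distance-based models such as SVMs.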
pipe_svc.fit(X_train, y_train) calls the fit method of the pipeline object, training the model on the training set X_train and the corresponding labels y_train. The trained model is then used to predict the test set X_test, and the predictions are stored in y_pred. The test set features are scaled with the same transformation that was fitted on the training set.
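To illustrate what fit and predict do inside the pipeline, the sketch below writes the same steps out by hand and compares the two sets of predictions. The data are randomly generated stand-ins, not the WDBC data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(1)
# Hypothetical data: two Gaussian blobs.
X_tr = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y_tr = np.array([0] * 20 + [1] * 20)
X_te = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(4, 1, (5, 2))])

pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
pipe.fit(X_tr, y_tr)
pred_pipe = pipe.predict(X_te)

# The same steps written out by hand.
scaler = StandardScaler().fit(X_tr)   # statistics come from the training data only
clf = SVC(random_state=1).fit(scaler.transform(X_tr), y_tr)
pred_manual = clf.predict(scaler.transform(X_te))

print(np.array_equal(pred_pipe, pred_manual))
```

The scaler is never refitted on the test data; it only applies the mean and standard deviation learned from the training set.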
4. Confusion matrix, precision, recall, F1 score
# Compute and visualize the confusion matrix
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)  # compute the confusion matrix
print(confmat)
fig, ax = plt.subplots(figsize=(2.5, 2.5))
ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
for i in range(confmat.shape[0]):
    for j in range(confmat.shape[1]):
        ax.text(x=j, y=i, s=confmat[i, j], va='center', ha='center')
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()
# Precision, recall, F1
print('precision:%.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('recall:%.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1:%.3f' % f1_score(y_true=y_test, y_pred=y_pred))
confusion_matrix is a function in the scikit-learn library that computes the confusion matrix of a classification model, here on the test set.
In the code, the y_true parameter holds the true label values, and the y_pred parameter holds the label values predicted by the model. Here the true labels are those of the test set, and the predicted labels are the model's predictions on the test set.
plt.subplots(figsize=(2.5, 2.5)) creates the figure and axes objects with a figure size of 2.5x2.5 inches, and ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3) draws the confusion matrix on the axes. cmap=plt.cm.Blues selects a blue palette for the color mapping, and alpha=0.3 sets the transparency.
Two nested loops traverse each element of the confusion matrix, and ax.text() adds a text label at the corresponding position. x=j and y=i set the position of the text label, and s=confmat[i, j] sets its content, that is, the corresponding element of the confusion matrix. va='center' and ha='center' center the text label in the cell.
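The three scores printed in the code can be derived directly from the entries of the confusion matrix. The sketch below uses hypothetical labels (not results from the WDBC model) and checks the hand-computed values against the scikit-learn functions.

```python
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score)

# Hypothetical labels, not results from the WDBC model.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1, rows are true classes and columns are predicted ones,
# so ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(tn, fp, fn, tp)   # 5 1 1 3
print('precision:%.3f recall:%.3f F1:%.3f' % (precision, recall, f1))
print(abs(precision - precision_score(y_true, y_hat)) < 1e-12)
```

By default these scikit-learn functions treat label 1 as the positive class, which matches the hand computation here.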
5. Results