KNN Classification: breast_cancer

Dataset: https://scikit-learn.org/stable/datasets/

Features (30):
mean radius 569 non-null float64
mean texture 569 non-null float64
mean perimeter 569 non-null float64
mean area 569 non-null float64
mean smoothness 569 non-null float64
mean compactness 569 non-null float64
mean concavity 569 non-null float64
mean concave points 569 non-null float64
mean symmetry 569 non-null float64
mean fractal dimension 569 non-null float64
radius error 569 non-null float64
texture error 569 non-null float64
perimeter error 569 non-null float64
area error 569 non-null float64
smoothness error 569 non-null float64
compactness error 569 non-null float64
concavity error 569 non-null float64
concave points error 569 non-null float64
symmetry error 569 non-null float64
fractal dimension error 569 non-null float64
worst radius 569 non-null float64
worst texture 569 non-null float64
worst perimeter 569 non-null float64
worst area 569 non-null float64
worst smoothness 569 non-null float64
worst compactness 569 non-null float64
worst concavity 569 non-null float64
worst concave points 569 non-null float64
worst symmetry 569 non-null float64
worst fractal dimension 569 non-null float64
Label:
type 569 non-null int64

1. Import the libraries

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import matplotlib as mpl
import matplotlib.pyplot as plt
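# Use the SimHei font so that Chinese characters in the plot render correctly.
mpl.rcParams["font.family"] = "SimHei"
# Make sure the minus sign (-) still displays properly when a Chinese font is active.
mpl.rcParams["axes.unicode_minus"] = False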

2. Data preprocessing

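# Load the dataset (the first row of the CSV is the header).
data = pd.read_csv(r"cancer.csv", header=0)
#data.sample(30)
#data.info()
# Check the summary statistics for outliers
#data.describe()
# Check whether there are duplicate rows
#data.duplicated().any()
# If there are duplicates, they can be dropped like this
# data.drop_duplicates(inplace=True)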
data["type"].value_counts()
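
The file cancer.csv is not included with this post. Assuming it is simply an export of scikit-learn's built-in breast cancer dataset with the label stored in a last column named "type" (as the listing at the top suggests), a DataFrame with the same layout can probably be rebuilt as in the sketch below. The variable name `bunch` is only illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled dataset; as_frame=True returns pandas objects.
bunch = load_breast_cancer(as_frame=True)

# Put the 30 features first and the label last, which matches the
# iloc[:, :-1] / iloc[:, -1] split used in section 3.
# Assumption: the label column in cancer.csv is called "type".
data = bunch.data.copy()
data["type"] = bunch.target

print(data.shape)                   # rows, columns (30 features + 1 label)
print(data["type"].value_counts())  # in scikit-learn, 0 = malignant, 1 = benign

# Optionally write the file so that pd.read_csv("cancer.csv") works as above.
# data.to_csv("cancer.csv", index=False)
```

If the original CSV was produced differently (different column order or label coding), the later steps would need to be adjusted accordingly.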

3. Classification with KNN

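# Split the loaded data into features X (all columns but the last) and label y (the last column).
X, y = data.iloc[:, :-1], data.iloc[:, -1]
# Use train_test_split to divide the data into a training set and a test set; the test set takes 25% of the samples.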
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)
#display(len(train_y))
#display(len(test_y))
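
As a quick check on the split, the sketch below prints the sizes of the two subsets and, as an optional variation that is not part of the original post, repeats the split with `stratify=y` so that both subsets keep roughly the same class ratio as the full data. It loads the data from scikit-learn directly, on the assumption that cancer.csv holds the same 569 samples; the `s_`-prefixed names are only illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: these are the same 569 samples as in cancer.csv.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Same call as in the post: 25% of the samples go to the test set.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(train_y), len(test_y))  # 426 143

# Optional variation: a stratified split preserves the class proportions.
s_train_X, s_test_X, s_train_y, s_test_y = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
# Share of label 1 in the full data, the plain split, and the stratified split.
print(y.mean(), train_y.mean(), s_train_y.mean())
```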

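# Instantiate the KNN model (default n_neighbors=5)
knn = KNeighborsClassifier()
# Train the model
knn.fit(train_X, train_y)
# Predict on the test set
result = knn.predict(test_X)
# Evaluate the model: number of correct predictions, number of test samples, accuracy
# (display() is available in Jupyter/IPython; use print() in a plain script)
display(np.sum(result == test_y))
display(len(result))
display(np.sum(result == test_y) / len(result))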
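
The manual accuracy calculation above should match `accuracy_score` from `sklearn.metrics`, and a confusion matrix shows which class the errors fall in. Because KNN relies on distances and the features here have very different scales (compare mean area with mean smoothness), it may also be worth comparing against a standardized pipeline. The sketch below does both; it is not part of the original post, it loads the data from scikit-learn on the assumption that cancer.csv is equivalent, and the name `scaled_knn` is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: same data and same split as in the post.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baseline from the post: KNN on the raw features.
knn = KNeighborsClassifier()
knn.fit(train_X, train_y)
result = knn.predict(test_X)

# Manual accuracy (as in the post) versus the library helper.
manual_acc = (result == test_y).sum() / len(result)
acc = accuracy_score(test_y, result)
print(manual_acc, acc)

# Rows are true labels, columns are predicted labels (0 = malignant, 1 = benign).
cm = confusion_matrix(test_y, result)
print(cm)

# Variation: standardize each feature before computing distances.
# The scaler is fitted on the training data only, inside the pipeline.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(train_X, train_y)
scaled_acc = scaled_knn.score(test_X, test_y)
print(scaled_acc)
```

Whether scaling helps on this particular split is something to verify by running the comparison rather than assume.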

4. Visualization

t0 = train_X[train_y == 0]
t1 = train_X[train_y == 1]

# Set the figure size
plt.figure(figsize=(15, 10))
# Plot the training data (label 0 = "恶性" malignant, label 1 = "良性" benign)
plt.scatter(x=t0["mean radius"], y=t0["mean texture"], color="r", label="恶性")
plt.scatter(x=t1["mean radius"], y=t1["mean texture"], color="g", label="良性")
# Plot the test data, split into correctly and incorrectly classified samples
right = test_X[result == test_y]
wrong = test_X[result != test_y]
plt.scatter(x=right["mean radius"], y=right["mean texture"], color="c", marker="x", label="right")
plt.scatter(x=wrong["mean radius"], y=wrong["mean texture"], color="m", marker=">", label="wrong")
plt.xlabel("半径平均值")  # mean radius
plt.ylabel("纹理平均值")  # mean texture
plt.title("KNN分类结果")  # KNN classification result
plt.legend(loc="best")
plt.show()

Reposted from blog.csdn.net/weixin_42295205/article/details/91470063