sklearn experiment 2 - using KNN to classify the iris dataset

1. Purpose of the experiment

  1. Understand the basic principle of the nearest-neighbor method and be able to implement the K-nearest neighbor (KNN) algorithm;

  2. Be able to apply KNN for classification to specific application scenarios and data;

  3. Become familiar with the impact of different numbers of neighbors on the algorithm.

2. Experimental principle

The KNN algorithm is a simple classification algorithm. It uses all known samples as a reference, computes the distance between an unknown sample and every known sample, and selects the K known samples closest to the unknown sample. Following the majority voting rule, the unknown sample is assigned to the category that accounts for the largest proportion among its K nearest neighbors.
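
To make the principle concrete, the following is a minimal from-scratch sketch of the majority-voting rule (for illustration only; the experiment itself uses scikit-learn's KNeighborsClassifier):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query sample to every known sample
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # majority voting rule: the most frequent class among the k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]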

3. Experimental content

  1. Read the iris dataset.
  2. KNN algorithm implementation:
    1. With the default number of neighbors and weights, what is the prediction accuracy?
    2. Increase or decrease the number of neighbors to observe the impact on the algorithm;
    3. Change the weight settings and observe the effect on the algorithm.
  3. Use sepal length and sepal width, set the number of neighbors to 5, and plot the KNN classifier.

4. Experiment code

1. Import the relevant libraries to be used in the experiment

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

2. Load and split the iris dataset (all four features are used for classification here)

iris_data = load_iris()
X = iris_data["data"]
y = iris_data["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

3. Use the default number of neighbors and weights to view the prediction accuracy

# Use the default number of neighbors and weights and check the prediction accuracy
knn = KNeighborsClassifier()    # defaults: n_neighbors=5, weights='uniform'
knn.fit(X_train, y_train)       # fit the model on the training data
print("Model accuracy:", knn.score(X_test, y_test))

The KNeighborsClassifier defaults to n_neighbors=5, i.e. an unknown sample is classified using the 5 training samples closest to it, and to weights='uniform', meaning all points in the neighborhood are weighted equally. Setting weights='distance' weights each point by the reciprocal of its distance, so the nearest neighbors of a query point have more influence than neighbors that are further away.

output:

Model accuracy: 0.9736842105263158
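
As a side note, score() here is plain classification accuracy; a quick manual check (a small sketch, assuming the variables above are still in scope) computes the same value:

# accuracy = fraction of test samples whose predicted label matches the true label
y_pred = knn.predict(X_test)
print("Manual accuracy:", np.mean(y_pred == y_test))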

4. View the training points used to classify the first sample in the test set

# View which training points are the neighbors of the first test sample
n = knn.kneighbors_graph(X_test)
print(n.toarray()[0])

kneighbors_graph() computes the (weighted) graph of k neighbors for the points in X_test; its output is a sparse matrix, which is converted to a dense array for printing. The row for the first test sample is shown below: each 1 marks a training sample that is among its 5 nearest neighbors.

output:

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
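
The same information can be read more directly with kneighbors(), which returns the distances to, and the training-set indices of, the nearest neighbors. A small sketch using the fitted model above (the indices correspond to the columns equal to 1 in the row printed above):

# distances and training-set indices of the 5 nearest neighbors of the first test sample
distances, indices = knn.kneighbors(X_test[:1])
print("neighbor indices:", indices)
print("neighbor distances:", distances)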

5. Increase the number of neighbors to see the impact on the algorithm

# Increase the number of neighbors and check the impact on the accuracy
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, y_train)       # fit the model on the training data
print("Model accuracy:", knn.score(X_test, y_test))

output:

Model accuracy: 0.9736842105263158

6. Change the weight setting to see the impact on the algorithm

# Change the weights setting and check the impact on the accuracy
knn = KNeighborsClassifier(weights='distance')  # 'distance' weights points by the reciprocal of their distance, so closer points have more influence
knn.fit(X_train, y_train)
print("Model accuracy:", knn.score(X_test, y_test))

output:

Model accuracy: 0.9736842105263158
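
To illustrate what weights='distance' does (a sketch of the idea, not scikit-learn's actual code path): each neighbor's vote is weighted by the reciprocal of its distance, and the class with the largest total weight wins. For the first test sample this looks roughly like:

# inverse-distance-weighted vote for the first test sample (assumes no zero distances)
distances, indices = knn.kneighbors(X_test[:1])
weights = 1.0 / distances[0]                 # weight = reciprocal of the distance
votes = np.zeros(3)                          # one accumulator per iris class
for w, idx in zip(weights, indices[0]):
    votes[y_train[idx]] += w                 # add the neighbor's weight to its class
print("weighted votes:", votes, "predicted class:", np.argmax(votes))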

7. View the accuracy of different n_neighbors settings on the training set and test set

# Accuracy on the training set and test set for different n_neighbors settings
training_accuracy = []
test_accuracy = []
# n_neighbors from 1 to 15
neighbors_settings = range(1, 16)
for n_neighbors in neighbors_settings:
    # build the model
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # record training-set accuracy
    training_accuracy.append(clf.score(X_train, y_train))
    # record generalization (test-set) accuracy
    test_accuracy.append(clf.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()

output:

(Figure: training-set and test-set accuracy plotted against n_neighbors.)
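
As a small follow-up sketch, the n_neighbors value with the highest test-set accuracy on this particular split can be read off the recorded list:

# the index of the best test-set accuracy maps back to an n_neighbors value
best_k = neighbors_settings[int(np.argmax(test_accuracy))]
print("best n_neighbors on this split:", best_k)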
8. Use sepal length and sepal width, set the number of neighbors to 5, and plot the KNN classifier

# Use sepal length and sepal width, set the number of neighbors to 5, and plot the KNN classifier
X = iris_data["data"][:, (0, 1)]
y = iris_data["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

from matplotlib.colors import ListedColormap
h = .02  # step size of the mesh grid
n_neighbors = 5

# colour maps for the decision regions and the data points
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ListedColormap(['darkorange', 'c', 'darkblue'])

for weights in ['uniform', 'distance']:
    # create an instance of the nearest-neighbor classifier and fit the data
    clf = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    clf.fit(X, y)

    # plot the decision boundary: assign a colour to each point
    # of the mesh [x_min, x_max] x [y_min, y_max]
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # put the result into a colour plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # plot the training data
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xlabel('sepal length')
    plt.ylabel('sepal width')
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))

plt.show()

output:

(Figures: KNN decision regions on sepal length and sepal width, for weights='uniform' and weights='distance'.)

5. Experiment summary

The effect of the number of neighbors in KNN:
as the number of neighbors increases, the decision boundary becomes smoother, and a smoother boundary corresponds to a simpler model.
Using fewer neighbors corresponds to higher model complexity, while using more neighbors corresponds to lower model complexity.

The effect of the weighting scheme in KNN:
KNN uses uniform weights by default, i.e. every point in the neighborhood carries the same weight. If distance-based weights are used instead, neighbors closer to the point being classified receive larger weights and therefore play a more important role in the decision.

Origin: blog.csdn.net/lyb06/article/details/130162731