Implementing the KNN (k-nearest neighbors) algorithm in Python

The KNN algorithm is probably the most intuitive of the standard data mining algorithms. To classify a new individual, it searches the training set for the individuals most similar to it, looks at which category most of those neighbors belong to, and assigns the new individual to that category.
KNN can be applied to almost any data set. However, because classifying each new individual requires computing its distance to every individual in the training set, the amount of computation is large.
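To make the idea concrete, here is a minimal from-scratch sketch of the prediction step. It is illustrative only (knn_predict is a hypothetical helper, not part of this post's code, and the rest of the post uses scikit-learn's implementation):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # X_train and y_train are NumPy arrays
    # Distance from the new individual to every training individual
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training individuals
    nearest = np.argsort(distances)[:k]
    # Majority vote among those neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]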

Data set selection

This example uses the ionosphere data set. Each row contains 35 values: the first 34 are readings collected by the antenna, and the last is either "g" or "b", indicating whether the reading is good or bad, that is, whether it provides useful information. Our task is to use the KNN algorithm to determine the quality of a reading automatically.
The following is a partial view of the data:

1,0,0.47337,0.19527,0.06213,-0.18343,0.62316,0.01006,0.45562,-0.04438,0.56509,0.01775,0.44675,0.27515,0.71598,-0.03846,0.55621,0.12426,0.41420,0.11538,0.52767,0.02842,0.51183,-0.10651,0.47929,-0.02367,0.46514,0.03259,0.53550,0.25148,0.31953,-0.14497,0.34615,-0.00296,g
1,0,0.59887,0.14689,0.69868,-0.13936,0.85122,-0.13936,0.80979,0.02448,0.50471,0.02825,0.67420,-0.04520,0.80791,-0.13748,0.51412,-0.24482,0.81544,-0.14313,0.70245,-0.00377,0.33333,0.06215,0.56121,-0.33145,0.61444,-0.16837,0.52731,-0.02072,0.53861,-0.31262,0.67420,-0.22034,g
1,0,0.84713,-0.03397,0.86412,-0.08493,0.81953,0,0.73673,-0.07643,0.71975,-0.13588,0.74947,-0.11677,0.77495,-0.18684,0.78132,-0.21231,0.61996,-0.10191,0.79193,-0.15711,0.89384,-0.03397,0.84926,-0.26115,0.74115,-0.23312,0.66242,-0.22293,0.72611,-0.37792,0.65817,-0.24841,g
1,0,0.87772,-0.08152,0.83424,0.07337,0.84783,0.04076,0.77174,-0.02174,0.77174,-0.05707,0.82337,-0.10598,0.67935,-0.00543,0.88043,-0.20924,0.83424,0.03261,0.86413,-0.05978,0.97283,-0.27989,0.85054,-0.18750,0.83705,-0.10211,0.85870,-0.03261,0.78533,-0.10870,0.79076,-0.00543,g
1,0,0.74704,-0.13241,0.53755,0.16996,0.72727,0.09486,0.69565,-0.11067,0.66798,-0.23518,0.87945,-0.19170,0.73715,0.04150,0.63043,-0.00395,0.63636,-0.11858,0.79249,-0.25296,0.66403,-0.28656,0.67194,-0.10474,0.61847,-0.12041,0.60079,-0.20949,0.37549,0.06917,0.61067,-0.01383,g
1,0,0.46785,0.11308,0.58980,0.00665,0.55432,0.06874,0.47894,-0.13969,0.52993,0.01330,0.63858,-0.16186,0.67849,-0.03326,0.54545,-0.13525,0.52993,-0.04656,0.47894,-0.19512,0.50776,-0.13525,0.41463,-0.20177,0.53930,-0.11455,0.59867,-0.02882,0.53659,-0.11752,0.56319,-0.04435,g
1,0,0.88116,0.27475,0.72125,0.42881,0.61559,0.63662,0.38825,0.90502,0.09831,0.96128,-0.20097,0.89200,-0.35737,0.77500,-0.65114,0.62210,-0.78768,0.45535,-0.81856,0.19095,-0.83943,-0.08079,-0.78334,-0.26356,-0.67557,-0.45511,-0.54732,-0.60858,-0.30512,-0.66700,-0.19312,-0.75597,g
1,0,0.93147,0.29282,0.79917,0.55756,0.59952,0.71596,0.26203,0.92651,0.04636,0.96748,-0.23237,0.95130,-0.55926,0.81018,-0.73329,0.62385,-0.90995,0.36200,-0.92254,0.06040,-0.93618,-0.19838,-0.83192,-0.46906,-0.65165,-0.69556,-0.41223,-0.85725,-0.13590,-0.93953,0.10007,-0.94823,g
1,0,0.88241,0.30634,0.73232,0.57816,0.34109,0.58527,0.05717,1,-0.09238,0.92118,-0.62403,0.71996,-0.69767,0.32558,-0.81422,0.41195,-1,-0.00775,-0.78973,-0.41085,-0.76901,-0.45478,-0.57242,-0.67605,-0.31610,-0.81876,-0.02979,-0.86841,0.25392,-0.82127,0.00194,-0.81686,g
1,0,0.83479,0.28993,0.69256,0.47702,0.49234,0.68381,0.21991,0.86761,-0.08096,0.85011,-0.35558,0.77681,-0.52735,0.58425,-0.70350,0.31291,-0.75821,0.03939,-0.71225,-0.15317,-0.58315,-0.39168,-0.37199,-0.52954,-0.16950,-0.60863,0.08425,-0.61488,0.25164,-0.48468,0.40591,-0.35339,g
1,0,0.92870,0.33164,0.76168,0.62349,0.49305,0.84266,0.21592,0.95193,-0.13956,0.96167,-0.47202,0.83590,-0.70747,0.65490,-0.87474,0.36750,-0.91814,0.05595,-0.89824,-0.26173,-0.73969,-0.54069,-0.50757,-0.74735,-0.22323,-0.86122,0.07810,-0.87159,0.36021,-0.78057,0.59407,-0.60270,g
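This is the classic UCI Ionosphere data set. If you do not have ionosphere.data locally, here is a sketch for fetching it, assuming the long-standing UCI download path below is still available:

from urllib.request import urlretrieve

# Assumed UCI hosting path for the Ionosphere data set
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "ionosphere/ionosphere.data")
urlretrieve(url, "ionosphere.data")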

Run the algorithm

import numpy as np
import csv
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Feature matrix: 351 samples, each with 34 float features
X = np.zeros((351, 34), dtype="float")
# Class labels: True for "g" (good), False for "b" (bad)
y = np.zeros((351,), dtype="bool")

# Read the data set and fill in the arrays
with open("ionosphere.data", "r") as input_file:
    reader = csv.reader(input_file)
    for i, row in enumerate(reader):
        data = [float(datum) for datum in row[:-1]]
        X[i] = data
        y[i] = row[-1] == 'g'

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# Build the KNN model
estimator = KNeighborsClassifier()
# Train it with the training data
estimator.fit(X_train, y_train)
# Predict on the test set to evaluate performance
y_predicted = estimator.predict(X_test)
# Compute the accuracy
accuracy = np.mean(y_test == y_predicted) * 100
print("{:.1f}%".format(accuracy))

The results are as follows:

86.4%
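As a side note, scikit-learn classifiers also expose a score method that returns the same mean accuracy in a single call, so the last three lines above could be written as:

# score returns the mean accuracy on the given test data
accuracy = estimator.score(X_test, y_test) * 100
print("{:.1f}%".format(accuracy))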

Cross-validation

In the code above, we randomly split the data set into a training set and a test set, trained the algorithm on the training set, and evaluated it on the test set. But because the split is random and we test only once, the result is subject to chance.
Cross-validation addresses this problem of a one-off test. The idea is to split the data set several times, with each split different from the previous ones, while ensuring that each piece of data is used for testing exactly once. The procedure is as follows (a code sketch follows the list):

  1. Divide the entire data set into several parts
  2. For each part:
    (1) use that part as the current test set
    (2) train the algorithm on the remaining parts
    (3) test the algorithm on the current test set
  3. Record each score and compute the average
  4. Throughout this process, each piece of data appears in a test set only once, which reduces (but does not completely eliminate) the luck factor
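Written out by hand with scikit-learn's KFold splitter, the procedure looks roughly like this (a sketch that reuses X, y, np, and KNeighborsClassifier from the code above; the fold count and random_state are arbitrary choices):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=14)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    # One part as the current test set, the rest for training
    estimator = KNeighborsClassifier()
    estimator.fit(X[train_idx], y[train_idx])
    # Score the algorithm on the held-out part
    fold_scores.append(estimator.score(X[test_idx], y[test_idx]))
print("{:.1f}%".format(np.mean(fold_scores) * 100))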

The sklearn library provides a helper function, cross_val_score, that implements exactly these cross-validation steps:

from sklearn.model_selection import cross_val_score

# Perform cross-validation
scores = cross_val_score(estimator, X, y, scoring="accuracy")
# Average the per-fold scores
average_accuracy = np.mean(scores) * 100
print("{:.1f}%".format(average_accuracy))

Note: this code continues from the previous code. The result is as follows:

82.3%

The result is 82.3%, which shows that how the training set and test set are split has a noticeable impact on the measured accuracy of the algorithm. Next, we adjust the parameters to try to improve the accuracy.

Setting parameters

Almost all data mining algorithms have tunable parameters, which allows them to be adapted to different problems and improves their ability to generalize. The parameter settings directly affect the accuracy of the algorithm, and the best values depend on the characteristics of the data set.

The KNN algorithm has several parameters. The most important is n_neighbors: how many neighbors to use as the basis for prediction. Next, we run the experiment with a range of values and observe how the results differ.

from matplotlib import pyplot as plt

# Average score for each parameter value
avg_score = []
# Parameter values from 1 to 20
para = list(range(1, 21))
# Cross-validate for each parameter value and store the result
for n in para:
    estimator = KNeighborsClassifier(n_neighbors=n)
    score = cross_val_score(estimator, X, y, scoring="accuracy")
    avg_score.append(np.mean(score))
# Plot accuracy against the number of neighbors
plt.plot(para, avg_score, "-o")
plt.show()

The results are shown in the figure:
[Figure: average cross-validation accuracy versus n_neighbors]

As the figure shows, the overall trend is that the accuracy of the algorithm decreases as the number of neighbors increases
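Continuing with the para and avg_score lists from the code above, the best-scoring parameter value can also be read off programmatically:

# Index of the highest average score
best_idx = int(np.argmax(avg_score))
print("best n_neighbors: {}, average accuracy: {:.1f}%".format(
    para[best_idx], avg_score[best_idx] * 100))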


Source: blog.csdn.net/mrliqifeng/article/details/82592867