[Python machine learning] KNN for fruit classification and a hands-on classifier example (with source code and data set)


Introduction to KNN Algorithm

The KNN (K-Nearest Neighbor) algorithm is one of the most basic and simplest machine learning algorithms. It can be used for both classification and regression. KNN classifies a sample by measuring the distance between its feature values and those of the labeled samples.

The idea of the KNN algorithm is very simple: every n-dimensional input vector corresponds to a point in the feature space, and the output is the category label or predicted value for that feature vector.
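To make "distance between points in the feature space" concrete, here is a minimal sketch. The article does not say which distance it uses, so Euclidean distance is an assumption here (it is also what scikit-learn's KNeighborsClassifier uses by default). The function name and the two sample fruits are made up for illustration.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Two hypothetical fruits described by (mass in g, width in cm)
apple = [180.0, 8.0]
mandarin = [177.0, 4.0]

d = euclidean_distance(apple, mandarin)
print(d)  # sqrt(3^2 + 4^2) = 5.0
```

The smaller the distance, the more similar the two samples are considered to be. Note that a feature with a large numeric range (such as mass) can dominate the result.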

The KNN algorithm is a rather special machine learning algorithm because it has no learning process in the usual sense. It works by using the training data to partition the feature space, and then using that partition as the final model. We start from a sample data set, also known as the training sample set, in which every sample has a label. In other words, we know which category each sample in the set belongs to.

When unlabeled data is input, each of its features is compared with the corresponding features of the samples in the training set, and the classification labels of the most similar samples (the nearest neighbors) are extracted.

Generally, we only consider the k most similar samples in the training set, which is where the K in KNN comes from. Usually k is an integer no greater than 20. Finally, the category that occurs most often among these k samples is chosen as the classification of the new data.

The prediction process of the KNN classifier is easy to understand: for an input vector x to be predicted, we only need to find the k vectors in the training data set that are closest to x, and then predict the category of x as the category that appears most often among those k samples.
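The prediction steps described above can be sketched in a few lines of NumPy. This is only an illustrative version, not the code used later in the article: the function name knn_predict and the toy data are made up, Euclidean distance is assumed, and ties in the vote are simply resolved in favor of the label seen first.

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    train_X = np.asarray(train_X, dtype=float)
    x = np.asarray(x, dtype=float)
    # Euclidean distance from x to every training sample
    distances = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated groups
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
train_y = ["apple", "apple", "apple", "lemon", "lemon", "lemon"]

prediction = knn_predict(train_X, train_y, [1.1, 1.0], k=3)
print(prediction)  # apple
```

Nothing is "trained" here: the function just keeps the labeled samples and searches them at prediction time, which matches the description of KNN as having no learning process in the usual sense.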

The key hyperparameter of the KNN algorithm is k, and its value has a crucial impact on the prediction results. Next, we discuss how the value of k affects the results and how k is usually chosen.

If k is relatively small, we predict an instance using only the training samples in a small neighborhood around it. The approximation error of the algorithm is then relatively small, because only training samples that are similar to the input instance affect the prediction.

However, this also has an obvious disadvantage: the estimation error of the algorithm is relatively large, and the prediction is very sensitive to the neighboring points. If those neighbors happen to be noise points, the prediction will be wrong. A value of k that is too small therefore easily leads to overfitting.
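A tiny experiment can illustrate this sensitivity to noise. The data below is invented for the purpose: a cluster of class 0 contains one deliberately mislabeled point, and we query a position right next to it with k=1 and with k=5, using scikit-learn's KNeighborsClassifier with its default settings.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up for illustration): a cluster of class 0 that contains
# one mislabeled "noise" point, and a distant cluster of class 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
     [5, 5], [5, 6], [6, 5], [6, 6]]
y = [0, 0, 0, 0, 1,   # the point (0.5, 0.5) carries the wrong label
     1, 1, 1, 1]

query = [[0.6, 0.6]]  # lies in the middle of the class 0 cluster

pred_k1 = KNeighborsClassifier(n_neighbors=1).fit(X, y).predict(query)[0]
pred_k5 = KNeighborsClassifier(n_neighbors=5).fit(X, y).predict(query)[0]

print('k=1:', pred_k1)  # 1: the single nearest neighbor is the noise point
print('k=5:', pred_k5)  # 0: four correctly labeled neighbors outvote it
```

With k=1 the prediction copies the label of the noise point, while k=5 lets the surrounding samples correct it.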

Similarly, if k is relatively large, training samples that are farther away also influence the prediction. The model is then more robust, and individual noise points are unlikely to change the final prediction. But the disadvantage is also obvious: the approximation error becomes large, because points far from the instance (and not similar to it) also affect the prediction and can bias the result considerably. In this case the model is prone to underfitting.
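The extreme case makes the underfitting visible: if k equals the size of the training set, every prediction is just the majority class, no matter where the query point lies. The sketch below uses invented toy data with six samples of class 0 and three of class 1.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up for illustration): six class 0 points, three class 1 points
X = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [2, 0],
     [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

query = [[8.5, 8.5]]  # sits right inside the class 1 cluster

# k=3 only looks at the three nearby class 1 points
pred_k3 = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(query)[0]
# k=9 uses every training sample, so the majority class always wins
pred_all = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y).predict(query)[0]

print('k=3:', pred_k3)    # 1
print('k=9:', pred_all)   # 0
```

With k=9 the distant class 0 samples outvote the three neighbors that are actually similar to the query.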

Therefore, in practice we generally use cross-validation to select k. As the analysis above suggests, k is usually fairly small, so we try the values of k within a small range and take the one with the highest accuracy on the validation folds as the final hyperparameter k.
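One possible way to carry out this selection with scikit-learn is sketched below. Because the fruit data file is not included here, the built-in iris data set stands in for it. The candidate range 1 to 20 follows the rule of thumb mentioned earlier, and 5-fold cross-validation is an assumed choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Iris is used only as a stand-in data set
X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate k
mean_accuracy = {}
for k in range(1, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_accuracy[k] = scores.mean()

# Keep the k with the highest mean accuracy
best_k = max(mean_accuracy, key=mean_accuracy.get)
print('best k:', best_k)
print('mean cross-validated accuracy:', mean_accuracy[best_k])
```

The held-out test set is not touched during this search, so it can still give an honest estimate of accuracy for the chosen k afterwards.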

Fruit Classification Using KNN

Part of the data is as follows

The prediction results and accuracy are as follows

 

Part of the code is as follows

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
# Load the fruit data (tab-separated file)
fruit = pd.read_csv('fruit_data.txt', sep='\t')
# Features: every column after the first
X = fruit.iloc[:, 1:]
# Class labels: the first column
Y = fruit.iloc[:, 0]
# Split into training and test sets
fruit_train_X, fruit_test_X, fruit_train_y, fruit_test_y = train_test_split(X, Y, test_size=0.2, random_state=0)
# Classifier with default settings (k=5)
knn = KNeighborsClassifier()
# Fit on the training set
knn.fit(fruit_train_X, fruit_train_y)
# Predict the fruit type for the test set
predict_result = knn.predict(fruit_test_X)
print('Test set size:', fruit_test_X.shape)
print('True labels:', fruit_test_y)
print('Predicted labels:', predict_result)
print('Accuracy:', knn.score(fruit_test_X, fruit_test_y))

Drawing a KNN Classifier Graph

The classification results are as follows. You can see that the iris data set is roughly divided into three categories.

Part of the code is as follows

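import numpy as np
from sklearn import neighbors, datasets
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Build the KNN model using the first two features
iris = datasets.load_iris()
irisData = iris.data[:, :2]   # sepal length and sepal width
irisTarget = iris.target
clf = neighbors.KNeighborsClassifier(n_neighbors=5) # K=5
clf.fit(irisData, irisTarget)

# Build the grid to plot
ColorMp = ListedColormap(['#005500', '#00AA00', '#00FF00'])
X_min, X_max = irisData[:, 0].min(), irisData[:, 0].max()
Y_min, Y_max = irisData[:, 1].min(), irisData[:, 1].max()
# Grid covering the feature range; the 0.02 step is an assumed value
X, Y = np.meshgrid(np.arange(X_min, X_max, 0.02), np.arange(Y_min, Y_max, 0.02))
# Predict the class of every grid point
label = clf.predict(np.c_[X.ravel(), Y.ravel()])
label = label.reshape(X.shape)
# Draw and show the figure
plt.figure()
plt.pcolormesh(X, Y, label, cmap=ColorMp)
plt.show()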


Origin blog.csdn.net/jiebaoshayebuhui/article/details/128557635