Data mining learning - KNN (k-nearest neighbor)

1. Classification performance of data in different feature dimensions

Take the Iris dataset as an example.

In the Iris dataset the labels 0, 1, and 2 represent the three species Iris setosa, Iris versicolor, and Iris virginica.

(The Iris dataset has four feature dimensions in total.)
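In code, the dataset can be loaded and inspected like this (a minimal sketch using scikit-learn's built-in loader):

from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)     # (150, 4): 150 samples, 4 feature dimensions
print(iris.feature_names)  # sepal length/width and petal length/width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica'] -> labels 0, 1, 2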

[Figure: scatter of sepal length vs. sepal width for the different iris species in the Iris data]

As the figure shows, in the two feature dimensions of sepal length and sepal width, species 0 is fairly well separated from species 1 and 2, so a preliminary classification is possible, but species 1 and 2 still overlap and cannot be told apart.

Now let's look at the Iris dataset in three feature dimensions:

[Figure: 3D scatter of the iris samples over three feature dimensions]

It can be seen that the three classes now separate into distinct layers, and at this point all three categories can be classified.
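Plots like these can be reproduced with matplotlib. The following is a minimal sketch; since the original figures are not shown here, the choice of petal length as the third axis is an assumption:

import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

fig = plt.figure(figsize=(10, 4))

# 2D view: sepal length (column 0) vs. sepal width (column 1)
ax1 = fig.add_subplot(1, 2, 1)
for species in (0, 1, 2):
    mask = y == species
    ax1.scatter(X[mask, 0], X[mask, 1], label=iris.target_names[species])
ax1.set_xlabel('sepal length (cm)')
ax1.set_ylabel('sepal width (cm)')
ax1.legend()

# 3D view: add petal length (column 2, an assumed choice) as a third axis
ax2 = fig.add_subplot(1, 2, 2, projection='3d')
for species in (0, 1, 2):
    mask = y == species
    ax2.scatter(X[mask, 0], X[mask, 1], X[mask, 2], label=iris.target_names[species])
ax2.set_xlabel('sepal length')
ax2.set_ylabel('sepal width')
ax2.set_zlabel('petal length')

plt.show()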

This illustrates how the classification of the data changes as it is viewed in different numbers of feature dimensions.

2. Principle of KNN algorithm

KNN is an instance-based learning method that involves no complex mathematical derivation. Its classification decision is made directly from the training dataset, so it is also called a lazy learning algorithm: all computation is deferred until classification time.

The classification procedure is as follows (a from-scratch sketch of these steps is given right after the list):

(1) Compute the distance between the test sample and every training sample in feature space.

(2) Sort the distances in ascending order.

(3) Select the k nearest neighbors.

(4) Predict the class of the test sample by majority vote among those k neighbors.
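A minimal from-scratch sketch of these four steps with NumPy (the function name and sample usage are mine, not from the original post):

import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test point: Euclidean distance + majority vote."""
    # (1) distance from the test point to every training point
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # (2) sort the distances in ascending order
    nearest = np.argsort(distances)
    # (3) take the labels of the k nearest neighbors
    top_k_labels = y_train[nearest[:k]]
    # (4) majority vote decides the predicted class
    values, counts = np.unique(top_k_labels, return_counts=True)
    return values[np.argmax(counts)]

For example, calling knn_predict(iris.data, iris.target, np.array([5.0, 3.5, 1.5, 0.2])) should return 0 (setosa), since that point lies among the setosa samples.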

3. Several common distance calculation methods

(1) Euclidean distance

(2) Manhattan distance (taxicab geometry)

(3) Chebyshev distance

(4) Minkowski distance
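A minimal NumPy sketch of how each of the four distances can be computed (the sample vectors are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

# (1) Euclidean: sqrt(sum((x_i - y_i)^2))
euclidean = np.sqrt(np.sum((x - y) ** 2))

# (2) Manhattan: sum(|x_i - y_i|)
manhattan = np.sum(np.abs(x - y))

# (3) Chebyshev: max(|x_i - y_i|)
chebyshev = np.max(np.abs(x - y))

# (4) Minkowski of order p: (sum(|x_i - y_i|^p))^(1/p)
#     p=1 gives Manhattan, p=2 gives Euclidean, p -> infinity gives Chebyshev
p = 3
minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)

print(euclidean, manhattan, chebyshev, minkowski)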

4. kd tree

The kd tree is a data structure that partitions the k-dimensional space in which the data points live (the k here refers to the dimensionality of the space, not to the k in KNN). It is mainly used for searching key data in multi-dimensional spaces, and it is essentially a balanced binary tree.

When the KNN algorithm classifies a test point, it must compute the distance between the test point and every data point in the training set, sort those distances, and then find the k nearest neighbors. The advantage of this brute-force approach is that it is simple and effective, but when the training set is large the distance computations become time-consuming. To improve the search efficiency of KNN and reduce the number of distance calculations, the k-dimensional (kd) tree method can be used.
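As a minimal sketch, scikit-learn's KDTree can be queried directly; here the query point is simply the first iris sample, reused for illustration:

from sklearn.neighbors import KDTree
from sklearn import datasets

iris = datasets.load_iris()

# Build a kd tree over the 4-dimensional iris feature space
tree = KDTree(iris.data)

# Find the 3 nearest neighbors of the first sample; the tree search
# avoids computing the distance to every single training point
dist, ind = tree.query(iris.data[:1], k=3)
print(dist)  # distances to the 3 nearest neighbors
print(ind)   # indices of those neighbors in the training set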

5. KNN algorithm in practice

from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load the Iris dataset and hold out 30% of it as a test set
iris = datasets.load_iris()
data_train, data_test, target_train, target_test = train_test_split(
    iris.data, iris.target, test_size=0.3)

# Create a KNN classifier; n_neighbors is the k value and
# algorithm selects the neighbor-search structure (a kd tree here)
knn = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
knn.fit(data_train, target_train)

# Pass in the test set to get the model's mean accuracy
score = knn.score(data_test, target_test)
# Predict the label for a given sample
predict = knn.predict([[0.1, 0.2, 0.3, 0.4]])

print(score)
print(predict)

Running the script prints the accuracy score and the predicted label. The exact values vary between runs because train_test_split shuffles the data randomly (pass random_state to make the split reproducible).
