02 Machine learning: the k-nearest neighbor (kNN) algorithm

0. Core idea

As the Chinese proverb goes, "he who stays near vermilion turns red, he who stays near ink turns black." Samples with the same label tend to cluster together, so a prediction for a new sample is made from its k nearest neighbors.

  • kNN is a supervised learning algorithm.
  • kNN can be used for both classification and regression.

1. Algorithm principle

  1. Find the k samples in the training set that are closest to the sample to be predicted. "Closest" here means the distance between feature vectors, measured in feature space.
  2. Based on the k samples obtained in step 1, predict the target attribute value of the sample to be predicted (a minimal sketch of both steps follows this list).
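
A minimal from-scratch sketch of these two steps (the helper knn_predict is illustrative, not from the original post; Euclidean distance and plain majority voting are assumed):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # step 1: distance from the query sample to every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]  # indices of the k closest samples
    # step 2: majority vote over the labels of those k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]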

a. Feature space

With two features, the feature space is a plane, as in the figure below.
[Figure: a two-feature feature space, i.e. a plane]
With three features the feature space is three-dimensional; with 4 or 5 features it can no longer be visualized, but it still exists and is called a hyperspace.

b. Distance

Minkowski distance:

$$d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{\frac{1}{p}}$$

When p = 2 this is the Euclidean distance; when p = 1 it is the Manhattan distance.
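
A direct implementation of the formula (a sketch; scipy.spatial.distance.minkowski computes the same quantity):

import numpy as np

def minkowski(x, y, p):
    # (sum_i |x_i - y_i|^p)^(1/p)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, 2))  # p = 2: Euclidean distance, 5.0
print(minkowski(x, y, 1))  # p = 1: Manhattan distance, 7.0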

c. Choice of k

k is a hyperparameter of the kNN algorithm.
Consider the figure below.
[Figure: the same point to be predicted, with k = 3 and k = 5 neighborhoods drawn around it]
With k = 3 the prediction is red; with k = 5 it is blue.
The value of k affects the prediction, so a suitable value should be chosen experimentally, e.g. by cross-validation.
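
One possible cross-validation search for k (a sketch; the candidate range 1 to 15 is an arbitrary choice, not from the original post):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, Y = load_iris(return_X_y=True)
# try k = 1..15 with 5-fold cross-validation and keep the best value
search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': list(range(1, 16))}, cv=5)
search.fit(X, Y)
print(search.best_params_, search.best_score_)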

Two special cases of k

  • When k = 1 (k is small), the model can reach 100% accuracy on the training set, since every training sample is its own nearest neighbor. But this is a textbook case of overfitting, which is exactly why cross-validation is used (demonstrated in the snippet after this list).
  • When k equals the number of training samples (k is large), every query receives the same prediction, so the accuracy on the training set is just the proportion of the majority class, and the model easily underfits.
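
The k = 1 case is easy to check in code (a sketch on the iris data, which is also used in the sklearn example later in this post):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, Y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=1).fit(X, Y)
# each training sample is its own nearest neighbor, so training accuracy is 1.0
# (assuming no identical samples carry conflicting labels)
print(clf.score(X, Y))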

d. Decision rules

  • In classification models, the majority voting rule or the weighted majority voting rule is mainly used.
  • In regression models, the averaging rule or the weighted averaging rule is mainly used.

Classification task: weighted majority voting

Each neighboring sample gets a different weight: the closer the neighbor, the larger its say in the vote. The weight is usually taken to be inversely proportional to the distance. Rather than one vote per neighbor, the weights are summed per class, and the final prediction is the class with the largest total weight.
For example, in the figure below, suppose the three red points are at distance 2 from the point to be predicted, so each has weight 1/2, and the two yellow points are at distance 1, so each has weight 1. The final class of the blue circle is then yellow (1 + 1 > 1/2 + 1/2 + 1/2).
[Figure: three red points at distance 2 and two yellow points at distance 1 from the blue circle to be predicted]
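
A minimal sketch of this weighted vote with the numbers from the figure (the labels and distances are the example's, not real data):

# five neighbors as (label, distance to the point to be predicted)
neighbors = [('red', 2), ('red', 2), ('red', 2), ('yellow', 1), ('yellow', 1)]
votes = {}
for label, dist in neighbors:
    votes[label] = votes.get(label, 0) + 1 / dist  # weight inversely proportional to distance
print(votes)                      # {'red': 1.5, 'yellow': 2.0}
print(max(votes, key=votes.get))  # 'yellow' wins the weighted vote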

Weight normalization

There is one remaining issue: the weights of these five neighbors sum not to 1 but to 7/2. It is best to normalize, i.e. divide each neighbor's weight by 7/2; for example, a red point's weight of 1/2 becomes (1/2) / (7/2) = 1/7.
Skipping normalization is also fine for classification, since scaling all weights by the same constant does not change which class wins.

Regression task: weighted average method

As in classification, each neighboring sample gets a different weight: the closer the neighbor, the larger its contribution, with the weight usually inversely proportional to the distance. For example, in the figure below, suppose the three red points have value 3 and are at distance 2 from the point to be predicted, giving each a normalized weight of 1/7 ((1/2) divided by 7/2); the two yellow points have value 2 and are at distance 1, giving each a normalized weight of 2/7 (1 divided by 7/2).
[Figure: three red points (value 3) at distance 2 and two yellow points (value 2) at distance 1 from the point to be predicted]
Prediction result:

$$\hat{y} = \sum_i w_i y_i = 3 \times \frac{1}{7} \times 3 + 2 \times \frac{2}{7} \times 2 = \frac{9}{7} + \frac{8}{7} = \frac{17}{7} \approx 2.43$$
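
The same computation in code (a sketch using the example's values and distances):

# three red points with value 3 at distance 2; two yellow points with value 2 at distance 1
values = [3, 3, 3, 2, 2]
distances = [2, 2, 2, 1, 1]
raw_w = [1 / d for d in distances]         # inverse-distance weights: 1/2, ..., 1
norm_w = [w / sum(raw_w) for w in raw_w]   # normalize so the weights sum to 1
print(sum(w * v for w, v in zip(norm_w, values)))  # 17/7 ≈ 2.43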

2. sklearn code

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as knn

# 00 Load the data
X, Y = load_iris(return_X_y=True)
# 01 Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
# 02 Train (k = 3, uniform weights, kd-tree neighbor search)
knn_clf = knn(n_neighbors=3, weights='uniform', algorithm='kd_tree')
knn_clf.fit(X_train, y_train)
# 03 Evaluate
print('the score in train data:', knn_clf.score(X_train, y_train))
print('the score in test data:', knn_clf.score(X_test, y_test))
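
The weighted voting discussed above corresponds to sklearn's weights='distance' option, and for regression sklearn provides KNeighborsRegressor with the same interface. A minimal sketch on made-up one-dimensional data (the numbers are illustrative, not from the original post):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy data: a roughly linear relationship
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 4.0, 5.1])
reg = KNeighborsRegressor(n_neighbors=3, weights='distance')  # inverse-distance weighted average
reg.fit(X, y)
print(reg.predict([[2.5]]))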

Origin blog.csdn.net/qq_42911863/article/details/125692684