This article practices implementing the K-nearest neighbor (KNN) algorithm in code.
Algorithm idea:
Compute the distance from each test sample to every training sample and take the labels of the k training samples closest to the test sample. The test sample is assigned to whichever class appears most often among those k labels.
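The voting rule can also be written compactly. The notation here is introduced only for illustration and does not come from the original text: N_k(x) is the set of indices of the k training samples nearest to the test sample x, y_i is the label of training sample i, and the indicator equals 1 when its condition holds and 0 otherwise.

```latex
\hat{y}(x) = \arg\max_{c} \sum_{i \in N_k(x)} \mathbb{1}[\, y_i = c \,]
```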
Advantages: simple idea; no training process required; suitable for multi-class classification problems.
Disadvantage: the computational cost is high when there are many training samples, because each prediction computes the distance to every one of them.
Steps:
1. Calculate the Euclidean distance from the test sample to each training sample.
2. Sort the distances and take the labels of the k nearest samples.
3. The label that appears most often among those k labels is the predicted label of the test sample.
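The three steps can be sketched for a single test sample as follows. This is a minimal illustration rather than the article's own implementation: the function name `knn_predict` and its parameter names are hypothetical, and the toy data reuses the four training points from the examples in this article.

```python
from collections import Counter

import numpy as np


def knn_predict(test_point, train_x, train_y, k):
    """Predict one label by majority vote among the k nearest training samples."""
    # Step 1: Euclidean distance from the test point to every training sample.
    distances = np.sqrt(((train_x - test_point) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances, then their labels.
    nearest = np.argsort(distances)[:k]
    nearest_labels = [train_y[i] for i in nearest]
    # Step 3: the most frequent label wins. On a tie, Counter keeps
    # first-seen order, so the label of the closer sample is returned.
    return Counter(nearest_labels).most_common(1)[0][0]


train_x = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
train_y = [-1, -1, 1, 1]
prediction = knn_predict(np.array([1.0, 0.9]), train_x, train_y, k=3)
print(prediction)
```

Using an odd k with two classes, as here, avoids tied votes altogether.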
Implementation with the scikit-learn machine learning library:
    from sklearn import neighbors

    # Custom two-dimensional list as the training set
    train = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
    label = [-1, -1, 1, 1]

    # Get the classifier
    knn = neighbors.KNeighborsClassifier(n_neighbors=2)
    knn.fit(train, label)
    print(knn.predict([[1.0, 0.9]]))

    # Return the class probabilities for each test sample
    print(knn.predict_proba([[1.0, 0.9]]))
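To see which training samples drive a prediction, the fitted classifier can be asked for the neighbors themselves. The sketch below assumes scikit-learn is installed and repeats the same training data so that it runs on its own. `kneighbors` and `classes_` are part of the `KNeighborsClassifier` interface.

```python
from sklearn import neighbors

train = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
label = [-1, -1, 1, 1]

knn = neighbors.KNeighborsClassifier(n_neighbors=2)
knn.fit(train, label)

# kneighbors returns, for each query row, the distances to its nearest
# training samples and the row indices of those samples in `train`.
distances, indices = knn.kneighbors([[1.0, 0.9]])
print(distances)
print(indices)

# classes_ gives the label order used by the columns of predict_proba.
print(knn.classes_)
```

The indices point back into `train`, so `label[i]` for each returned index gives the votes that produced the prediction.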
Python implementation
    import numpy as np

    # Create a custom training dataset and save it to a txt file
    Train = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    Label = np.array([[-1], [-1], [1], [1]])
    TrainSet = np.concatenate((Train, Label), axis=1)
    print(TrainSet.shape)
    np.savetxt("KNNTrainSet.txt", TrainSet)

    # Create a custom test dataset and save it to a txt file
    TestSet = np.array([[1.0, 0.9], [0.1, 0.2]])
    np.savetxt("KNNTestSet.txt", TestSet)

    # Load the training data (features and labels) and the test data
    a = np.loadtxt('KNNTrainSet.txt', usecols=(0, 1), unpack=True)
    train = a.T
    print(train.shape)
    b = np.loadtxt('KNNTrainSet.txt', usecols=2, unpack=True)
    label = b.T
    c = np.loadtxt('KNNTestSet.txt', usecols=(0, 1), unpack=True)
    test = c.T

    # Build the KNN classifier
    # t: one test sample, x: training samples, y: training labels, k: number of neighbors
    def Classifier(t, x, y, k):
        # Get the number of training samples
        trainSize = x.shape[0]
        # Difference between the test sample and each training sample
        diff = np.tile(t, (trainSize, 1)) - x
        sqdiff = diff ** 2
        sumdiff = sqdiff.sum(axis=1)
        Distance = sumdiff ** 0.5
        print('Euclidean distance:', Distance)
        sorteDistance = Distance.argsort()
        print('Indices of the training samples in ascending order of distance:', sorteDistance)
        # Create an array to store the k nearest labels
        KLabel = np.zeros(k)
        class1 = 0
        class2 = 0
        print(KLabel)
        for i in range(k):
            nearLabel = y[sorteDistance[i]]
            print('Label of the nearest sample:', nearLabel)
            KLabel[i] = nearLabel
            # Count how many of the k labels fall in each class
            if nearLabel == -1:
                class1 = class1 + 1
            if nearLabel == 1:
                class2 = class2 + 1
        print("Number of samples per class: ", class1, " ", class2)
        # The class with more votes wins; a tie falls through to class 1
        if class1 > class2:
            result = -1
        else:
            result = 1
        return result

    # Call the classifier on each sample of the test dataset
    testnum = test.shape[0]
    for i in range(testnum):
        value = Classifier(test[i], train, label, 2)
        print("The classification result of sample {0} is {1}".format(i, value))
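The classifier above counts only the labels -1 and 1 and handles one test sample per call. A possible generalization, sketched here under assumptions, accepts any set of labels and classifies all test samples at once using NumPy broadcasting. The function name `knn_predict_batch` is hypothetical, and the three-class toy data is made up purely for illustration.

```python
import numpy as np


def knn_predict_batch(test_x, train_x, train_y, k):
    """Classify every row of test_x by majority vote among its k nearest neighbors."""
    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features).
    diff = test_x[:, np.newaxis, :] - train_x[np.newaxis, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))  # shape (n_test, n_train)
    # Indices of the k nearest training samples for each test row.
    nearest = np.argsort(distances, axis=1)[:, :k]
    predictions = []
    for row in nearest:
        # Count how often each label occurs among the k neighbors.
        labels, counts = np.unique(train_y[row], return_counts=True)
        # argmax returns the first maximum, so ties go to the smallest label.
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Three classes (0, 1, 2); this toy data is made up for illustration.
train_x = np.array([[0.0, 0.0], [0.0, 0.1], [1.0, 1.0],
                    [1.0, 1.1], [2.0, 2.0], [2.1, 2.0]])
train_y = np.array([0, 0, 1, 1, 2, 2])
test_x = np.array([[0.1, 0.2], [1.0, 0.9], [1.9, 2.1]])
result = knn_predict_batch(test_x, train_x, train_y, k=2)
print(result)
```

The distance matrix holds one entry per pair of test and training samples, which is where the cost mentioned under the disadvantage comes from as the training set grows.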