KNN in Practice: Handwritten Digit Recognition
1. Introduction
The k-nearest neighbor method (k-NN), proposed by Cover T and Hart P in 1967, is an essential method for classification and regression. Its working principle is as follows: we start with a sample data set, also known as the training set, in which every sample carries a label, so we know the correspondence between each sample and its category. When new, unlabeled data arrives, we compare its features against those of every sample in the training set and take the class labels of the most similar (nearest) samples. In general we consider only the k most similar samples, which is where the k in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the class that appears most often among those k samples is chosen as the classification of the new data.
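The voting procedure described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation; the helper name `knn_classify` and the toy points are assumptions:

```python
import numpy as np

def knn_classify(x, data, labels, k=3):
    """Classify sample x by majority vote among its k nearest
    training samples, using Euclidean distance."""
    # distance from x to every training sample
    dists = np.sqrt(((data - x) ** 2).sum(axis=1))
    # labels of the k closest samples
    nearest = [labels[i] for i in np.argsort(dists)[:k]]
    # majority vote among those labels
    return max(set(nearest), key=nearest.count)

# toy example: two small clusters
data = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify(np.array([0.2, 0.1]), data, labels, k=3))  # A
```

The query point lies near the first cluster, so two of its three nearest neighbors are labeled 'A' and the vote returns 'A'.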
2. The Dataset
The data set consists of handwritten digits stored as 32x32 text files; as the loading code below shows, the true digit of each file is encoded in its filename before the underscore.
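As a rough illustration of the format (a hypothetical 4x4 miniature, not an actual file from the dataset), each file is a grid of pixel characters that gets flattened into a single row vector:

```python
import numpy as np

# hypothetical 4x4 miniature of the 32x32 text format:
# each line is a string of '0'/'1' characters
sample = [
    "0110",
    "1001",
    "1001",
    "0110",
]

# flatten the grid into one 1x16 row vector, just as the real
# 32x32 files are flattened into 1x1024 vectors
vect = np.array([[int(c) for row in sample for c in row]], dtype=float)
print(vect.shape)  # (1, 16)
```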
3. Code Implementation
3.1. Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
3.2. Reading the Training Data
# get the data files
fileList = os.listdir('./data/trainingDigits/')
# define the label list
trainingIndex = []
# collect the labels: the digit is encoded before the underscore in the filename
for filename in fileList:
    trainingIndex.append(int(filename.split('_')[0]))
# define the data matrix
trainingData = np.zeros((len(trainingIndex), 1024))
trainingData.shape
# (3868, 1024)

# fill the data matrix
index = 0
for filename in fileList:
    # open in text mode so that line[j] is a character, not a byte value
    with open('./data/trainingDigits/%s' % filename) as f:
        # a 1x1024 row vector for this image
        vect = np.zeros((1, 1024))
        # loop over the 32 lines of the file
        for i in range(32):
            line = f.readline()
            # each line holds 32 pixel characters
            for j in range(32):
                vect[0, 32 * i + j] = int(line[j])
        trainingData[index, :] = vect
        index += 1
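Since the same loading loop is needed for both directories, one possible refactoring pulls it into a helper. This is a sketch, not the article's code; the function name and the synthetic demo directory are assumptions:

```python
import os
import tempfile
import numpy as np

def load_digits_dir(path):
    """Load every 32x32 text digit in `path` into an (n, 1024) matrix
    plus a label list (label = part of the filename before '_')."""
    files = sorted(os.listdir(path))
    labels = [int(name.split('_')[0]) for name in files]
    data = np.zeros((len(files), 1024))
    for index, name in enumerate(files):
        with open(os.path.join(path, name)) as f:
            for i, line in enumerate(f):
                for j in range(32):
                    data[index, 32 * i + j] = int(line[j])
    return data, labels

# demo on a synthetic directory with two fake digit files
tmp = tempfile.mkdtemp()
for name, ch in [('0_0.txt', '0'), ('1_0.txt', '1')]:
    with open(os.path.join(tmp, name), 'w') as f:
        f.write('\n'.join(ch * 32 for _ in range(32)))
data, labels = load_digits_dir(tmp)
print(data.shape, labels)  # (2, 1024) [0, 1]
```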
3.3. Reading the Test Data
fileList2 = os.listdir('./data/testDigits/')
# define the label list
testIndex = []
# collect the labels
for filename2 in fileList2:
    testIndex.append(int(filename2.split('_')[0]))
# define the data matrix
testData = np.zeros((len(testIndex), 1024))
testData.shape
# (946, 1024)

# fill the data matrix
index = 0
for filename2 in fileList2:
    with open('./data/testDigits/%s' % filename2) as f:
        vect = np.zeros((1, 1024))
        for i in range(32):
            line = f.readline()
            for j in range(32):
                vect[0, 32 * i + j] = int(line[j])
        testData[index, :] = vect
        index += 1
3.4. Modeling the Data
from sklearn.neighbors import KNeighborsClassifier
# k = 3: classify by majority vote among the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
# fit the model on the training data
knn.fit(trainingData, trainingIndex)
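The choice k = 3 is fixed here; a common way to pick k is cross-validation. The sketch below uses scikit-learn's built-in 8x8 digits as a stand-in for the project's 32x32 files, which are not bundled, so it only illustrates the procedure:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# score several candidate k values with 5-fold cross-validation
X, y = load_digits(return_X_y=True)
scores = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

# pick the k with the best mean accuracy
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```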
3.5. Evaluating the Model
%%time
# predict the test data
predict_data = knn.predict(testData)
# Wall time: 7.8 s

res = 0
for i in range(len(testIndex)):
    if testIndex[i] == predict_data[i]:
        res += 1
print('recognition accuracy: ' + '%0.3f' % (res / len(testIndex) * 100) + '%')
# recognition accuracy: 98.626%
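The counting loop is equivalent to a vectorized comparison, or to `sklearn.metrics.accuracy_score`. The toy labels below are illustrative, not the project's results:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# toy true labels and predictions: 4 of 5 match
testIndex = [3, 1, 4, 1, 5]
predict_data = np.array([3, 1, 4, 0, 5])

# vectorized equivalent of the manual counting loop
acc = np.mean(np.array(testIndex) == predict_data)
print('%0.3f%%' % (acc * 100))                  # 80.000%
print(accuracy_score(testIndex, predict_data))  # 0.8
```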