KNN (k-Nearest Neighbor): Recognizing Handwritten Digits

KNN in Practice: Digit Recognition

1. Introduction

The k-nearest neighbor method (k-NN), proposed by Cover and Hart in 1967, is a fundamental method for classification and regression. Its working principle: we start with a sample data set, known as the training set, in which every sample carries a label, so we know the correspondence between each sample and its category. When new, unlabeled data arrives, we compare its features against the features of every sample in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general, we select only the k most similar samples from the training set, which is where the k in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the class that appears most often among those k most similar samples is chosen as the class of the new data.
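The voting procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in the post (which relies on scikit-learn); `knn_classify`, `train`, and `labels` are hypothetical names chosen for the example.

```python
import numpy as np
from collections import Counter

def knn_classify(sample, train, labels, k=3):
    """Classify one sample by majority vote among its k nearest neighbors."""
    # Euclidean distance from the sample to every training row
    dists = np.sqrt(((train - sample) ** 2).sum(axis=1))
    # indices of the k closest training rows
    nearest = np.argsort(dists)[:k]
    # majority vote over their labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# tiny illustration: two clusters labeled 0 and 1
train = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]])
labels = [0, 0, 1, 1]
print(knn_classify(np.array([0.2, 0.1]), train, labels, k=3))  # -> 0
```

The query point is closest to the two samples labeled 0, so with k = 3 the vote is 2-to-1 in favor of class 0.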

 

2. The Dataset

Each sample is a 32x32 image stored in text format: 32 lines of 32 characters, each character a pixel value (0 or 1). The digit label is encoded in the filename before the underscore.
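The filename convention is inferred from the loading code below: the label precedes the underscore. A small sketch with a hypothetical filename `3_45.txt` (the digit 3, sample number 45):

```python
# infer the label from a filename of the assumed form "<label>_<index>.txt"
def label_from_filename(filename):
    return int(filename.split('_')[0])

print(label_from_filename('3_45.txt'))  # -> 3
```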

 

 

3. Code Implementation

3.1 Imports

 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os

 

 

3.2 Reading the Training Data

# get the list of data files
fileList = os.listdir('./data/trainingDigits/')

# define the label list
trainingIndex = []

# extract the label from each filename (the part before the underscore)
for filename in fileList:
    trainingIndex.append(int(filename.split('_')[0]))

# define the data matrix: one 1024-element row per 32x32 image
trainingData = np.zeros((len(trainingIndex), 1024))
trainingData.shape
# (3868, 1024)

# fill the data matrix
index = 0
for filename in fileList:
    with open('./data/trainingDigits/%s' % filename, 'r') as f:

        # an empty 1x1024 row vector for this image
        vect = np.zeros((1, 1024))

        # loop over the 32 lines of the file
        for i in range(32):
            # read one line
            line = f.readline()

            # each character line[j] is one pixel value
            for j in range(32):
                vect[0, 32 * i + j] = int(line[j])

        trainingData[index, :] = vect
        index += 1
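The per-file loop above can be factored into a reusable helper shared by the training and test loaders. This is a refactoring sketch; `img2vector` is a name introduced here, not one from the post.

```python
import numpy as np

def img2vector(path):
    """Flatten one 32x32 text image into a (1, 1024) row vector."""
    vect = np.zeros((1, 1024))
    with open(path, 'r') as f:
        # 32 lines of 32 characters, one character per pixel
        for i in range(32):
            line = f.readline()
            for j in range(32):
                vect[0, 32 * i + j] = int(line[j])
    return vect
```

With this helper, the body of the loop reduces to `trainingData[index, :] = img2vector('./data/trainingDigits/%s' % filename)`.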

 

 

3.3 Reading the Test Data

fileList2 = os.listdir('./data/testDigits/')

# define the label list
testIndex = []

# extract the labels
for filename2 in fileList2:
    testIndex.append(int(filename2.split('_')[0]))

# define the data matrix
testData = np.zeros((len(testIndex), 1024))
testData.shape
# (946, 1024)

# fill the data matrix
index = 0
for filename2 in fileList2:
    with open('./data/testDigits/%s' % filename2, 'r') as f:

        # an empty 1x1024 row vector for this image
        vect = np.zeros((1, 1024))

        # loop over the 32 lines of the file
        for i in range(32):
            line = f.readline()

            # each character line[j] is one pixel value
            for j in range(32):
                vect[0, 32 * i + j] = int(line[j])

        testData[index, :] = vect
        index += 1

 

3.5 Modeling

from sklearn.neighbors import KNeighborsClassifier

# set k to 3, i.e. classify by a vote among the 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)

# fit the training data
knn.fit(trainingData, trainingIndex)
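The post fixes k = 3; a quick way to sanity-check that choice is cross-validation over several values of k. This sketch uses synthetic blobs so it runs standalone; with the post's data you would pass `trainingData` and `trainingIndex` instead of `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the digit data: two well-separated 2-D clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# compare mean 5-fold cross-validation accuracy for several k
for k in (1, 3, 5, 7):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())
```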

 

3.6 Evaluation

%%time
# predict on the test data
predict_data = knn.predict(testData)

# Wall time: 7.8 s

res = 0
for i in range(len(testIndex)):
    if testIndex[i] == predict_data[i]:
        res += 1
print("recognition accuracy: " + '%0.3f' % (res / len(testIndex) * 100) + "%")
# recognition accuracy: 98.626%
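The manual counting loop above is equivalent to scikit-learn's `accuracy_score`. A self-contained sketch with toy label arrays standing in for `testIndex` (true labels) and `predict_data` (predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# toy stand-ins for testIndex and predict_data: one mistake out of six
y_true = np.array([0, 1, 2, 3, 4, 5])
y_pred = np.array([0, 1, 2, 3, 4, 0])

print('recognition accuracy: %0.3f%%' % (accuracy_score(y_true, y_pred) * 100))
# -> recognition accuracy: 83.333%
```

With the post's arrays this becomes `accuracy_score(testIndex, predict_data)`.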

 

Origin www.cnblogs.com/blogscc/p/11518697.html