Code from "Machine Learning in Action": https://github.com/wzy6642/Machine-Learning-in-Action-Python3
K-nearest neighbor (KNN)
Introduction
Briefly, the k-nearest neighbor algorithm classifies a sample by measuring the distance between its feature values and those of samples whose classes are already known.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages: high computational complexity, high space complexity, and it gives no insight into the underlying structure of the data.
Applicable data types: numeric and nominal.
Classification function pseudocode:
For each point in the data set whose class is unknown, perform the following operations in turn:
(1) calculate the distance between each point in the known-class data set and the current point;
(2) sort the points in ascending order of distance;
(3) select the k points closest to the current point;
(4) determine how often each class occurs among these k points;
(5) return the class that occurs most frequently among the k points as the predicted class of the current point.
import operator
import numpy as np

"""
Create the data set
Returns:
    group - the data set
    labels - the class labels
"""
def createDataSet():
    # four samples, each with two features
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # labels of the four samples
    labels = ['love stories', 'love stories', 'action movie', 'action movie']
    return group, labels


"""
KNN algorithm, classifier
Parameters:
    inX - data to classify (test set)
    dataSet - training data (training set), one sample per row
    labels - class labels (vector of length n)
    k - KNN parameter: the number of nearest points to select
Returns:
    sortedClassCount[0][0] - classification result
"""
def classify0(inX, dataSet, labels, k):
    # shape[0] returns the number of rows of dataSet
    dataSetSize = dataSet.shape[0]
    # repeat inX dataSetSize times along the rows, then subtract dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # square the feature differences
    sqDiffMat = diffMat ** 2
    # sum() adds all elements, sum(0) adds down the columns, sum(1) adds across the rows
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root to get the distances
    distances = sqDistances ** 0.5
    # argsort returns the indices that sort distances in ascending order
    sortedDistIndicies = distances.argsort()
    # dictionary recording how often each class occurs
    classCount = {}
    # take the k points with the smallest distances
    for i in range(k):
        # class of the i-th nearest point
        voteIlabel = labels[sortedDistIndicies[i]]
        # dict.get() returns the value for the given key, or 0 if the key is absent
        # count the occurrences of each class
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the dictionary items in descending order;
    # operator.itemgetter(1) sorts by value, operator.itemgetter(0) by key
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the most frequent class, i.e. the predicted class
    return sortedClassCount[0][0]

# test
group, labels = createDataSet()
classify0([0, 0], group, labels, 3)  # Output: 'love stories'
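To see where the test result comes from, the five pseudocode steps can be traced by hand on the same four sample points. The following is a minimal sketch that uses only the Python standard library instead of numpy (math.dist needs Python 3.8 or later). The names ranked, votes and prediction are illustrative and are not part of the original code.

```python
import math
from collections import Counter

# Sample points and labels copied from the data set above.
group = [(1, 101), (5, 89), (108, 5), (115, 8)]
labels = ['love stories', 'love stories', 'action movie', 'action movie']
query = (0, 0)
k = 3

# Steps (1)-(2): Euclidean distance from the query to every known point,
# sorted in ascending order.
ranked = sorted((math.dist(query, point), label)
                for point, label in zip(group, labels))
for distance, label in ranked:
    print(f"{distance:8.2f}  {label}")

# Steps (3)-(5): count the classes of the k nearest points and
# return the most frequent one.
votes = Counter(label for _, label in ranked[:k])
prediction = votes.most_common(1)[0][0]
print(prediction)  # love stories
```

The two nearest points both carry the label 'love stories' and only the third nearest is an 'action movie', so with k = 3 the vote is two to one.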
Hands-on: handwritten digit recognition system
Only the digits 0 through 9 are recognized here. Each image is a 32 * 32 pixel black-and-white image that has already been converted to text format.
The image is first formatted as a vector: the 32 * 32 binary image matrix is converted into a 1 * 1024 vector.
import numpy as np

"""
Convert a 32 * 32 binary image into a 1 * 1024 vector
Parameters:
    filename - file name
Returns:
    returnVect - the 1 * 1024 vector of the binary image
"""

def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    with open(filename) as fr:
        # read the file row by row
        for i in range(32):
            # read one line of data
            lineStr = fr.readline()
            # store the first 32 characters of each row in returnVect in order
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    # return the converted 1 * 1024 vector
    return returnVect

# test
testVector = img2vector('testDigits/0_13.txt')
testVector[0, 0:31]
# Output: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
#                1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
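The index mapping used in img2vector (row i, column j goes to position 32 * i + j) can be checked without any data files. The following is a small sketch under that assumption: rows_to_vector is a hypothetical helper that takes the 32 text rows directly instead of a file name, and the image is made up for illustration.

```python
def rows_to_vector(rows):
    """Flatten 32 text rows of 32 '0'/'1' characters into a list of 1024 ints."""
    vector = [0] * 1024
    for i in range(32):
        for j in range(32):
            # row-major layout: row i starts at offset 32 * i
            vector[32 * i + j] = int(rows[i][j])
    return vector

# Made-up image: all zeros except a single 1 at row 2, column 5.
rows = ['0' * 32 for _ in range(32)]
rows[2] = '0' * 5 + '1' + '0' * 26
vec = rows_to_vector(rows)
print(len(vec))      # 1024
print(vec.index(1))  # 69, i.e. 32 * 2 + 5
```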
Test algorithm
from os import listdir
import numpy as np
# assumed import: the original listing does not show where KNN comes from;
# its parameters and methods match scikit-learn's KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN

"""
Handwritten digit classification test
Parameters:
    None
Returns:
    None
"""
def handwritingClassTest():
    # labels of the training set
    hwLabels = []
    # list the file names in the trainingDigits directory
    trainingFilesList = listdir('trainingDigits')
    # number of files in the directory
    m = len(trainingFilesList)
    # initialize the training matrix (all zeros)
    trainingMat = np.zeros((m, 1024))
    # parse the class of each training sample from its file name
    for i in range(m):
        # get the file name
        fileNameStr = trainingFilesList[i]
        # get the digit class
        classNumber = int(fileNameStr.split('_')[0])
        # append the class to hwLabels
        hwLabels.append(classNumber)
        # store the 1 * 1024 data of each file in trainingMat
        trainingMat[i, :] = img2vector('trainingDigits/%s' % (fileNameStr))
    # build the KNN classifier
    neigh = KNN(n_neighbors=3, algorithm='auto')
    # fit the model: trainingMat is the training matrix, hwLabels the corresponding labels
    neigh.fit(trainingMat, hwLabels)
    # list the file names in the testDigits directory
    testFileList = listdir('testDigits')
    # error counter
    errorCount = 0.0
    # number of test samples
    mTest = len(testFileList)
    # parse the class of each test sample from its file name and classify it
    for i in range(mTest):
        # get the file name
        fileNameStr = testFileList[i]
        # get the digit class
        classNumber = int(fileNameStr.split('_')[0])
        # get the 1 * 1024 vector of the test sample
        vectorUnderTest = img2vector('testDigits/%s' % (fileNameStr))
        # get the prediction (predict returns an array; take its single element)
        classifierResult = neigh.predict(vectorUnderTest)[0]
        print("classifier returned %d\ttrue result is %d" % (classifierResult, classNumber))
        if classifierResult != classNumber:
            errorCount += 1.0
    print("total errors: %d\nerror rate: %f%%" % (errorCount, errorCount / mTest * 100))
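Two small pieces of the test routine are easy to check in isolation: reading the true digit from a file name such as 0_13.txt, and turning the error count into a percentage. The following is a minimal sketch of both. The helpers label_from_filename and error_rate are hypothetical names, and the file names and predictions are made up for illustration rather than taken from a real run.

```python
def label_from_filename(file_name):
    """Return the digit encoded before the underscore, e.g. '0_13.txt' -> 0."""
    return int(file_name.split('_')[0])

def error_rate(predicted, actual):
    """Percentage of predictions that differ from the true labels."""
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual) * 100

# Made-up file names and predictions, for illustration only.
file_names = ['0_13.txt', '3_7.txt', '9_21.txt', '9_40.txt']
actual = [label_from_filename(name) for name in file_names]
predicted = [0, 3, 9, 7]

print(actual)                         # [0, 3, 9, 9]
print(error_rate(predicted, actual))  # 25.0
```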