"Machine Learning in Action": KNN

Code from "Machine Learning in Action": https://github.com/wzy6642/Machine-Learning-in-Action-Python3

K-nearest neighbors (KNN)

Introduction

In short, the k-nearest neighbor algorithm classifies a sample by measuring the distances between its feature values and those of samples whose class is already known.

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.

Disadvantages: high computational complexity, high space complexity, and it cannot reveal the underlying meaning of the data.

Applicable data types: numeric and nominal.

Classification function pseudocode:

  For each point in the data set whose class is unknown, perform the following steps in order:

  (1) calculate the distance between every point in the known-class data set and the current point;

  (2) sort the points in ascending order of distance;

  (3) select the k points closest to the current point;

  (4) count how often each class occurs among those k points;

  (5) return the most frequent class among those k points as the predicted class of the current point.
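Step (1) does not say which distance is meant. The code in this post squares the feature differences, sums them and takes the square root, which is the Euclidean distance. For two points x and y with n features (the symbols x, y and n are introduced here only for illustration) it can be written as:

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```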

import operator

import numpy as np


def createDataSet():
    """Create the data set.

    Returns:
        group - the data set
        labels - the class labels
    """
    # four samples, each with two features
    group = np.array([[1, 101], [5, 89], [108, 5], [115, 8]])
    # class labels of the four samples
    labels = ['love stories', 'love stories', 'action movie', 'action movie']
    return group, labels


def classify0(inX, dataSet, labels, k):
    """KNN classifier.

    Parameters:
        inX - the data to classify (test sample)
        dataSet - the training data (training set), one sample per row
        labels - the class labels, one per training sample
        k - the KNN parameter: how many nearest points to use
    Returns:
        sortedClassCount[0][0] - the classification result
    """
    # shape[0] is the number of rows (samples) in dataSet
    dataSetSize = dataSet.shape[0]
    # repeat inX dataSetSize times along the rows, then subtract the training data
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # square the feature differences
    sqDiffMat = diffMat ** 2
    # sum() adds all elements, sum(0) adds down the columns, sum(1) adds along the rows
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root to get the distances
    distances = sqDistances ** 0.5
    # argsort() returns the indices that sort distances in ascending order
    sortedDistIndicies = distances.argsort()
    # dictionary that records how many times each class occurs
    classCount = {}
    # look at the k nearest points
    for i in range(k):
        # class of the i-th nearest point
        voteIlabel = labels[sortedDistIndicies[i]]
        # dict.get() returns the value for the key, or 0 if the key is absent,
        # so this counts the occurrences of each class
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # sort the (class, count) pairs by count in descending order;
    # operator.itemgetter(1) sorts by value, itemgetter(0) would sort by key
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # return the most frequent class, i.e. the predicted class
    return sortedClassCount[0][0]


# test
group, labels = createDataSet()
classify0([0, 0], group, labels, 3)  # Output: 'love stories'
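To see why the query point [0, 0] ends up in the love-story class, the distances can be checked by hand. The sketch below repeats the calculation for the four sample points using only the standard library (math.dist needs Python 3.8 or later). It is an independent check rather than the book's implementation, and the names points, query, nearest and votes are introduced here for illustration.

```python
import math
from collections import Counter

# The four sample points and their labels from the example above.
points = [(1, 101), (5, 89), (108, 5), (115, 8)]
labels = ['love stories', 'love stories', 'action movie', 'action movie']
query = (0, 0)

# Euclidean distance from the query to every sample point.
distances = [math.dist(query, p) for p in points]

# Indices of the three nearest points, closest first.
nearest = sorted(range(len(points)), key=lambda i: distances[i])[:3]

# Majority vote among the labels of the three nearest points.
votes = Counter(labels[i] for i in nearest)
prediction = votes.most_common(1)[0][0]

for p, d in zip(points, distances):
    print(p, round(d, 2))
print(nearest, prediction)
```

The three nearest points lie at roughly 89.14, 101.00 and 108.12 from the query. Two of them are love stories and one is an action movie, so the vote with k = 3 goes to 'love stories'.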

Practice: a handwritten digit recognition system

Here only the digits 0 through 9 are recognized. Each image is a 32 * 32 monochrome image that has already been converted to text format.

To process the images, each one is formatted as a vector: the 32 * 32 binary image matrix is converted to a 1 * 1024 vector.

import numpy as np


def img2vector(filename):
    """Convert a 32 * 32 binary image to a 1 * 1024 vector.

    Parameters:
        filename - the file name
    Returns:
        returnVect - the 1 * 1024 vector of the binary image
    """
    returnVect = np.zeros((1, 1024))
    # the context manager closes the file once it has been read
    with open(filename) as fr:
        # read the file row by row
        for i in range(32):
            # read one line of data
            lineStr = fr.readline()
            # store the first 32 characters of each row in returnVect, in order
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    # return the converted 1 * 1024 vector
    return returnVect


# test
testVector = img2vector('testDigits/0_13.txt')
testVector[0, 0:31]
# Output: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.,
#         1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
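The flattening rule used here, where pixel (i, j) of the image is stored at position 32 * i + j of the vector, is easier to follow on a smaller grid. The following is a minimal sketch that assumes a made-up 4 * 4 pattern instead of a real digit file. The names WIDTH, rows and flat are illustrative and do not come from the original code.

```python
# A tiny 4 * 4 stand-in for a 32 * 32 digit file: each string is one text row.
WIDTH = 4
rows = ["0110", "1001", "1001", "0110"]

# Flatten row by row: pixel (i, j) lands at index WIDTH * i + j.
flat = [0] * (WIDTH * len(rows))
for i, line in enumerate(rows):
    for j in range(WIDTH):
        flat[WIDTH * i + j] = int(line[j])

# divmod recovers the (row, column) position from a flat index.
row, col = divmod(9, WIDTH)

print(flat)
print(row, col, rows[row][col])
```

With WIDTH set to 32 and 32 rows, the same loop fills all 1024 positions, which is why the vector has 1024 entries.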

Testing the algorithm

from os import listdir

import numpy as np
# assumption: KNN is scikit-learn's KNeighborsClassifier (the import is not shown in the post)
from sklearn.neighbors import KNeighborsClassifier as KNN


def handwritingClassTest():
    """Test the handwritten digit classifier.

    Parameters:
        None
    Returns:
        None
    """
    # labels of the training set
    hwLabels = []
    # list the file names in the trainingDigits directory
    trainingFilesList = listdir('trainingDigits')
    # number of files in the directory
    m = len(trainingFilesList)
    # initialize the training matrix with zeros
    trainingMat = np.zeros((m, 1024))
    # parse the class of each training sample from its file name
    for i in range(m):
        # get the file name
        fileNameStr = trainingFilesList[i]
        # get the digit class
        classNumber = int(fileNameStr.split('_')[0])
        # add the class to hwLabels
        hwLabels.append(classNumber)
        # store the 1 * 1024 data of each file in trainingMat
        trainingMat[i, :] = img2vector('trainingDigits/%s' % (fileNameStr))
    # build the KNN classifier
    neigh = KNN(n_neighbors=3, algorithm='auto')
    # fit the model: trainingMat is the training matrix, hwLabels the matching labels
    neigh.fit(trainingMat, hwLabels)
    # list the file names in the testDigits directory
    testFileList = listdir('testDigits')
    # error counter
    errorCount = 0.0
    # number of test samples
    mTest = len(testFileList)
    # parse the class of each test sample from its file name and classify it
    for i in range(mTest):
        # get the file name
        fileNameStr = testFileList[i]
        # get the digit class
        classNumber = int(fileNameStr.split('_')[0])
        # get the 1 * 1024 vector of the test sample
        vectorUnderTest = img2vector('testDigits/%s' % (fileNameStr))
        # get the prediction; predict() returns an array, so take its only element
        classifierResult = neigh.predict(vectorUnderTest)[0]
        print("classifier returned %d\ttrue result is %d" % (classifierResult, classNumber))
        if classifierResult != classNumber:
            errorCount += 1.0
    print("total errors: %d\nerror rate: %f%%" % (errorCount, errorCount / mTest * 100))
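The test relies on two small pieces of bookkeeping: the true digit is read from the part of the file name before the underscore, and the error rate is the share of predictions that differ from the true digits. The sketch below isolates both steps so they can be tried without the digit files. The helper names label_from_filename and error_rate are hypothetical, and the file names and predictions are made up purely to illustrate the arithmetic.

```python
def label_from_filename(file_name):
    """Return the digit encoded before the underscore, e.g. '0_13.txt' -> 0."""
    return int(file_name.split('_')[0])


def error_rate(predicted, actual):
    """Return the percentage of predictions that differ from the true labels."""
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual) * 100


# Made-up file names and predictions, used only to illustrate the arithmetic.
file_names = ['0_13.txt', '3_7.txt', '9_45.txt', '9_2.txt']
actual = [label_from_filename(name) for name in file_names]
predicted = [0, 3, 9, 7]  # hypothetical classifier output

print(actual)
print(error_rate(predicted, actual))
```

Here one of the four made-up predictions is wrong, so the error rate is 25 percent.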

Origin www.cnblogs.com/harbin-ho/p/12026276.html