KNN algorithm practice 3

The earlier posts in this KNN series are here:

KNN Algorithm Practice 1
KNN Algorithm Practice 2
Objectives:

1. Master the key argument of the sorted function
sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
2. Master the usage of os.listdir()
trainingFileList = os.listdir('digits/trainingDigits' )
3. Understand the shortcomings of kNN
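A minimal sketch of objective 1, using a hand-made vote dict shaped like the one classify0 builds below (the labels and counts here are made up for illustration):

```python
import operator

# Hypothetical vote counts, shaped like the classCount dict in classify0
classCount = {'A': 3, 'B': 1, 'C': 2}

# operator.itemgetter(1) picks the count out of each (label, count) pair,
# so sorted(...) orders the pairs by vote count;
# reverse=True puts the most-voted label first
sortedClassCount = sorted(classCount.items(),
                          key=operator.itemgetter(1), reverse=True)
print(sortedClassCount)        # [('A', 3), ('C', 2), ('B', 1)]
print(sortedClassCount[0][0])  # A -- the majority label
```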

Handwriting recognition system

1. The system built here can only recognize the digits 0-9, and every image has already been preprocessed to the same 32x32 size. If they were 320x320, my laptop would probably cough up blood! It seems today's machine learning still cannot casually train on arbitrary images.
2. For convenience, each image has been converted to text format: the 32x32 grid is flattened row by row into a 1x1024 vector.
3. The data files ship with the source code of 机器学习实践 (Machine Learning in Action), which you can find via Baidu; I'm not hosting them here.
Alright, let's get started

import numpy as np
import operator
import os

# 1. Convert a 32x32 text image into a 1x1024 vector
def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        # read one line at a time
        lineStr = fr.readline()
        # write every character of the line into the vector
        for j in range(32):
            returnVect[0, 32*i + j] = int(lineStr[j])
    fr.close()
    return returnVect
# Let's test it first
testVector = img2vector('digits/testDigits/0_13.txt')
print(testVector[0, 0:31])
# >> [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0.
#     0. 0. 0. 0. 0. 0. 0.]
# Looks like the file has been read correctly
# This is the classifier
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # first dimension = number of samples
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # argsort() returns the indices that would sort the distances ascending
    sortedDistIndicies = distances.argsort()
    # tally the votes in a dict
    classCount = {}
    for i in range(k):
        # labels of the k nearest neighbours
        voteIlabel = labels[sortedDistIndicies[i]]
        # dict.get(key, default): returns 0 if the label has not been seen yet,
        # otherwise its current count; we then add 1,
        # ending up with something like {'A': 2, 'B': 1}
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
# 2. Test the classifier
def handwritingClassTest():
    hwLabels = []
    # os.listdir lists the file names in a directory,
    # e.g. '4_135.txt', '4_136.txt', '4_137.txt';
    # the digit before the underscore is the label we extract
    trainingFileList = os.listdir('digits/trainingDigits')
    m = len(trainingFileList)  # 1934 training samples here
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]       # e.g. '4_139'
        classNumStr = int(fileStr.split('_')[0])  # the digit 0-9
        hwLabels.append(classNumStr)              # label done
        trainingMat[i, :] = img2vector('digits/trainingDigits/%s' % fileNameStr)

    testFileList = os.listdir('digits/testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print('the classifier came back with: %d, the real answer is: %d'
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print('\n the total number of errors is:%d' % errorCount)
    print('\n the total error rates is %f' % (errorCount / float(mTest)))


# Done writing; let's verify it
handwritingClassTest()
'''
 the total number of errors is:10

 the total error rates is 0.010571
'''
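The classifier itself does not depend on the digits files, so we can sanity-check the same voting logic on a tiny synthetic data set (the toy points and labels below are made up for illustration; classify0 is repeated here so the snippet runs on its own):

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    # Euclidean distance from inX to every row of dataSet
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    distances = ((diffMat**2).sum(axis=1))**0.5
    sortedDistIndicies = distances.argsort()
    # majority vote among the k nearest neighbours
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    return sorted(classCount.items(),
                  key=operator.itemgetter(1), reverse=True)[0][0]

# Two toy clusters: 'A' near the origin, 'B' near (1, 1)
group = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.0]])
labels = ['A', 'A', 'B', 'B']
print(classify0(np.array([0.2, 0.1]), group, labels, 3))  # A
print(classify0(np.array([0.8, 0.9]), group, labels, 3))  # B
```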

Now the drawbacks are obvious: kNN is essentially a brute-force method. There is no training process at all, only a direct comparison of the input against every stored sample, which is unacceptable in practice and forces us to abandon this approach. And if we later run it on 50,000 images, we will find the accuracy is very low (around 29%). For images, deep learning is really the only way forward...
Well, that's all for today. We've covered KNN quite thoroughly; in the next section, let's learn about decision trees!!
Come on! You're the best!!!
