"Machine Learning in Practice" Chapter 2 K-Nearest Neighbor Algorithm Learning Summary

2.1 Algorithm overview

The k-nearest neighbor (kNN) algorithm classifies data by measuring the distance between feature values.

How the algorithm works: we start with a sample data set (the training set) in which every record, a row vector of feature values, carries a label, so the category of each record is already known. When a new, unlabeled record is input, each of its features is compared with the corresponding feature of the records in the training set, and the algorithm takes the class labels of the most similar records (the nearest neighbors). Usually only the k most similar records are considered, where k is typically an integer no greater than 20. The class that occurs most often among those k records becomes the predicted class of the new record.

Pros: high accuracy, insensitive to outliers, no assumptions about the input data
Cons: high computational complexity, high space complexity
Applicable data types: numeric and nominal (limited sample set)


Example: take movie genre classification. In the training set, the features (the number of fight scenes and the number of kissing scenes in each movie) and the labels (the movie type: romance or action) are known.

Given a test sample, compute the distance between its feature values and those of every sample in the training set. This gives the distance between each known movie and the unknown movie. Sort the known movies by ascending distance, take the first k movies (here k=3), and use the types of those movies to decide the type of the unknown movie.
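
The movie example can be sketched in a few lines of plain Python. The scene counts below are illustrative values chosen for this sketch, not data given in the text, and math.dist assumes Python 3.8 or later.

```python
import math
from collections import Counter

# Hypothetical training data: (fight scenes, kissing scenes) -> movie type.
training = [
    ((3, 104), "romance"),
    ((2, 100), "romance"),
    ((1, 81), "romance"),
    ((101, 10), "action"),
    ((99, 5), "action"),
    ((98, 2), "action"),
]
unknown = (18, 90)  # hypothetical movie whose type is not known
k = 3

# Distance from the unknown movie to every known movie, in ascending order.
ranked = sorted((math.dist(features, unknown), genre) for features, genre in training)

# Majority vote among the k nearest movies.
votes = Counter(genre for _, genre in ranked[:k])
predicted = votes.most_common(1)[0][0]
print(predicted)  # romance
```

All three nearest movies are romances here, so the vote is unanimous; with mixed neighbors the most frequent type wins.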


The general flow of the k-nearest neighbor algorithm:

(1) Data collection: any method can be used
(2) Data preparation: numeric values are required for the distance calculation, preferably in a structured data format
(3) Data analysis: any method can be used
(4) Training the algorithm: this step does not apply to the k-nearest neighbor algorithm
(5) Testing the algorithm: calculate the error rate
(6) Using the algorithm: first input sample data and obtain structured output, then run the k-nearest neighbor algorithm to determine which category the input data belongs to, and finally carry out follow-up processing on the computed classification

2.1.1 Preparation: importing data with Python

Create a file named KNN.py that builds a small training sample set; the file can later be imported and called as a module.

from numpy import *
import operator

def creatDataSet():
    group=array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels=['A','A','B','B']
    return group,labels

2.1.2 Implementing the kNN classification algorithm

K-Nearest Neighbor Algorithm Pseudocode:

For each point in the data set whose class is unknown, do the following:
(1) Calculate the distance between every point in the labeled data set and the current point
(2) Sort the distances in increasing order
(3) Select the k points closest to the current point
(4) Count how often each class occurs among those k points
(5) Return the most frequent class among the k points as the predicted class of the current point

Program Listing 2-1 k-Nearest Neighbor Algorithm

The function takes four input parameters: inX, the input vector (test sample) to classify; dataSet, the training sample set; labels, the label vector; and k, the number of nearest neighbors to use. The label vector must have as many elements as the training sample set has rows.

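def classify0(inX,dataSet,labels,k):
    # Compute the Euclidean distance from inX to every training sample
    datasetSize=dataSet.shape[0]
    diffMat=tile(inX,(datasetSize,1))-dataSet
    sqDiffMat=diffMat**2
    sqDistances=sqDiffMat.sum(axis=1)
    distances=sqDistances**0.5

    # Indices of the training samples, ordered from nearest to farthest
    sortedDistIndicies=distances.argsort()
    classCount={}

    # Count the labels of the k nearest points
    for i in range(k):
        voteIlabel=labels[sortedDistIndicies[i]]
        classCount[voteIlabel]=classCount.get(voteIlabel,0)+1
    # Sort the (label, count) pairs by count, largest first
    # (Python 3: dict.items() replaces the Python 2 iteritems())
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    # The whole sorted list is returned; sortedClassCount[0][0] is the predicted label
    return sortedClassCount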

The Euclidean distance between two points xA and xB, shown here for two features, is calculated as:

$d = \sqrt{(xA_0 - xB_0)^2 + (xA_1 - xB_1)^2}$

After the distances to all points have been calculated, they are sorted in ascending order and the majority class among the k closest elements is determined; the input k must always be a positive integer. Finally, the classCount dictionary is turned into a list of (label, count) tuples, which is sorted by the second element of each tuple using the itemgetter method of the operator module imported on the second line of the program. The sort is in reverse order, from largest to smallest, so the first tuple holds the most frequent label. Note that the function as listed returns the whole sorted list, so callers read the predicted label as result[0][0].
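
As a quick check of the two steps described here, the sketch below evaluates the Euclidean distance for two made-up pairs of points and then sorts a made-up vote dictionary the same way classify0 does. The helper name euclidean and all sample values are assumptions for illustration only.

```python
import math
import operator

def euclidean(a, b):
    """Square root of the summed squared differences, coordinate by coordinate."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d2 = euclidean((0, 0), (1, 2))              # sqrt(1 + 4)
d4 = euclidean((1, 0, 0, 1), (7, 6, 9, 4))  # sqrt(36 + 36 + 81 + 9)
print(d2, d4)

# Hypothetical vote counts, sorted by count from largest to smallest.
class_count = {"A": 1, "B": 2}
ranked_votes = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)
print(ranked_votes)        # [('B', 2), ('A', 1)]
print(ranked_votes[0][0])  # B
```

The same formula extends to any number of features: each extra feature adds one squared difference under the square root.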

2.1.3 How to test the classifier

The classifier can be tested with data whose classes are already known. Given enough test data, we can compute the classifier's error rate: the number of wrong results divided by the total number of tests. The error rate measures how well a classifier performs on a particular data set.
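
The error rate is simple enough to compute by hand; the following sketch shows one way to do it. The function name error_rate and the two label lists are hypothetical.

```python
def error_rate(predicted, actual):
    """Fraction of predictions that differ from the known labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one test sample is required")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

predicted = ["A", "B", "B", "A", "B"]  # hypothetical classifier output
actual = ["A", "B", "A", "A", "B"]     # hypothetical known labels
print(error_rate(predicted, actual))   # 0.2
```

An error rate of 0 means every test sample was classified correctly, and 1.0 means none were.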

2.2 Example: Using the k-Nearest Neighbor Algorithm to Improve Matches on a Dating Site

Example: using the k-nearest neighbor algorithm on a dating site:

(1) Data collection: a text file is provided.
(2) Data preparation: parse the text file with Python.
(3) Data analysis: draw two-dimensional scatter plots with Matplotlib.
(4) Training the algorithm: this step does not apply to the k-nearest neighbor algorithm.
(5) Testing the algorithm: use part of the data provided by Helen as test samples.
   Test samples are samples whose class is already known. If the predicted class differs from the actual class, the prediction is counted as an error.
(6) Using the algorithm: build a simple command-line program in which Helen enters a few feature values to judge whether a person is the type she likes.

2.2.1 Preparing data: parsing data from text files

Listing 2-2 Parser that converts text records into NumPy arrays

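def file2matrix(filename):
    with open(filename) as fr:
        arrayOLines=fr.readlines()# read all lines of the file into a list
    numberOfLines=len(arrayOLines)# number of lines in the file
    returnMat=zeros((numberOfLines,3))# NumPy array to return
    classLabeVector=[]
    index=0
    # Parse each line
    for line in arrayOLines:
        line =line.strip()# strip leading and trailing whitespace, including the newline
        listFormLine=line.split('\t')# split the line on tabs into a list of strings
        # The first three fields are the features: frequent flier miles earned per year,
        # percentage of time spent playing video games, liters of ice cream consumed per week
        returnMat[index,:]=listFormLine[0:3]
        # The last field is the class label; it is converted to int as in the complete
        # listing at the end of this post, which assumes the file stores integer labels
        classLabeVector.append(int(listFormLine[-1]))
        index+=1
    return returnMat,classLabeVector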
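
To try the parsing logic without the data file, the sketch below reads two tab-separated records from an in-memory string. It is a plain-Python variant rather than the NumPy version above; the function name parse_records and the record values are made up, and the layout (three features followed by an integer label) is assumed from the description of file2matrix.

```python
import io

def parse_records(lines):
    """Split tab-separated lines into a list of feature rows and a list of labels."""
    features, labels = [], []
    for line in lines:
        fields = line.strip().split("\t")
        if len(fields) < 4:  # skip blank or malformed lines
            continue
        features.append([float(value) for value in fields[:3]])
        labels.append(int(fields[-1]))
    return features, labels

# Hypothetical records: miles, game time percentage, ice cream liters, label.
sample = io.StringIO("40920\t8.326976\t0.953952\t3\n14488\t7.153469\t1.673904\t2\n")
features, labels = parse_records(sample)
print(features[0], labels)
```

Skipping short lines is a choice made for this sketch; file2matrix as listed would fail on a blank line instead.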

2.2.2 Analyzing Data: Creating Scatterplots Using Matplotlib

Use Matplotlib to make a scatter plot of the raw data.

import matplotlib.pyplot as plt

datingDataMat,datingLabels=file2matrix('datingTestSet.txt')
fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
plt.show()

The plot uses columns 2 and 3 of the datingDataMat matrix, which hold the features "percentage of time spent playing video games" and "liters of ice cream consumed per week" respectively.

2.2.3 Preparing data: normalizing feature values

When one feature in the training sample set takes values much larger than the other features, that feature dominates the distance calculation. Therefore, features with different value ranges are usually normalized to the range 0 to 1 or -1 to 1. The following formula maps a feature value of any range to a value between 0 and 1:

newValue = (oldValue - min) / (max - min)

Here min and max are the smallest and largest values of that feature in the data set.

Listing 2-3 Normalizing feature values

def autoNorm(dataSet):
    minVals=dataSet.min(0)# minimum of each column
    maxVals=dataSet.max(0)# maximum of each column
    ranges=maxVals-minVals
    normDataSet=zeros(shape(dataSet))# zero matrix with the same shape as the input
    m=dataSet.shape[0]# number of rows
    # tile(a,(m,n)) repeats a n times along the columns, then repeats that result m times along the rows
    normDataSet=dataSet-tile(minVals,(m,1))
    normDataSet=normDataSet/tile(ranges,(m,1))# element-wise division
    return normDataSet,ranges,minVals
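
The arithmetic that autoNorm performs can be followed on a tiny example. The sketch below is written in plain Python rather than NumPy so each step is visible; the function names, the sample rows, and the guard against a zero range are assumptions of this sketch and are not part of autoNorm.

```python
def min_max_normalize(rows):
    """Scale every column of a list of rows to the range 0..1."""
    columns = list(zip(*rows))
    min_vals = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    normalized = [normalize_sample(row, ranges, min_vals) for row in rows]
    return normalized, ranges, min_vals

def normalize_sample(sample, ranges, min_vals):
    """Apply stored minimums and ranges to one sample (zero range maps to 0.0)."""
    return [(value - lo) / span if span else 0.0
            for value, lo, span in zip(sample, min_vals, ranges)]

rows = [[0.0, 10.0], [50.0, 20.0], [100.0, 40.0]]  # hypothetical feature rows
normalized, ranges, min_vals = min_max_normalize(rows)
print(normalized)

# A new sample must be scaled with the training minimums and ranges.
query = normalize_sample([25.0, 25.0], ranges, min_vals)
print(query)  # [0.25, 0.5]
```

Returning the ranges and minimums alongside the normalized data is what allows a new input to be scaled consistently before it is classified.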

2.2.4 Testing the algorithm: validating the classifier as a full program

The performance of the classifier is tested by the error rate, which is the number of times the classifier gives wrong results divided by the total number of test data.

Listing 2-4 Classifier Test Code for Dating Sites

def datingClassTest():
    hoRatio=0.10# fraction of the data held out for testing
    datingDataMat,datingLabels=file2matrix('datingTestSet.txt')# read the data from the file
    normMat, ranges, minVals = autoNorm(datingDataMat)# normalize the feature values
    m=normMat.shape[0]# number of samples
    numTestVec=int(m*hoRatio)# number of test vectors
    errorCount=0.0
    # Classify each test vector with classify0 against the remaining samples, then print the error rate
    for i in range(numTestVec):
        classifierResult=classify0(normMat[i,:],normMat[numTestVec:m,:],datingLabels[numTestVec:m],3)
        print('the classifier came back with:%d,the real answer is :%d'% (classifierResult[0][0],datingLabels[i]))
        if(classifierResult[0][0]!=datingLabels[i]):
            errorCount+=1.0
    print('the total error rate is :%f'%(errorCount/float(numTestVec)))

The test error rate obtained is 5%. The accuracy of the classifier can be changed by adjusting the variable hoRatio and the value of k passed to classify0 in the function datingClassTest.
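
How hoRatio divides the data can be seen in isolation. The sketch below reproduces only the index arithmetic of datingClassTest; the function name holdout_split and the sample count of 1000 are assumptions for illustration.

```python
def holdout_split(m, ho_ratio):
    """Return (test indices, training indices) for m samples."""
    num_test = int(m * ho_ratio)
    # The first num_test rows are tested; the remaining rows act as the training set.
    return range(num_test), range(num_test, m)

test_idx, train_idx = holdout_split(1000, 0.10)
print(len(test_idx), len(train_idx))  # 100 900
```

Because the test rows are simply the first rows of the file, this split is only representative if the file is not sorted by class.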

2.2.5 Using the Algorithm: Building a Complete Usable System

Listing 2-5 Dating site prediction function

def classifyPerson():
    resultList=['in large doses','in small doses','not at all']
    # Ask for the person's three features
    percentTats=float(input('percentage of time spent playing video games?'))
    ffMiles=float(input('frequent flier miles earned per year?'))
    iceCream=float(input('liters of ice cream consumed per year?'))
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')  # read the data from the file
    normMat,ranges,minVals=autoNorm(datingDataMat)# normalize the feature values
    inArr=array([ffMiles,percentTats,iceCream])# pack the person's features into an array
    classifierResult=classify0((inArr-minVals)/ranges,normMat,datingLabels,3)# normalize the input the same way, then classify it
    print('you will probably like the person:',resultList[classifierResult[0][0]-1])

2.3 Handwriting recognition system

2.3.1 Prepare data: convert images into test vectors

Example: a handwriting recognition system using the k-nearest neighbor algorithm

(1) Data collection: text files are provided.
(2) Data preparation: write the function img2vector() to convert the image format into the vector format used by the classifier classify0().
(3) Data analysis: inspect the data at the Python command prompt to make sure it meets the requirements.
(4) Training the algorithm: this step does not apply to the k-nearest neighbor algorithm.
(5) Testing the algorithm: write a function that uses part of the provided data set as test samples. Test samples are samples whose class is already known. If the predicted class differs from the actual class, the prediction is counted as an error.
(6) Using the algorithm: this step is not completed in this example. If you are interested, you can build a complete application that extracts digits from images and recognizes them. The mail sorting system in the United States is a real-world system of this kind.

The function that converts an image into a vector creates a 1 x 1024 NumPy array, opens the given file, loops over the first 32 lines of the file, stores the first 32 characters of each line in the array, and finally returns the array.

#function6: convert a 32x32 text image into a 1x1024 vector
def img2vector(filename):
    returnVect = zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            linStr=fr.readline()
            for j in range(32):
                returnVect[0,32*i+j]=int(linStr[j])# pixel (row i, column j) goes to position 32*i+j
    return returnVect
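
The flattening step can be tried without any image file. The sketch below builds a made-up 32x32 text image in memory and flattens it into a plain Python list; the function name text_image_to_vector and the bar-shaped image are assumptions of this sketch, and a list is used here in place of the NumPy array.

```python
import io

def text_image_to_vector(lines, size=32):
    """Flatten the first `size` characters of the first `size` lines into one list."""
    vector = []
    for _, line in zip(range(size), lines):
        vector.extend(int(ch) for ch in line[:size])
    return vector

# Hypothetical 32x32 image: a vertical bar of 1s in columns 14 to 17.
row = "0" * 14 + "1" * 4 + "0" * 14
image = io.StringIO("\n".join([row] * 32) + "\n")
vector = text_image_to_vector(image)
print(len(vector), sum(vector))  # 1024 128
```

The pixel at row i and column j ends up at position 32*i+j, the same layout the listing above uses.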

2.3.2 Test Algorithm: Recognize handwritten digits using the k-Nearest Neighbor Algorithm

Listing 2-6 Test code for the handwritten digit recognition system

def handwritingClassTest():
    # requires: from os import listdir
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')# list the training set directory
    m = len(trainingFileList)# number of files in the directory
    trainingMat = zeros((m,1024))
    # Parse the class digit from each file name
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]# split on '.', keep the part before the extension
        classNumStr = int(fileStr.split('_')[0])# split on '_', keep the leading digit
        hwLabels.append(classNumStr)# store the digit parsed from the file name
        trainingMat[i,:] = img2vector('digits/trainingDigits/%s' % fileNameStr)# convert the 32x32 image to a 1x1024 vector and store it
    testFileList = listdir('digits/testDigits')# list the test set directory
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult[0][0], classNumStr))
        if (classifierResult[0][0] != classNumStr):
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))
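
The label of each image comes from its file name, which the test code splits twice. A small sketch of that step follows; the helper name label_from_filename is made up, and the file names are examples in the digit_index.txt pattern the code expects.

```python
def label_from_filename(file_name):
    """Return the digit encoded before the underscore, e.g. '9_45.txt' -> 9."""
    stem = file_name.split(".")[0]  # '9_45'
    return int(stem.split("_")[0])  # 9

print(label_from_filename("9_45.txt"))  # 9
print(label_from_filename("0_13.txt"))  # 0
```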

 

Changing the value of k, changing which samples the function handwritingClassTest selects for training, and changing the number of training samples all affect the error rate of the classifier.
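
One way to observe the effect of k is to measure the error rate for several values on the same data. The sketch below does this on a made-up two-class data set with a leave-one-out check, which is a different evaluation scheme from the train/test directories used above; all names and points are assumptions, and math.dist assumes Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(query, points, labels, k):
    """Majority label among the k training points closest to query."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_error(points, labels, k):
    """Classify each point using all the others and return the error rate."""
    errors = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(point, rest_points, rest_labels, k) != label:
            errors += 1
    return errors / len(points)

# Hypothetical data: one 'B' point sits inside the 'A' cluster.
points = [(1.0, 1.1), (1.0, 1.0), (0.9, 1.0), (1.1, 0.9),
          (0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (0.95, 1.05)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

for k in (1, 3, 5):
    print(k, leave_one_out_error(points, labels, k))
```

On real data the best k is usually found the same way, by comparing error rates on samples the classifier has not seen.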

2.4 Summary

The k-nearest neighbor algorithm is a simple and effective way to classify data.

The algorithm must keep the entire training set, so a large training set requires a large amount of storage. It can also be very time-consuming in practice, because a distance must be computed to every record in the data set. Another drawback is that it gives no information about the underlying structure of the data, so we cannot tell what an average or a typical sample of a class looks like.

Is there a way to reduce the storage and computation overhead? The k-d tree is an optimized variant of the k-nearest neighbor search that can save a great deal of computation.

Complete code:

from numpy import *
import operator
import matplotlib
import matplotlib.pyplot as plt
from os import listdir

def creatDataSet():
    group=array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels=['A','A','B','B']
    return group,labels

#function1: simple kNN classifier
def classify0(inX,dataSet,labels,k):
    # Compute the Euclidean distance from inX to every training sample
    datasetSize=dataSet.shape[0]
    diffMat=tile(inX,(datasetSize,1))-dataSet
    sqDiffMat=diffMat**2
    sqDistances=sqDiffMat.sum(axis=1)
    distances=sqDistances**0.5
    sortedDistIndicies=distances.argsort()
    classCount={}

    # Count the labels of the k nearest points
    for i in range(k):
        voteIlabel=labels[sortedDistIndicies[i]]
        classCount[voteIlabel]=classCount.get(voteIlabel,0)+1

    # Sort the (label, count) pairs by count in descending order
    sortedClassCount=sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount

#function2: parse a text file into a feature matrix and a label list
def file2matrix(filename):
    fr=open(filename)
    arrayOLines=fr.readlines()# read all lines of the file into a list
    numberOfLines=len(arrayOLines)# number of lines in the file
    returnMat=zeros((numberOfLines,3))# NumPy array to return
    classLabeVector=[]
    index=0
    # Parse each line
    for line in arrayOLines:
        line =line.strip()# strip leading and trailing whitespace, including the newline
        listFormLine=line.split('\t')# split the line on tabs into a list of strings
        # The first three fields are the features: frequent flier miles earned per year,
        # percentage of time spent playing video games, liters of ice cream consumed per week
        returnMat[index,:]=listFormLine[0:3]
        classLabeVector.append(int(listFormLine[-1]))# store the label; 1, 2 and 3 mean like, somewhat like and dislike
        index+=1
    return returnMat,classLabeVector

#function3: normalize feature values; every column (feature) is normalized automatically
def autoNorm(dataSet):
    minVals=dataSet.min(0)
    maxVals=dataSet.max(0)
    ranges=maxVals-minVals
    normDataSet=zeros(shape(dataSet))# zero matrix with the same shape as the input
    m=dataSet.shape[0]# number of rows
    # tile(a,(m,n)) repeats a n times along the columns, then repeats that result m times along the rows
    normDataSet=dataSet-tile(minVals,(m,1))
    normDataSet=normDataSet/tile(ranges,(m,1))
    return normDataSet,ranges,minVals

#function4: classifier test code for the dating site data
def datingClassTest():
    hoRatio=0.10
    datingDataMat,datingLabels=file2matrix('datingTestSet.txt')# read the data from the file
    normMat, ranges, minVals = autoNorm(datingDataMat)# normalize the feature values
    m=normMat.shape[0]# number of samples
    numTestVec=int(m*hoRatio)# number of test vectors
    errorCount=0.0
    # Classify each test vector with classify0, then compute and print the error rate
    for i in range(numTestVec):
        classifierResult=classify0(normMat[i,:],normMat[numTestVec:m,:],datingLabels[numTestVec:m],3)
        print('the classifier came back with:%d,the real answer is :%d'% (classifierResult[0][0],datingLabels[i]))
        if(classifierResult[0][0]!=datingLabels[i]):
            errorCount+=1.0
    print('the total error rate is :%f'%(errorCount/float(numTestVec)))

#function5: dating site prediction function
def classifyPerson():
    resultList=['in large doses','in small doses','not at all']
    # Ask for the person's three features
    percentTats=float(input('percentage of time spent playing video games?'))
    ffMiles=float(input('frequent flier miles earned per year?'))
    iceCream=float(input('liters of ice cream consumed per year?'))
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')  # read the data from the file
    normMat,ranges,minVals=autoNorm(datingDataMat)# normalize the feature values
    inArr=array([ffMiles,percentTats,iceCream])# pack the person's features into an array
    classifierResult=classify0((inArr-minVals)/ranges,normMat,datingLabels,3)# normalize the input the same way, then classify it
    print('you will probably like the person:',resultList[classifierResult[0][0]-1])

#function6: convert a 32x32 text image into a 1x1024 vector
def img2vector(filename):
    returnVect = zeros((1,1024))
    fr=open(filename)
    for i in range(32):
        linStr=fr.readline()
        for j in range(32):
            returnVect[0,32*i+j]=int(linStr[j])
    return returnVect

#function7: test code for the handwritten digit recognition system
def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')# list the training set directory
    m = len(trainingFileList)# number of files in the directory
    trainingMat = zeros((m,1024))
    # Parse the class digit from each file name
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]# split on '.', keep the part before the extension
        classNumStr = int(fileStr.split('_')[0])# split on '_', keep the leading digit
        hwLabels.append(classNumStr)# store the digit parsed from the file name
        trainingMat[i,:] = img2vector('digits/trainingDigits/%s' % fileNameStr)# convert the 32x32 image to a 1x1024 vector and store it
    testFileList = listdir('digits/testDigits')# list the test set directory
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult[0][0], classNumStr))
        if (classifierResult[0][0] != classNumStr):
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))




if __name__ == '__main__':
    # Simple classifier example
    # group,labels=creatDataSet()
    # result=classify0([1,1],group,labels,3)
    # print(result[0][0])

    # Create a scatter plot with Matplotlib
    # datingDataMat,datingLabels=file2matrix('datingTestSet.txt')
    # fig=plt.figure()
    # ax=fig.add_subplot(111)
    # ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels))
    # plt.show()

    # Normalize feature values
    # datingDataMat, datingLabels = file2matrix('datingTestSet.txt')
    # normMat,ranges,minVals=autoNorm(datingDataMat)

    # Classifier test
    # datingClassTest()

    # Dating site prediction
    # classifyPerson()

    # Image to vector conversion
    # testVector=img2vector('digits/testDigits/0_13.txt')
    # print(testVector[0,0:31])

    # Handwritten digit recognition test
    handwritingClassTest()

Origin blog.csdn.net/weixin_45182459/article/details/125975649