K-Nearest Neighbor Algorithm Experiment from Machine Learning in Action

1. Preparation: Import data using Python:

 

1. First, create a Python module named kNN.py and add the following code to the kNN.py file:

 

from numpy import *
import operator

def createDataSet():
    # Four 2-D training points and their class labels
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels

 

After entering the Python development environment, enter the following command:
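The command itself is not reproduced here. As a minimal sketch, assuming kNN.py sits in the current working directory, the session would import the module and call createDataSet. The function is repeated inline below (with an explicit np. prefix) only so that the snippet runs on its own:

```python
import numpy as np

def createDataSet():
    # Same toy data as in kNN.py
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# At the prompt this would be: import kNN; group, labels = kNN.createDataSet()
group, labels = createDataSet()
print(group.shape)
print(labels)
```

Printing group and labels afterwards is a quick check that the module was imported correctly.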

 

 

2. Implement the kNN classification algorithm:

 

1. Add the following code to kNN.py:

 

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount={}
    # Vote among the k nearest neighbors
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the labels by vote count, highest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

 

After entering the Python development environment, enter the following command:

 

 

The result is B, correct.
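A minimal sketch of a call that produces this result, assuming the query point is [0, 0] and k = 3 (the exact command is not shown above). classify0 is repeated inline in a slightly condensed form so the snippet is self-contained:

```python
import operator
import numpy as np

def classify0(inX, dataSet, labels, k):
    # Euclidean distance from inX to every training row
    diffMat = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
    distances = ((diffMat ** 2).sum(axis=1)) ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']

# The three nearest neighbors of [0, 0] are two 'B' points and one 'A' point
result = classify0([0, 0], group, labels, 3)
print(result)
```

With k = 3 the vote is two B against one A, which is why the classifier returns B.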

 

3. Using the K-Nearest Neighbor Algorithm to Improve the Matching Effect of Dating Sites

 

Prepare the data: Parse the data from a text file:

    

1. Add the following code to kNN.py:

    

def file2matrix(filename):
    # Read all lines and close the file when done
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    returnMat = zeros((numberOfLines,3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        # First three fields are features, the last field is the integer class label
        returnMat[index,:] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector
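To try the parser without the real data file, the sketch below writes a few made-up rows to a temporary file and parses them. The rows are illustrative values in the assumed format (three tab-separated features followed by an integer label) and are not taken from datingTestSet2.txt. The function is restated with explicit float conversion so the snippet runs standalone:

```python
import os
import tempfile
import numpy as np

def file2matrix(filename):
    with open(filename) as fr:
        lines = [ln.strip() for ln in fr if ln.strip()]
    returnMat = np.zeros((len(lines), 3))
    classLabelVector = []
    for index, line in enumerate(lines):
        fields = line.split('\t')
        returnMat[index, :] = [float(v) for v in fields[0:3]]
        classLabelVector.append(int(fields[-1]))
    return returnMat, classLabelVector

# Hypothetical sample rows: feature1 <TAB> feature2 <TAB> feature3 <TAB> label
rows = ["1000\t5.0\t0.5\t3", "2000\t2.5\t1.5\t2", "3000\t0.0\t1.0\t1"]
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(rows) + "\n")
    path = tmp.name

datingDataMat, datingLabels = file2matrix(path)
os.remove(path)
print(datingDataMat)
print(datingLabels)
```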

 

 

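Entering reload(kNN) at the Python command prompt always raised an error, and searching Baidu did not turn up a fix, so this is still unresolved. Note that in Python 3 reload is no longer a built-in function; it is provided by the importlib module. After skipping this command, the data results are different.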
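A minimal sketch of reloading a module under Python 3, where reload has to be taken from importlib. The standard-library json module stands in for kNN here purely so that the snippet runs anywhere; with the real module the call would be importlib.reload(kNN) after import kNN:

```python
import importlib
import json  # stand-in for the kNN module

# reload() re-executes the module's source and returns the module object
reloaded = importlib.reload(json)
print(reloaded is json)
```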

 

4. Analyze the data: use Matplotlib to create a scatter plot

 

1. First, we use Matplotlib to make a scatter plot of the original data, see the following figure:

 

 


2. Because the class labels of the samples are not used, it is hard to see any useful pattern in the plot. Matplotlib's scatter function accepts per-point size and color arguments, so the points can be marked according to their class. Call scatter with these arguments and look at the figure below:

 

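The error shown in the image above occurs at the Python command prompt. I attributed it to my version of the Matplotlib library, although the message reported (name 'array' is not defined) indicates that array had not been imported from NumPy in that session.

Running the same code in Jupyter succeeds. Plotting the second and third columns of the datingDataMat matrix gives the following figure: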

 

 

Plotting the first and second columns of the datingDataMat matrix gives the following figure:
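The plotting calls are not listed in this post, so here is a rough sketch of how such a labeled scatter plot might be produced. The data is random and stands in for datingDataMat and datingLabels (an assumption), and the Agg backend is selected only so the snippet also runs without a display:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen backend; drop this line for interactive use
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the parsed data: 30 samples, 3 features, labels 1..3
rng = np.random.default_rng(0)
datingDataMat = rng.random((30, 3))
datingLabels = rng.integers(1, 4, size=30)

fig = plt.figure()
ax = fig.add_subplot(111)
# Scale marker size and color by class label so the classes are distinguishable
sizes = 15.0 * np.array(datingLabels)
colors = 15.0 * np.array(datingLabels)
points = ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], s=sizes, c=colors)
ax.set_xlabel("column 1")
ax.set_ylabel("column 2")
```

Swapping the column indices to 0 and 1 plots the first and second columns instead. In an interactive session, plt.show() displays the figure.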

 

 

 

 

5. Prepare the data: normalize the values

 

1. Add the following code to kNN.py:

 

def autoNorm(dataSet):
    # Column-wise minimum, maximum and range
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # newValue = (oldValue - min) / (max - min), element-wise
    normDataSet = dataSet - tile(minVals, (m,1))
    normDataSet = normDataSet/tile(ranges, (m,1))
    return normDataSet, ranges, minVals
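As a small worked example of the normalization newValue = (oldValue - min) / (max - min), the sketch below applies it to a made-up 3x2 matrix. It uses NumPy broadcasting in place of tile, which is a swapped-in shortcut that should give the same result as the function above:

```python
import numpy as np

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    ranges = dataSet.max(0) - minVals
    # Broadcasting subtracts/divides each column by its own min/range
    normDataSet = (dataSet - minVals) / ranges
    return normDataSet, ranges, minVals

# Hypothetical data: the first column has a much larger scale than the second
data = np.array([[10.0, 0.5],
                 [20.0, 1.5],
                 [30.0, 1.0]])
normMat, ranges, minVals = autoNorm(data)
print(normMat)   # every column now lies in [0, 1]
print(ranges)
print(minVals)
```

Without this step the large-valued column would dominate the Euclidean distance used by classify0.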

 

6. Testing the Algorithm: Verifying the Classifier as a Complete Program

 

1. Add the following code to kNN.py:

 

def datingClassTest():
    hoRatio = 0.1      # hold out the first 10% of the rows as the test set
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify each test row against the remaining 90% of the data
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if (classifierResult != datingLabels[i]): errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))
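The hold-out procedure can also be tried without the data file. The sketch below runs the same 10% split on synthetic, well-separated data; knn_predict and holdout_error_rate are hypothetical helper names, and the vote uses collections.Counter instead of the sorted dictionary in classify0:

```python
from collections import Counter
import numpy as np

def knn_predict(x, train_X, train_y, k):
    # Majority vote among the k nearest training rows (Euclidean distance)
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    nearest = dists.argsort()[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def holdout_error_rate(X, y, ho_ratio=0.1, k=3):
    m = X.shape[0]
    num_test = int(m * ho_ratio)   # first rows are the test set
    errors = 0
    for i in range(num_test):
        pred = knn_predict(X[i], X[num_test:], y[num_test:], k)
        if pred != y[i]:
            errors += 1
    return errors / float(num_test)

# Synthetic data: class 1 lies in [0, 1)^2, class 2 in [10, 11)^2, interleaved
rng = np.random.default_rng(0)
X = np.empty((100, 2))
X[0::2] = rng.random((50, 2))
X[1::2] = rng.random((50, 2)) + 10.0
y = [1, 2] * 50

error_rate = holdout_error_rate(X, y)
print("the total error rate is: %f" % error_rate)
```

Because the two synthetic clusters do not overlap, the error rate here is zero; on the real dating data a nonzero rate is to be expected.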

 

 

7. Using Algorithms: Building a Complete Usable System

 

1. Add the following code to kNN.py:

 

def classifyPerson():
    resultList = ["not at all","in small doses","in large doses"]
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles,percentTats,iceCream])
    # Normalize the query with the training statistics and compare it against
    # the normalized training data (normMat), not the raw datingDataMat
    classifierResult = classify0(((inArr-minVals)/ranges),normMat,datingLabels,3)
    print("You will probably like this person:",resultList[classifierResult - 1])
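The step that is easy to get wrong here is scaling the new input: it has to be normalized with the minimum and range of the training data and then compared against the normalized training matrix. A small sketch with made-up numbers (not taken from the data file):

```python
import numpy as np

# Hypothetical training rows: miles per year, % time gaming, liters of ice cream
train = np.array([[1000.0, 2.0, 0.1],
                  [5000.0, 10.0, 1.5],
                  [3000.0, 6.0, 0.8]])
minVals = train.min(0)
ranges = train.max(0) - minVals
normMat = (train - minVals) / ranges

# A new person with the same raw values as the third training row
inArr = np.array([3000.0, 6.0, 0.8])
normQuery = (inArr - minVals) / ranges
print(normQuery)
print(normMat[2])
```

The normalized query coincides with the normalized third row, so its distance to that row is zero. Comparing the normalized query with the raw training matrix would instead give distances dominated by the miles column.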

 

 

8. Handwriting recognition system

    

Prepare data: convert images to test vectors

 

1. Add the following code to kNN.py:

 

def img2vector(filename):
    # Flatten a 32x32 text image of 0/1 characters into a 1x1024 vector
    returnVect = zeros((1,1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0,32*i+j] = int(lineStr[j])
    return returnVect
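To check the conversion without the digit files, the sketch below writes a made-up 32x32 text image (rows alternating between all 0s and all 1s) to a temporary file and flattens it. The image content is purely illustrative:

```python
import os
import tempfile
import numpy as np

def img2vector(filename):
    returnVect = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect

# Hypothetical image: even rows are all '0', odd rows are all '1'
rows = [("1" if i % 2 else "0") * 32 for i in range(32)]
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(rows) + "\n")
    path = tmp.name

vec = img2vector(path)
os.remove(path)
print(vec.shape)
print(vec[0, 0:32])    # first image row
print(vec[0, 32:64])   # second image row
```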

 

 

9. Test the algorithm: use the k-nearest neighbor algorithm to recognize handwritten digits

 

1. Add the following code to kNN.py:

 

from os import listdir

def handwritingClassTest():
    hwLabels = []
    # Load the training set: one 1x1024 row per file
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m,1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        # The digit class is the part of the file name before the underscore
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i,:] = img2vector('trainingDigits/%s' % fileNameStr)
    # Classify every file in the test set and count the errors
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if (classifierResult != classNumStr): errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))
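The label of each sample comes only from its file name. A minimal sketch of that parsing step, assuming names of the form digit_index.txt as the split calls above imply (label_from_filename and the example names are hypothetical):

```python
def label_from_filename(fileNameStr):
    # '9_45.txt' -> '9_45' -> '9' -> 9
    fileStr = fileNameStr.split('.')[0]
    return int(fileStr.split('_')[0])

examples = ['0_13.txt', '9_45.txt', '3_107.txt']
parsed = [label_from_filename(name) for name in examples]
print(parsed)
```

Any file in the directory that does not follow this naming pattern (for example a hidden system file) would make int() raise a ValueError, so the folders should contain only the digit files.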



Runtime Problems and Solutions

1. Preparation: Import data using Python:

1. Entering the command import kNN in the Python environment raised the error "No module named numpy".

It turned out that numpy was not installed on my computer. After downloading and installing the version matching my Python installation, the problem was solved.

 

2. Implement the kNN classification algorithm:

1. In Python 3.x, items() replaces iteritems(). The textbook source code uses iteritems(), but I am using Python 3.6.3, so I changed iteritems() to items() and the problem was solved.

2. After modifying the code and restarting the Python environment, entering kNN.classify0([0,0],group,labels,3) directly raises an error, because the restart clears the previous session. In this case, repeat the commands from step 1 first (import kNN and recreate group and labels).
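The change from item 1 can be seen in isolation in the short sketch below. The vote counts are made up for illustration:

```python
import operator

# Hypothetical vote counts collected from the k nearest neighbors
classCount = {'A': 1, 'B': 2}

# Python 3: dict.items() returns a view of (key, value) pairs;
# dict.iteritems() existed only in Python 2
sortedClassCount = sorted(classCount.items(),
                          key=operator.itemgetter(1), reverse=True)
print(sortedClassCount)
print(sortedClassCount[0][0])  # label with the most votes
```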

 

3. Prepare the data: parse the data from the text file

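1. Entering reload(kNN) at the Python command prompt always raises an error, and searching Baidu did not turn up a fix, so this is still unresolved. Note that in Python 3 reload is no longer a built-in function; it is provided by the importlib module. After skipping this command, the data results are different.

2. At the Python command prompt:

ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*array(datingLabels),15.0*array(datingLabels)) failed with the message: name 'array' is not defined. I took this to be a problem with my Python 3.6.3 and Matplotlib versions, but the message is a NameError: array comes from NumPy and has to be imported in the interactive session first (for example, from numpy import array).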
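A minimal sketch of the missing import, using made-up labels in place of datingLabels. Once array is imported from NumPy, the size and color arguments passed to scatter can be computed:

```python
from numpy import array

# Hypothetical class labels standing in for datingLabels
datingLabels = [1, 2, 3]

# Element-wise scaling only works on a NumPy array, not on a plain list
sizes = 15.0 * array(datingLabels)
print(sizes)
```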

 

6. Testing the Algorithm: Verifying the Classifier as a Complete Program

1. The two print statements in the textbook code each need a pair of parentheses in Python 3, and raw_input must be changed to input (remove the raw_ prefix).

 

7. Using Algorithms: Building a Complete Usable System

1. Each print statement in the textbook code needs a pair of parentheses in Python 3.

 

9. Test the algorithm: use the k-nearest neighbor algorithm to recognize handwritten digits

 

1. If the experiment is run at the Python prompt, the testDigits and trainingDigits folders should be placed in the Python installation directory. More precisely, the code uses relative paths, so the folders must be in the current working directory of the Python session.

2. If you are experimenting in Jupyter, change the paths to full paths, such as: F:\Softwares\Python\testDigits; F:\Softwares\Python\trainingDigits
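Both cases come down to where Python resolves the relative folder names. A small sketch that prints the current working directory and builds full paths from it (base_dir is a hypothetical variable; on my setup it would be set to the F:\Softwares\Python folder instead):

```python
import os

# Relative paths such as 'trainingDigits' are resolved against this directory
base_dir = os.getcwd()

# Build full paths so the code works regardless of where the session started
training_dir = os.path.join(base_dir, 'trainingDigits')
test_dir = os.path.join(base_dir, 'testDigits')
print(training_dir)
print(test_dir)
```

Checking os.path.isdir(training_dir) before calling listdir gives a clearer message than the error listdir raises when the folder is missing.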

 

