Machine Learning in Action: the k-nearest neighbor algorithm

Algorithm flow

1. Collect data

2. Calculate the distance between the target data and the sample data

3. Sort the calculated distances in ascending order

4. Select the k sample data with the smallest distance from the target data

5. Count the class frequencies among those k samples

6. The most frequent class among the k samples is the predicted class of the target data

Distance calculation method

Use the Euclidean distance formula to calculate the distance between two vector points xA and xB:

d = sqrt((xA0 - xB0)^2 + (xA1 - xB1)^2)
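For example, the distance between the sample points [1.0, 1.1] and [0, 0.1] from the data set below can be computed directly with NumPy (a minimal stand-alone sketch):

from numpy import *

# Euclidean distance between two 2-D points from the sample set below
xA = array([1.0, 1.1])
xB = array([0.0, 0.1])
dist = sqrt(((xA - xB) ** 2).sum())
print(dist)  # sqrt(1.0**2 + 1.0**2) = 1.4142...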

Code

from numpy import *
import operator  # operator module


# Create the sample data set
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


# k-nearest neighbor algorithm
def classify(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]  # number of samples in the data set
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # subtract each sample from the target point
    sqDiffMat = diffMat ** 2  # square every element of the matrix
    sqDistances = sqDiffMat.sum(axis=1)  # sum the elements of each row
    distances = sqDistances ** 0.5  # take the square root to get the distances

    # Sort the distances in ascending order
    sortedDistIndicies = distances.argsort()  # indices that would sort the array

    # Count the class frequencies of the k nearest samples
    classCount = {}  # empty dictionary
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # label of the i-th nearest sample
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # increment the count for that label
    # Sort the class counts by frequency; reverse=True gives descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


# Test data
group, labels = createDataSet()
print(classify([0, 0], group, labels, 3))
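For the sample set above, the three nearest neighbors of [0, 0] are the two 'B' points and the 'A' point at [1.0, 1.0], so the call prints B.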

Example

Example 1: Using the k-nearest neighbor algorithm on a dating website

1. Prepare the data: parse the data from the text file

# Convert text records to a NumPy matrix
def file2matrix(filename):
    fr = open(filename)  # open the text file
    arrayOLines = fr.readlines()  # read the entire file line by line, returning a list
    numberOfLines = len(arrayOLines)  # number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # zero matrix with one row per data line and 3 columns
    classLabelVector = []  # list of class labels
    # Map the text labels to integer classes
    labels = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}
    index = 0
    # Parse the file data into the matrix and the label list
    for line in arrayOLines:
        line = line.strip()  # strip surrounding whitespace from the line
        listFromLine = line.split('\t')  # split the line into a list of fields on tabs
        returnMat[index, :] = listFromLine[0:3]  # copy the first three fields into the matrix
        # Append the sample's label, keeping the order aligned with the feature rows;
        # in Python, index -1 refers to the last element of a list
        classLabelVector.append(labels[listFromLine[-1]])
        index += 1
    return returnMat, classLabelVector
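A typical call (assuming datingTestSet.txt is a tab-separated file with three numeric feature columns followed by a text label, as the parser above expects):

datingDataMat, datingLabels = file2matrix('datingTestSet.txt')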


2. Analyze the data: use Matplotlib to create a scatter plot

# Create a scatter plot with Matplotlib
import matplotlib.pyplot as plt

fig = plt.figure()  # create a new figure
ax = fig.add_subplot(111)  # 1x1 grid, first subplot
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0 * array(datingLabels),
           15.0 * array(datingLabels))  # scatter plot of the second and third columns of the matrix
# scatter(x, y, s=1, c="g", marker="s", linewidths=0)
# s: marker size, c: marker color, marker: shape, linewidths: edge width
plt.show()


3. Prepare the data: normalize the values

newValue = (oldValue - min) / (max - min), where min and max are the smallest and largest values of that feature in the data set, respectively
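For example, a value of 5 on a feature that ranges from 0 to 10 normalizes to (5 - 0) / (10 - 0) = 0.5.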

# Normalize feature values to the range [0, 1]
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    # min(0) returns the minimum of each column
    # min(1) returns the minimum of each row
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    normDataSet = dataSet - tile(minVals, (m, 1))  # subtract the column minimums from every row
    normDataSet = normDataSet / tile(ranges, (m, 1))  # (oldValue - min) / (max - min)
    return normDataSet, ranges, minVals
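The returned ranges and minVals matter at prediction time: a new query point must be scaled with the same statistics before it is passed to classify, as classifyPerson does later. A minimal sketch (the query values here are hypothetical):

normMat, ranges, minVals = autoNorm(datingDataMat)
query = array([40000, 8.5, 0.9])  # hypothetical raw sample: miles, game time, ice cream
print(classify((query - minVals) / ranges, normMat, datingLabels, 3))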


4. Test the algorithm: verify the classifier as a complete program

Use 90% of the existing data as training samples to train the classifier, and use the remaining 10% to test it: out of the 1,000 records, 100 are held out to verify the classifier.

# Test code for the dating-site classifier
def datingClassTest():
    hoRatio = 0.10  # hold out 10% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet.txt')  # parse the data from the file
    normDataSet, ranges, minVals = autoNorm(datingDataMat)  # normalize the data
    m = normDataSet.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0  # error counter
    for i in range(numTestVecs):
        classifierResult = classify(normDataSet[i, :], normDataSet[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]: errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))


5. Use the algorithm: build a complete usable system

Enter a new sample and let the classifier predict its class.

# Dating-site prediction function
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet1.txt')  # parse the data from the text file
    normMat, ranges, minVals = autoNorm(datingDataMat)  # normalize the data
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify((inArr - minVals) / ranges, normMat, datingLabels, 3)  # normalize the query point, then classify it
    print("You will probably like this person:", resultList[classifierResult - 1])  # print the predicted class


Example 2: Handwriting recognition system

For simplicity, the system constructed here only recognizes the digits 0 to 9. Storing images as text does not use memory efficiently, but for ease of understanding we still convert the images to text format.

1. Prepare the data: convert the image to a test vector

Convert the image format to a vector: each image is a 32×32 binary matrix, which is flattened into a 1×1024 vector.

# Convert an image to a vector
def img2vector(filename):
    returnVec = zeros((1, 1024))  # build a 1x1024 zero matrix
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()  # read the current line of the file as a string
        for j in range(32):
            returnVec[0, 32 * i + j] = int(lineStr[j])  # copy each digit of the image into the corresponding slot of the vector
    return returnVec
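A quick check of the conversion (0_13.txt is a hypothetical file name, assuming the class_index.txt naming pattern used by the test code below):

testVector = img2vector('testDigits/0_13.txt')  # hypothetical file name
print(testVector[0, 0:31])  # the first 31 entries of the flattened image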

2. Test algorithm: Recognize handwritten digits using k-nearest neighbor algorithm

Input the data in the processed vector into the classifier to test the performance of the classifier.

import os  # needed for os.listdir

# Test code for the handwritten digit recognition system
def handwritingClassTest():
    hwLabels = []
    trainingFileList = os.listdir('trainingDigits')  # names of the entries in the directory (in arbitrary order)
    m = len(trainingFileList)  # number of training files in the folder
    trainingMat = zeros((m, 1024))  # zero matrix with m rows and 1024 columns
    for i in range(m):
        fileNameStr = trainingFileList[i]  # file name at this position
        fileStr = fileNameStr.split('.')[0]  # split on '.' to drop the file extension
        classNumberStr = int(fileStr.split('_')[0])  # the digit this image file represents
        hwLabels.append(classNumberStr)  # add the digit to the label list
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)  # convert each file to a vector and store it in trainingMat
    testFileList = os.listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumberStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumberStr))
        if classifierResult != classNumberStr: errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

Summary

  • The k-nearest neighbor algorithm is one of the simplest and most effective algorithms for classifying data.
  • It must keep the entire data set; if the training set is large, it uses a lot of storage space.
  • It must compute a distance to every sample in the data set, which is very time-consuming.
  • It gives no information about the underlying structure of the data, so it cannot tell what an average or typical sample of each class looks like.
