KNN: Improvements and Practice

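The previous post introduced the main idea behind the k-nearest neighbors (KNN) algorithm, a very basic and simple classifier. This post adds a few features and small improvements to make it more practical.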
All data and code are adapted from the book Machine Learning in Action (《机器学习实战》).

1. Parsing data from a text file

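Suppose we have a text file with one sample per line, 1000 lines in total, and three features per sample:
1. Frequent flyer miles earned per year
2. Percentage of time spent playing video games
3. Liters of ice cream consumed per week
A few rows of the data: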

40920   8.326976   0.953952   3
14488  7.153469   1.673904   2
26052  1.441871   0.805124   1
75136  13.147394  0.428964   1
38344  1.669788   0.134296   1
72993  10.141740  1.032955   1

The last column is the class label.
The following code converts the file into a feature matrix and a label list:

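import numpy as np

def file2matrix(filename):
    # Read all lines; the with-block closes the file afterwards
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOflines = len(arrayOLines)
    # Allocate the feature matrix: one row per sample, three feature columns
    returnMat = np.zeros((numberOflines, 3))
    classLabelVector = []
    index = 0
    # Fill the matrix row by row
    for line in arrayOLines:
        # Strip leading/trailing whitespace, including the newline
        line = line.strip()
        # Split the tab-separated line into a list of fields
        listFromLine = line.split('\t')
        # The first three fields are the features; NumPy converts the strings to floats
        returnMat[index, :] = listFromLine[0: 3]
        # The last field is the integer class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector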
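For a file this regular, NumPy can also do the parsing itself. The sketch below is one possible alternative, not the book's code: load_dating_data is a hypothetical helper name, and it assumes every row holds three numeric features followed by an integer label, separated by tabs or spaces. To keep the example self-contained it reads three of the sample rows from an in-memory buffer instead of a file.

```python
import io
import numpy as np

def load_dating_data(source):
    """Hypothetical helper: parse rows of 3 features + 1 label with np.loadtxt."""
    # loadtxt splits on any whitespace by default; ndmin=2 keeps a 2D shape
    data = np.loadtxt(source, ndmin=2)
    features = data[:, :3]
    labels = data[:, -1].astype(int).tolist()
    return features, labels

# Three rows copied from the sample data, held in memory instead of a file
sample = io.StringIO(
    "40920\t8.326976\t0.953952\t3\n"
    "14488\t7.153469\t1.673904\t2\n"
    "26052\t1.441871\t0.805124\t1\n"
)
features, labels = load_dating_data(sample)
print(features.shape)  # (3, 3)
print(labels)          # [3, 2, 1]
```

Passing a filename instead of the buffer should work the same way, as long as the file has no malformed rows.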

returnMat looks like this:

[[4.0920000e+04 8.3269760e+00 9.5395200e-01]
 [1.4488000e+04 7.1534690e+00 1.6739040e+00]
 [2.6052000e+04 1.4418710e+00 8.0512400e-01]
 ...
 [2.6575000e+04 1.0650102e+01 8.6662700e-01]
 [4.8111000e+04 9.1345280e+00 7.2804500e-01]
 [4.3757000e+04 7.8826010e+00 1.3324460e+00]]

datingLabels[:10] is:

[3, 2, 1, 1, 1, 1, 3, 3, 1, 3]

2. Analyzing the data

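Analyzing the data is an important step. Here we mainly use matplotlib to visualize the data, to see how well the classes separate and what share of the samples each class takes, which informs later improvements.
The following code plots the dataset: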

import numpy as np
import matplotlib.pyplot as plt

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
labels = np.array(datingLabels)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# One scatter call per class: (label, color, marker)
for label, color, marker in [(1, 'b', 'x'), (2, 'r', 'o'), (3, 'g', '*')]:
    points = datingDataMat[labels == label]
    ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=color, marker=marker)
plt.show()

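The result is a 3D scatter plot of the three features, with each class drawn in its own color and marker.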
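The share of samples in each class can also be checked numerically rather than read off a plot. A minimal sketch, using only the first ten labels printed earlier in this post (so the percentages describe those ten samples, not the full dataset):

```python
from collections import Counter

# The first ten labels shown earlier in this post
labels = [3, 2, 1, 1, 1, 1, 3, 3, 1, 3]

counts = Counter(labels)
total = len(labels)
for label in sorted(counts):
    share = counts[label] / total
    print("class {}: {} samples ({:.0%})".format(label, counts[label], share))
```

Running the same count on the full datingLabels list would show whether the classes are balanced, which matters when judging an error rate.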

3. Normalizing the data

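The visualization shows that the classes are reasonably well separated. However, consider the distance between the samples [0, 20000, 1.1, 2] and [67, 32000, 0.1, 2] (the last element is the label):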

$\sqrt{(0-67)^2+(20000-32000)^2+(1.1-0.1)^2}$

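The feature with the largest numeric range dominates the distance, which biases the result toward that feature. If we consider the three features equally important, they should carry equal weight, so the data needs to be normalized.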
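To make the imbalance concrete, the short check below computes each feature's contribution to the squared distance between the two samples above. It uses only the numbers from the text; the variable names are illustrative.

```python
import numpy as np

# The two samples from the text, without their label
a = np.array([0.0, 20000.0, 1.1])
b = np.array([67.0, 32000.0, 0.1])

squared_diffs = (a - b) ** 2             # per-feature squared difference
distance = np.sqrt(squared_diffs.sum())  # Euclidean distance
share = squared_diffs / squared_diffs.sum()

print(squared_diffs)
print(distance)
print(share)
```

The second feature accounts for well over 99.9% of the squared distance, so the other two features have almost no say in which neighbors are nearest.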
Here we use min-max normalization (range scaling), which is defined as:

$X^* = \frac{X - \min}{\max - \min}$

def autoNorm(dataSet):
    # Column-wise minimum and maximum of each feature
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    # Broadcasting applies the per-column min and range to every row
    normDataSet = (dataSet - minVals) / ranges
    return normDataSet, ranges, minVals

Besides min-max normalization, other common methods include z-score standardization and decimal scaling.
normMat looks like this:

[[0.44832535 0.39805139 0.56233353]
 [0.15873259 0.34195467 0.98724416]
 [0.28542943 0.06892523 0.47449629]
 ...
 [0.29115949 0.50910294 0.51079493]
 [0.52711097 0.43665451 0.4290048 ]
 [0.47940793 0.3768091  0.78571804]]

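As the output shows, normalization does not change how the values of each feature are distributed relative to one another, but it puts all three features on the same 0 to 1 scale.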
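For comparison, here is a minimal sketch of z-score standardization, one of the alternatives mentioned above. zscore_norm is a hypothetical name chosen to mirror autoNorm; it is not from the book. The input is the first three feature rows of the sample data.

```python
import numpy as np

def zscore_norm(data_set):
    """Hypothetical counterpart to autoNorm: subtract the mean, divide by the std."""
    means = data_set.mean(axis=0)
    stds = data_set.std(axis=0)
    return (data_set - means) / stds, means, stds

# First three feature rows from the sample data
rows = np.array([
    [40920, 8.326976, 0.953952],
    [14488, 7.153469, 1.673904],
    [26052, 1.441871, 0.805124],
])
norm, means, stds = zscore_norm(rows)
print(norm.mean(axis=0))  # each column is close to 0
print(norm.std(axis=0))   # each column is close to 1
```

Unlike min-max scaling, the result is not confined to a fixed interval, but a single extreme value stretches the scale less.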

4. Testing the algorithm

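Testing is an important step in machine learning: it measures the classifier's performance, and when performance is poor it points to where the model can be improved.
A common choice is to use 90% of the available data as training data and hold out the remaining 10% for testing. There are several ways to split the data into training and test sets, such as hold-out, bootstrapping, and cross-validation. The code is as follows: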

def datingClassTest():
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classfiy0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with {}, the real answer is : {}".format(classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is {}".format(errorCount/float(numTestVecs)))

Output:

the classifier came back with 3, the real answer is : 3
the classifier came back with 2, the real answer is : 2
...
the classifier came back with 2, the real answer is : 2
the total error rate is 0.05

The error rate is 5%, which means the classifier is 95% accurate on the test set. That is a good result for such a simple classifier.
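The test above is a single hold-out split. As a rough sketch of cross-validation, one of the other splitting methods mentioned earlier, the code below rotates the test set through five folds so every sample is tested once. Everything here is an assumption for illustration: knn_predict and k_fold_error_rate are hypothetical names, and the data is synthetic (three well-separated clusters), not the dating dataset.

```python
import numpy as np
from collections import Counter

def knn_predict(query, train_x, train_y, k):
    # Euclidean distance from the query to every training sample
    distances = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    # Majority vote among the k nearest labels
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

def k_fold_error_rate(x, y, k=3, folds=5, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))  # shuffle before splitting
    errors = 0
    for test_idx in np.array_split(order, folds):
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[test_idx] = False  # everything outside the fold is training data
        for i in test_idx:
            if knn_predict(x[i], x[train_mask], y[train_mask], k) != y[i]:
                errors += 1
    return errors / len(x)

# Synthetic data: three clusters of 30 points each, labels 1 to 3
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
x = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
y = np.repeat([1, 2, 3], 30)

rate = k_fold_error_rate(x, y)
print("cross-validated error rate:", rate)
```

The synthetic features already share one scale, so the sketch skips normalization. With real data, the min and range should be computed from the training folds only and then applied to the test fold.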

5. Using the algorithm

With the error rate measured, we can now use the classifier to make predictions.

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input('percentage of time spent playing video games?'))
    ffMiles = float(input('frequent flier miles earned per year?'))
    iceCream = float(input('liters of ice cream consumed per year?'))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = np.array([ffMiles, percentTats, iceCream])
    classifierResult = classfiy0((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print('you will probably like this person:', resultList[classifierResult - 1])

Call classifyPerson and enter your own values. A sample run:

percentage of time spent playing video games?>? 12
frequent flier miles earned per year?>? 20000
liters of ice cream consumed per year?>? 0.1
you will probably like this person: in large doses
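The key detail in classifyPerson is that the new input is scaled with the minimum and range of the training data before distances are computed. The sketch below shows the same idea without the interactive prompts, which makes it easier to test. predict_preference is a hypothetical function, not part of the book's code, and the training set is just the six sample rows shown at the start of this post.

```python
import numpy as np
from collections import Counter

def predict_preference(ff_miles, percent_games, ice_cream, train_x, train_y, k=3):
    """Hypothetical non-interactive variant of classifyPerson."""
    result_list = ['not at all', 'in small doses', 'in large doses']
    # Normalization statistics come from the training data only
    min_vals = train_x.min(axis=0)
    ranges = train_x.max(axis=0) - min_vals
    norm_train = (train_x - min_vals) / ranges
    # Scale the query with the same statistics
    query = (np.array([ff_miles, percent_games, ice_cream]) - min_vals) / ranges
    distances = np.sqrt(((norm_train - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    label = Counter(train_y[i] for i in nearest).most_common(1)[0][0]
    return result_list[label - 1]

# The six sample rows shown at the start of this post
train_x = np.array([
    [40920, 8.326976, 0.953952],
    [14488, 7.153469, 1.673904],
    [26052, 1.441871, 0.805124],
    [75136, 13.147394, 0.428964],
    [38344, 1.669788, 0.134296],
    [72993, 10.141740, 1.032955],
])
train_y = [3, 2, 1, 1, 1, 1]

# Querying with the first training row and k=1 returns that row's own label (3)
print(predict_preference(40920, 8.326976, 0.953952, train_x, train_y, k=1))
# in large doses
```

Six rows are far too few for a meaningful prediction; the point is only the order of operations: compute statistics on the training data, scale the query with them, then vote.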

6. Complete code

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib
import operator

def classfiy0(inX, dataset, labels, k):
    dataSetSize = dataset.shape[0]
    # print("dataSetSize", dataSetSize)
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataset
    # print("diffMat", diffMat)
    sqDiffMat = diffMat**2
    # print("sqDiffMat", sqDiffMat)
    sqDistances = sqDiffMat.sum(axis=1)
    # print("sqDistances", sqDistances)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # print("classCount", classCount)
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # print("sortedClassCount", sortedClassCount)
    return sortedClassCount[0][0]

def file2matrix(filename):
    # The with-block closes the file once the lines are read
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    numberOflines = len(arrayOLines)
    # print(numberOflines)
    returnMat = np.zeros((numberOflines, 3))
    # print(returnMat)
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        # print(listFromLine)
        returnMat[index, :] = listFromLine[0: 3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    # Broadcasting applies the per-column min and range to every row
    normDataSet = (dataSet - minVals) / ranges
    return normDataSet, ranges, minVals

def datingClassTest():
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classfiy0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with {}, the real answer is : {}".format(classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is {}".format(errorCount/float(numTestVecs)))

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input('percentage of time spent playing video games?'))
    ffMiles = float(input('frequent flier miles earned per year?'))
    iceCream = float(input('liters of ice cream consumed per year?'))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = np.array([ffMiles, percentTats, iceCream])
    classifierResult = classfiy0((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print('you will probably like this person:', resultList[classifierResult - 1])
    
classifyPerson()