Machine Learning: kNN

I have recently been working through Machine Learning in Action (《机器学习实战》).

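The kNN algorithm finds the k records in the training set that are closest to a new data point (measured here by Euclidean distance), then assigns the new point the class that is most common among those k records. The algorithm depends on three main factors: the training set, the distance or similarity measure, and the value of k.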

1. Running the kNN algorithm

kNN can solve problems like the following. The samples are:


group = array([[1.0,1.1],[1.1,1.0],[0,0],[0,0.1]])
labels = ['A','A','B','B']

We then want to decide which class each of [1.1,1.2] and [0.1,0.2] belongs to. First create the file kNN.py and load the data:


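import numpy
import operator  # used by classify0 when sorting the votes


# Build the four labelled sample points
def createDataSet():
    group = numpy.array([[1.0,1.1], [1.1,1.0], [0,0], [0,0.1]])
    labels = ['A','A','B','B']
    return group,labels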

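Next we apply the kNN algorithm.

For each point to be classified, do the following:

1. Compute the Euclidean distance between the query point and every sample point;

2. Sort the sample points by increasing distance;

3. Take the first k points, count how often each label occurs among them, and sort the labels by count in descending order;

4. Return the most frequent label as the kNN prediction.

The code is as follows: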


# k-nearest neighbors algorithm
def classify0(inX,dataSet,labels,k):
    dataSetSize = dataSet.shape[0]
    # Compute the Euclidean distance from inX to every row of dataSet
    diffMat = numpy.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # Sort the distances and keep the indices of the sorted order
    sortedDistIndicies = distances.argsort()
    # Dictionary mapping each label to its vote count
    classCount = {}
    # Count the labels of the k nearest points
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
    # Sort the labels by vote count, highest first
    sortedClassCount = sorted(classCount.items(),key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]
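The same four steps can also be written more compactly. The sketch below is an alternative, not the listing from the book: it assumes NumPy broadcasting in place of tile() and collections.Counter in place of the dictionary and sort. The name knn_classify is made up for this illustration.

```python
from collections import Counter

import numpy as np


def knn_classify(query, data, labels, k):
    """Return the majority label among the k rows of data nearest to query."""
    data = np.asarray(data, dtype=float)
    query = np.asarray(query, dtype=float)
    # Broadcasting subtracts query from every row, so tile() is not needed.
    distances = np.sqrt(((data - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances, nearest first.
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    # most_common(1) returns [(label, count)] for the most frequent label.
    return votes.most_common(1)[0][0]


group = np.array([[1.0, 1.1], [1.1, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_classify([1.1, 1.2], group, labels, 3))
print(knn_classify([0.1, 0.2], group, labels, 3))
```

When two labels receive the same number of votes, Counter keeps the one it encountered first, which here is the label of the nearer neighbor.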

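Then open a console in the directory containing the file, start Python, and run the following commands in the interactive interpreter (this is a simple classifier):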

>>> import kNN
>>> group,labels=kNN.createDataSet()
>>> kNN.classify0([1.1,1.2],group,labels,3)
'A'
>>> kNN.classify0([0.1,0.2],group,labels,3)
'B'
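To see why [1.1,1.2] comes back as 'A', it helps to print the distances that the classifier sorts. This is a small stand-alone sketch using the same four sample points as createDataSet; the variable names are chosen for this illustration only.

```python
import numpy as np

group = np.array([[1.0, 1.1], [1.1, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
query = np.array([1.1, 1.2])

# Euclidean distance from the query to each sample point.
distances = np.sqrt(((group - query) ** 2).sum(axis=1))
# Sample indices ordered from nearest to farthest.
order = distances.argsort()
for i in order:
    print(labels[i], group[i], round(float(distances[i]), 3))

# Labels of the three nearest samples, the ones that vote when k = 3.
nearest_labels = [labels[i] for i in order[:3]]
print(nearest_labels)
```

With k = 3 the two 'A' points are by far the closest, and the third vote comes from one of the 'B' points, so 'A' wins two votes to one.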

2. Using kNN to improve matches on a dating site

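The dataset is stored in the text file datingTestSet2.txt, one sample per line, 1000 lines in total. Each sample has the following 3 features:

1. Frequent flier miles earned per year

2. Percentage of time spent playing video games

3. Liters of ice cream consumed per week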

2.1 Parsing and analyzing data from the text file

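Before the data can be fed to the classifier, it has to be converted into a format the classifier accepts. Create a function named file2matrix in kNN.py to handle the input format: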

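# Parse the tab-separated dating records into a NumPy matrix and a label list
def file2matrix(filename):
    # Read all lines and close the file when done
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # Matrix to return: one row per record, three feature columns
    returnMat = numpy.zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # Parse each line into the matrix and the label list
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        # The first three fields are the features
        returnMat[index,:] = listFromLine[0:3]
        # The last field is the integer class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat,classLabelVector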
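For a file in this layout (three tab-separated numeric features followed by an integer label), numpy.loadtxt can do the same parsing in one call. The sketch below is an alternative to file2matrix, not a replacement for it. It reads from an in-memory string so that it runs without the data file; the three sample rows and their labels are the first three records shown in the session output later in this article.

```python
import io

import numpy as np

# Three tab-separated records standing in for the real data file.
sample = io.StringIO(
    "40920\t8.326976\t0.952796\t3\n"
    "14488\t7.153469\t1.672748\t2\n"
    "26052\t1.441871\t0.803968\t1\n"
)

# loadtxt parses every field as a float and returns a 2-D array.
table = np.loadtxt(sample, delimiter="\t")
# The first three columns are the features.
features = table[:, :3]
# The last column is the class label, converted back to plain ints.
labels = table[:, -1].astype(int).tolist()
print(features.shape, labels)
```

Passing a filename instead of the StringIO object reads from disk in the same way. This only works when every field is numeric; a file with text labels would still need a loop like the one in file2matrix.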

Reload kNN.py (in Python 2 this is simply reload(kNN); in Python 3.6 it is import importlib; importlib.reload(kNN)).

Then use Matplotlib to draw a scatter plot and look at how the data is distributed:

>>> import kNN
>>> import numpy
>>> datingDataMat,datingLabels = kNN.file2matrix('datingTestSet2.txt')
>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> fig=plt.figure()
>>> ax=fig.add_subplot(111)
>>> ax.scatter(datingDataMat[:,1],datingDataMat[:,2],15.0*numpy.array(datingLabels),15.0*numpy.array(datingLabels))
>>> plt.show()

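The plot uses the second and third columns of datingDataMat, which hold "percentage of time spent playing video games" and "liters of ice cream consumed per week" respectively. Marker size and color are both set from the class label.

[Figure: scatter plot of the second column of datingDataMat against the third]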



2.2 Normalizing the data

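Before testing, the data needs to be normalized. Otherwise the attributes with large numeric values dominate the distance calculation. We therefore rescale each feature to the range 0 to 1 or -1 to 1, using the following formula: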

newValue = (oldValue - min) / (max - min)
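A quick calculation shows how strongly the unscaled mileage column dominates. The sketch below takes two raw records (the first two rows printed in the session output in this section) and computes the share of the squared Euclidean distance that each feature contributes. The variable names are chosen for this illustration.

```python
import numpy as np

# Two raw records: miles per year, game time percentage, liters of ice cream.
a = np.array([40920.0, 8.326976, 0.952796])
b = np.array([14488.0, 7.153469, 1.672748])

# Per-feature squared differences, the terms summed inside the distance.
squared_diffs = (a - b) ** 2
# Fraction of the squared distance that each feature accounts for.
share = squared_diffs / squared_diffs.sum()
print(share)
```

Almost the entire distance comes from the first feature, so without rescaling the other two features have practically no influence on which neighbors are chosen.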

Add a function autoNorm() to kNN.py that automatically rescales the values to between 0 and 1:

# Normalize the feature values (dating data)
def autoNorm(dataSet):
    # Per-column minimum and maximum, and their difference
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    # Number of rows in dataSet (1000 for the dating data)
    m = dataSet.shape[0]
    # Apply the formula newVal = (oldVal - min) / ranges to every row
    normDataSet = dataSet - numpy.tile(minVals, (m, 1))
    normDataSet = normDataSet/numpy.tile(ranges,(m, 1))
    return normDataSet,ranges,minVals
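The formula can also be applied without tile(), because NumPy broadcasts a row of per-column values across every row of a matrix. The sketch below is a variant under that assumption. The name min_max_normalize is made up here, and the three rows are the sample records printed in the session output in this section.

```python
import numpy as np


def min_max_normalize(data):
    """Rescale each column of data to [0, 1]; return (normalized, ranges, mins)."""
    data = np.asarray(data, dtype=float)
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Broadcasting applies the per-column min and range to every row.
    return (data - min_vals) / ranges, ranges, min_vals


rows = np.array([[40920.0, 8.326976, 0.952796],
                 [14488.0, 7.153469, 1.672748],
                 [26052.0, 1.441871, 0.803968]])
norm, ranges, min_vals = min_max_normalize(rows)
print(norm)
```

Like autoNorm, this divides by the column range, so a column in which every value is identical would produce a division by zero and needs separate handling.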

At the Python prompt, reload the kNN.py module, run autoNorm, and check the result:

>>> import importlib
>>> importlib.reload(kNN)
<module 'kNN' from 'E:\\pyCharm\\workspace\\test1\\learning\\kNN.py'>
>>> norMat,ranges,minVals = kNN.autoNorm(datingDataMat)
[[  4.09200000e+04   8.32697600e+00   9.52796000e-01]
 [  1.44880000e+04   7.15346900e+00   1.67274800e+00]
 [  2.60520000e+04   1.44187100e+00   8.03968000e-01]
 ...,
 [  2.65750000e+04   1.06501020e+01   8.65471000e-01]
 [  4.81110000e+04   9.13452800e+00   7.26889000e-01]
 [  4.37570000e+04   7.88260100e+00   1.33129000e+00]]
[[  9.12730000e+04   2.09193490e+01   1.69436100e+00]
 [  9.12730000e+04   2.09193490e+01   1.69436100e+00]
 [  9.12730000e+04   2.09193490e+01   1.69436100e+00]
 ...,
 [  9.12730000e+04   2.09193490e+01   1.69436100e+00]
 [  9.12730000e+04   2.09193490e+01   1.69436100e+00]
 [  9.12730000e+04   2.09193490e+01   1.69436100e+00]]

Note that the assignment itself prints nothing. The two matrices shown appear to be the unnormalized input and the column ranges repeated for every row, presumably from debugging print calls that are not part of the autoNorm listing; the normalized values are held in norMat.

2.3 Testing the algorithm

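What matters most for a machine learning algorithm is its accuracy. This example uses 90% of the data as training samples and 10% for testing. Because the records are in random order, we simply take the first 10% of the data as the test set. Add a datingClassTest function to the file: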

# Test code for the dating site classifier
def datingClassTest():
    # Hold out 10% of the data for testing
    hoRatio = 0.10
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
        print("the classfier came back with: %d,the real answer is : %d"%(classifierResult,datingLabels[i]))
        if (classifierResult != datingLabels[i]):
            errorCount += 1.0
    # Compute the error rate over the test samples
    print("the total error rate is: %f"%(errorCount/float(numTestVecs)))

At the Python prompt, reload the kNN.py module, run the function, and check the result:

>>> import importlib
>>> importlib.reload(kNN)
<module 'kNN' from 'E:\\pyCharm\\workspace\\test1\\learning\\kNN.py'>
>>> kNN.datingClassTest()
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 3
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 3,the real answer is : 3
the classfier came back with: 2,the real answer is : 2
the classfier came back with: 1,the real answer is : 1
the classfier came back with: 3,the real answer is : 1
the total error rate is: 0.050000

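The error rate is 5%. You can change hoRatio and k (the last argument passed to classify0, 3 here) inside datingClassTest and check whether the error rate changes with them. The classifier's output depends on the classification algorithm, the dataset, and the program settings.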
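One way to run that experiment is to turn the hold-out ratio and k into function arguments and return the error rate instead of printing it. The sketch below does this on synthetic data so that it runs without the data file. The names classify and holdout_error_rate, the two-cluster data, and the random seed are all assumptions made for this illustration, not part of kNN.py.

```python
from collections import Counter

import numpy as np


def classify(query, data, labels, k):
    """Majority label among the k rows of data nearest to query."""
    distances = np.sqrt(((data - query) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


def holdout_error_rate(data, labels, ho_ratio, k):
    """Classify the first ho_ratio of the rows against the rest; return the error rate."""
    m = data.shape[0]
    num_test = int(m * ho_ratio)
    errors = 0
    for i in range(num_test):
        predicted = classify(data[i], data[num_test:], labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / num_test


# Synthetic stand-in for the dating data: two well-separated clusters.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.05, (50, 2)), rng.normal(1.0, 0.05, (50, 2))])
labels = [1] * 50 + [2] * 50
# Shuffle so that the first 10% of the rows is a random sample.
order = rng.permutation(100)
data = data[order]
labels = [labels[i] for i in order]

for k in (1, 3, 5):
    print(k, holdout_error_rate(data, labels, 0.10, k))
```

With the real data, the same loop over k (and over several values of ho_ratio) would show how sensitive the 5% figure is to those two settings.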

2.4 Using the algorithm: building a complete system

Add the following code to kNN.py and reload the module:

# Prediction function for the dating site
def classifyPerson():
    resultList = ['not at all','in small doses','in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = numpy.array([ffMiles,percentTats,iceCream])
    # Scale the input with the training minimums and ranges before classifying
    classifierResult = classify0((inArr - minVals)/ranges,normMat,datingLabels,3)
    print("you will probably like this person:",resultList[classifierResult - 1])

Run the function, enter the three feature values for a user, and it returns the prediction:

>>> import importlib
>>> importlib.reload(kNN)
<module 'kNN' from 'E:\\pyCharm\\workspace\\test1\\learning\\kNN.py'>
>>> kNN.classifyPerson()
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
you will probably like this person: in small doses



Reposted from blog.csdn.net/songzhiren5560/article/details/80452049