Chapter 2

1. A simple kNN model

1.1 Creating the data set

from numpy import *
import operator

def createDataSet():
    # Four 2-D training points and their class labels
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group, labels

Testing the created data set:

>>> import kNN
>>> group,labels = kNN.createDataSet()
>>> group
array([[ 1. , 1.1],
[ 1. , 1. ],
[ 0. , 0. ],
[ 0. , 0.1]])
>>> labels
['A', 'A', 'B', 'B']

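1.2 kNN classification

def classify0(intX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Distance calculation: difference between intX and every row of dataSet
    differMat = tile(intX, (dataSetSize, 1)) - dataSet
    sqDifferMat = differMat**2
    # axis=0 sums down each column, axis=1 sums across each row
    sqDistances = sqDifferMat.sum(axis=1)
    # Square root of the summed squares gives the Euclidean distance
    distances = sqDistances**0.5
    # argsort returns the indices that sort the array in ascending order
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Count the labels of the k nearest points
    for i in range(k):
        voteILabel = labels[sortedDistIndicies[i]]
        classCount[voteILabel] = classCount.get(voteILabel, 0) + 1
    # Python 3: 'dict' object has no attribute 'iteritems', so use items()
    # Sort the classes by vote count, highest first, after all k votes are in
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1),
                              reverse=True)
    return sortedClassCount[0][0]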

Testing the classifier:

>>> kNN.classify0([0,0], group, labels, 3)

The call returns the class 'B'.
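To see why the query point [0,0] ends up in class 'B', the vote can be traced by hand. The following is a minimal sketch in plain Python (no NumPy), using the four points from createDataSet. The helper name knn_vote is hypothetical and is not part of kNN.py.

```python
import math
from collections import Counter

# The four training points and labels from createDataSet()
group = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ['A', 'A', 'B', 'B']


def knn_vote(query, points, point_labels, k):
    """Return (predicted label, labels of the k nearest points); hypothetical helper."""
    # Euclidean distance from the query to every training point
    distances = [
        math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
        for point in points
    ]
    # Indices of the points ordered from nearest to farthest
    order = sorted(range(len(points)), key=lambda i: distances[i])
    nearest = [point_labels[i] for i in order[:k]]
    # Majority vote among the k nearest labels
    return Counter(nearest).most_common(1)[0][0], nearest


prediction, nearest = knn_vote([0, 0], group, labels, 3)
print(nearest)      # labels of the three nearest neighbours
print(prediction)   # majority label among them
```

With k=3 the two 'B' points are the closest neighbours, so they outvote the single 'A' point that completes the top three.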

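2. Improving matches on a dating site

2.1 Background

Given three features of a man, predict whether he is the type she likes and would want to date:

Frequent flyer miles earned per year
Percentage of time spent playing video games
Liters of ice cream consumed per week

The source file has the following format: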

40920	8.326976	0.953952	largeDoses
14488	7.153469	1.673904	smallDoses
26052	1.441871	0.805124	didntLike
75136	13.147394	0.428964	didntLike
38344	1.669788	0.134296	didntLike
72993	10.141740	1.032955	didntLike
35948	6.830792	1.213192	largeDoses
42666	13.276369	0.543880	largeDoses

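We convert the last column of the source file into numbers, where:

3 means largeDoses

2 means smallDoses

1 means didntLike

After the change the file has the following format: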

40920	8.326976	0.953952	3
14488	7.153469	1.673904	2
26052	1.441871	0.805124	1
75136	13.147394	0.428964	1
38344	1.669788	0.134296	1
72993	10.141740	1.032955	1
35948	6.830792	1.213192	3
42666	13.276369	0.543880	3
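The conversion from text labels to numbers can be scripted rather than done by hand. Below is a minimal sketch, assuming the tab-separated layout shown above. The names encode_labels and LABEL_CODES are made up for this illustration and do not come from the book's code.

```python
# Mapping described in the notes: 3 = largeDoses, 2 = smallDoses, 1 = didntLike
LABEL_CODES = {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}


def encode_labels(lines):
    """Replace the text label in the last tab-separated column with its number."""
    encoded = []
    for line in lines:
        fields = line.strip().split('\t')
        # An unknown label raises KeyError, which surfaces typos in the source file
        fields[-1] = str(LABEL_CODES[fields[-1]])
        encoded.append('\t'.join(fields))
    return encoded


sample = [
    '40920\t8.326976\t0.953952\tlargeDoses',
    '14488\t7.153469\t1.673904\tsmallDoses',
    '26052\t1.441871\t0.805124\tdidntLike',
]
for row in encode_labels(sample):
    print(row)
```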

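2.2 Reading and parsing the data

def file2matrix(filename):
    # Read all lines; the with block closes the file afterwards
    with open(filename) as fr:
        array0lines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(array0lines)
    # Matrix to return: one row per line, three feature columns
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []

    index = 0
    # Parse each line into the matrix and the label list
    for line in array0lines:
        # Strip the trailing newline and surrounding whitespace
        line = line.strip()
        # Split the line on tab characters into a list of fields
        listFromLine = line.split('\t')
        # Fill row index with the first three fields (columns 0-2)
        returnMat[index, :] = listFromLine[0:3]
        # The last field is the numeric class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector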

Testing the parsed data:

>>> from importlib import reload
>>> reload(kNN)
>>> datingDataMat,datingLabels = kNN.file2matrix('datingTestSet.txt')
>>> datingDataMat
array([[ 7.29170000e+04, 7.10627300e+00, 2.23600000e-01],
    [ 1.42830000e+04, 2.44186700e+00, 1.90838000e-01],
    [ 7.34750000e+04, 8.31018900e+00, 8.52795000e-01],
    ...,
    [ 1.24290000e+04, 4.43233100e+00, 9.24649000e-01],
    [ 2.52880000e+04, 1.31899030e+01, 1.05013800e+00],
    [ 4.91800000e+03, 3.01112400e+00, 1.90663000e-01]])
>>> datingLabels[0:20]
['didntLike', 'smallDoses', 'didntLike', 'largeDoses', 'smallDoses',
'smallDoses', 'didntLike', 'smallDoses', 'didntLike', 'didntLike',
'largeDoses', 'largeDoses', 'largeDoses', 'didntLike', 'didntLike',
'smallDoses', 'smallDoses', 'didntLike', 'smallDoses', 'didntLike']

2.3 Visualizing the data

2.3.1 Installing matplotlib

First install the matplotlib library:

$ pip install -i https://pypi.douban.com/simple  matplotlib

The Douban mirror is used here to speed up the installation.

2.3.2 Plotting

>>> import matplotlib
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2])
>>> plt.show()

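First we plot the second and third columns of the data.

This plot does not separate the classes well, and the differences between them cannot be seen. There is still data we have not used, so we try adding the class number, that is, the number each label stands for, to the plot.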

>>> ax.scatter(datingDataMat[:,1], datingDataMat[:,2],
15.0*array(datingLabels), 15.0*array(datingLabels))

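The arguments to scatter here are:

datingDataMat[:,1]: x coordinates of the points
datingDataMat[:,2]: y coordinates of the points
15.0*array(datingLabels): size of each point; it is derived from the label so that the classes can be told apart
15.0*array(datingLabels): color of each point, also derived from the label; points of the same class share one color and one size, which makes them easy to distinguish

Then we add descriptions for the X axis and the Y axis. The result is displayed as follows:

Figure: adding class label information

Then we plot the first and second columns of the data.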

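2.4 Normalizing the data

The distance between two vectors was computed above as the Euclidean distance:

A(x1, x2, ... , xn)

B(y1, y2, ... , yn)

The Euclidean distance between A and B is $\sqrt{(x_1-y_1)^{2}+(x_2-y_2)^{2}+\dots+(x_n-y_n)^{2}}$

However, if one feature takes very large values while the others are very small, the large feature dominates the distance calculation, which is clearly unreasonable. All the data therefore has to be normalized to a common scale so that every feature carries equal weight in the distance calculation.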
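A quick calculation on the first two rows of the sample file illustrates the problem. This is only an illustrative sketch, and the variable names are chosen here rather than taken from kNN.py.

```python
# First two rows of the sample file: miles, game-time percentage, ice cream liters
a = [40920, 8.326976, 0.953952]
b = [14488, 7.153469, 1.673904]

# Squared difference contributed by each feature to the squared Euclidean distance
squared_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
total = sum(squared_diffs)

# Fraction of the squared distance that comes from the first feature alone
share_of_miles = squared_diffs[0] / total

print(squared_diffs)
print(share_of_miles)
```

On the raw data the flight-miles column accounts for nearly all of the squared distance, so the other two features barely influence which neighbours count as nearest.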

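Each column of data is processed as follows, converting every value to a number between 0 and 1:

newValue = (oldValue-min)/(max-min)

The normalization code is as follows:

def autoNorm(dataSet):
    # Minimum of each column, as an array
    minVals = dataSet.min(0)
    # Maximum of each column, as an array
    maxVals = dataSet.max(0)
    # Range of each column, used in the conversion below
    ranges = maxVals - minVals
    # Matrix of the same size as dataSet to hold the normalized values
    normDataSet = zeros(shape(dataSet))
    # Number of rows in the original data set, used to expand the vectors into matrices
    m = dataSet.shape[0]
    # Subtract each column's minimum from every value
    normDataSet = dataSet - tile(minVals, (m,1))
    # Dividing by each column's range completes the normalization
    normDataSet = normDataSet/tile(ranges, (m,1))
    # Return the normalized matrix, the range of each column and the minimum of each column
    return normDataSet, ranges, minVals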
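NumPy broadcasting makes the calls to tile unnecessary: subtracting a length-3 array from an m-by-3 array applies it to every row. Below is a hedged sketch of the same normalization. The name auto_norm_broadcast is hypothetical, and the guard against a zero range is an addition that autoNorm above does not have.

```python
import numpy as np


def auto_norm_broadcast(data_set):
    """Min-max normalize each column to [0, 1] using broadcasting (illustrative sketch)."""
    data_set = np.asarray(data_set, dtype=float)
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Assumption: a constant column (range 0) is mapped to 0 instead of dividing by zero
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # min_vals and safe_ranges are broadcast across all rows
    norm_data_set = (data_set - min_vals) / safe_ranges
    return norm_data_set, ranges, min_vals


sample = [[40920, 8.326976, 0.953952],
          [14488, 7.153469, 1.673904],
          [26052, 1.441871, 0.805124]]
norm, ranges, min_vals = auto_norm_broadcast(sample)
print(norm.min(axis=0), norm.max(axis=0))
```

The return values keep the same order as autoNorm (normalized matrix, ranges, minimums), so a new input vector can still be scaled as (inArr-minVals)/ranges.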

Testing the normalization:

>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
array([[0.44832535, 0.39805139, 0.56233353],
       [0.15873259, 0.34195467, 0.98724416],
       [0.28542943, 0.06892523, 0.47449629],
       ...,
       [0.29115949, 0.50910294, 0.51079493],
       [0.52711097, 0.43665451, 0.4290048 ],
       [0.47940793, 0.3768091 , 0.78571804]])
>>> ranges
array([9.1273000e+04, 2.0919349e+01, 1.6943610e+00])
>>> minVals
array([0.      , 0.      , 0.001156])

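2.5 Testing the classifier's accuracy

The data set is split into a test set and a training set. The test samples should be chosen at random; because the records in this data set are already in random order and have not been sorted in any way, we can simply take them from the start of the data set.

10% of the whole data set is used as the test set and the other 90% as the training set.

def datingClassTest():
    # Fraction of the data held out for testing
    hoRatio = 0.10
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    # Number of test vectors: the first 10% of the rows
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify test row i against the remaining 90% of the rows
        classifierResult = classify0(normMat[i, :],
                                     normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m],
                                     3)
        print("the classifier came back with: %d, the real answer is: %d"% (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))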

The test results are as follows:

>>> import kNN
>>> from importlib import reload
>>> kNN.datingClassTest()
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
...
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 1
the total error rate is: 0.050000
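If the records were not already in random order, taking the first 10% would give a biased test set. One option is to shuffle the row indices before splitting. Below is a minimal sketch using only the standard library; holdout_indices and its seed parameter are hypothetical and are not part of kNN.py.

```python
import random


def holdout_indices(num_rows, ho_ratio=0.10, seed=0):
    """Return (test_indices, train_indices) after shuffling the row indices."""
    indices = list(range(num_rows))
    # A fixed seed makes the split reproducible between runs
    random.Random(seed).shuffle(indices)
    num_test = int(num_rows * ho_ratio)
    return indices[:num_test], indices[num_test:]


test_idx, train_idx = holdout_indices(1000)
print(len(test_idx), len(train_idx))
```

The two index lists could then be used to select rows of normMat and entries of datingLabels in place of the slices 0:numTestVecs and numTestVecs:m.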

2.6 Putting the whole system together

Below, a prediction is made from data entered by hand.

def classifyPerson():
	resultList = ['not at all','in small doses', 'in large doses']
	percentTats = float(input("percentage of time spent playing video games?"))
	ffMiles = float(input("frequent flier miles earned per year?"))
	iceCream = float(input("liters of ice cream consumed per year?"))
	datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
	normMat, ranges, minVals = autoNorm(datingDataMat)
	inArr = array([ffMiles, percentTats, iceCream])
	classifierResult = classify0((inArr-minVals)/ranges,normMat,datingLabels,3)
	print("You will probably like this person: ",resultList[classifierResult - 1])

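3. Handwritten digit recognition system

3.1 Data format

The training data is in the folder digits/trainingDigits/ and the test data is in the folder digits/testDigits.

The training folder contains about 2000 samples, roughly 200 for each digit 0-9, while the test folder contains about 900 samples.

Every sample in each folder is a txt file, named as follows:

0_1.txt
# the 0 means this is a sample of the digit 0; the 1 means it is the second sample of the digit 0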
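The label can be recovered from the file name alone. Below is a small sketch of that parsing; parse_digit_filename is a hypothetical helper used only for illustration.

```python
def parse_digit_filename(file_name):
    """Split a name such as '0_1.txt' into (digit label, sample index)."""
    stem = file_name.split('.')[0]           # drop the '.txt' extension
    digit_str, index_str = stem.split('_')   # '0_1' -> '0', '1'
    return int(digit_str), int(index_str)


print(parse_digit_filename('0_1.txt'))
```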

Every sample is 32 rows by 32 columns, in the following format:

00000000000111110000000000000000
00000000001111111000000000000000
00000000011111111100000000000000
00000000111111111110000000000000
00000001111111111111000000000000
00000011111110111111100000000000
00000011111100011111110000000000
00000011111100001111110000000000
00000111111100000111111000000000
00000111111100000011111000000000
00000011111100000001111110000000
00000111111100000000111111000000
00000111111000000000011111000000
00000111111000000000011111100000
00000111111000000000011111100000
00000111111000000000001111100000
00000111111000000000001111100000
00000111111000000000001111100000
00000111111000000000001111100000
00000111111000000000001111100000
00000011111000000000001111100000
00000011111100000000011111100000
00000011111100000000111111000000
00000001111110000000111111100000
00000000111110000001111111000000
00000000111110000011111110000000
00000000111111000111111100000000
00000000111111111111111000000000
00000000111111111111110000000000
00000000011111111111100000000000
00000000001111111111000000000000
00000000000111111110000000000000

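3.2 Preprocessing the data

Each sample now has to be converted into a 1*1024 vector, and all the samples together are then treated as one training matrix.

def img2vector(filename):
    # Build the 1*1024 vector to return
    returnVect = zeros((1, 1024))
    # Open the file; the with block closes it afterwards
    with open(filename) as fr:
        for i in range(32):
            # Read the 32 lines of the file one by one
            lineStr = fr.readline()
            for j in range(32):
                # Store the 32 characters of each line in the 1*1024 vector,
                # converting each character to an integer
                returnVect[0, i*32+j] = int(lineStr[j])
    return returnVect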
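To check the flattening order without touching the data files, the same row-by-row layout can be reproduced on an in-memory example. This is a sketch in plain Python, assuming every line holds 32 characters that are '0' or '1'. The helper lines_to_vector is hypothetical and the synthetic image is made up for the check.

```python
import io


def lines_to_vector(file_obj, size=32):
    """Read `size` lines of `size` digits and flatten them row by row into one list."""
    vector = []
    for _ in range(size):
        line = file_obj.readline().strip()
        vector.extend(int(ch) for ch in line[:size])
    return vector


# Synthetic image: all zeros except a single 1 in row 2, column 5 (0-based)
rows = ['0' * 32 for _ in range(32)]
rows[2] = '0' * 5 + '1' + '0' * 26
vector = lines_to_vector(io.StringIO('\n'.join(rows)))

print(len(vector))        # number of entries in the flattened vector
print(vector.index(1))    # position of the single 1, i.e. 2 * 32 + 5
```

The pixel at row i, column j lands at position i*32+j, the same index that img2vector uses.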

The code for the complete handwritten digit recognition system is as follows:

# listdir lists all file names in the given folder and returns them as a list
from os import listdir

def handwritingClassTest():
    # List of labels for the handwritten digit samples
    hwLabels = []
    trainingFileList = listdir('digits/trainingDigits')
    # Total number of training samples
    m = len(trainingFileList)
    # Matrix of training samples
    trainingMat = zeros((m, 1024))
    for i in range(m):
        # File name of each training sample
        fileNameStr = trainingFileList[i]
        # Keep only the name part, without the extension
        fileStr = fileNameStr.split('.')[0]
        # After splitting on '_', the first part is the label of this training sample
        classNumStr = int(fileStr.split('_')[0])
        # Append the label of this training sample to the label list
        hwLabels.append(classNumStr)
        # Convert each file to a vector with img2vector and store it in the training matrix
        trainingMat[i, :] = img2vector('digits/trainingDigits/%s' % fileNameStr)

    # All file names in the test folder
    testFileList = listdir('digits/testDigits')
    # errorCount records the number of wrong predictions
    errorCount = 0.0
    # mTest holds the total number of test samples
    mTest = len(testFileList)
    for i in range(mTest):
        # File name of each test sample
        fileNameStr = testFileList[i]
        # Drop the extension to get the name part
        fileStr = fileNameStr.split('.')[0]
        # Label of this test sample
        classNumStr = int(fileStr.split('_')[0])
        # Convert the test file to a vector with img2vector
        vectorUnderTest = img2vector('digits/testDigits/%s' % fileNameStr)
        # Classify it with the kNN function written earlier
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        # Print the predicted class and the real class
        print('the classifier came back with: %d, the real answer is: %d' % (classifierResult, classNumStr))
        # Count the error if the prediction is wrong
        if classifierResult != classNumStr:
            errorCount += 1.0
    # Print the total number of errors and the error rate
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))

Machine Learning in Action: Reading Notes (1)
