1. Strengths and weaknesses of the k-nearest neighbors (KNN) algorithm
- Classifies by measuring the distance between feature values
- Strengths: high accuracy, insensitive to outliers, no assumptions about the input data
- Weaknesses: high computational and space complexity
- Works with: numeric and nominal values
2. How KNN works
- Start with a training sample set in which every data point carries a label marking its class
- When new data arrives, select the k samples (usually k is no larger than 20) in the training set that are closest (most similar) to it
- Count the classes among those k samples and assign the most frequent class to the new data
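The three steps above can be sketched in a few lines. This is an illustrative sketch only (the function name `knn_predict` and the query point are made up here; the sample points are the ones used in section 3):

```python
from collections import Counter

import numpy as np

def knn_predict(x, samples, labels, k=3):
    """Euclidean distance to every sample, then a majority vote among the k nearest."""
    # Step 1: distance from x to every training sample
    dists = np.sqrt(((samples - x) ** 2).sum(axis=1))
    # Step 2: labels of the k nearest samples
    nearest = [labels[i] for i in dists.argsort()[:k]]
    # Step 3: majority vote
    return Counter(nearest).most_common(1)[0][0]

samples = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(knn_predict(np.array([0.0, 0.2]), samples, labels))  # 'B'
```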
3. Building a classifier for testing
- Used only to verify that the classifier runs in the local environment (Python 3.6.5, PyCharm 2018.1.2)
import operator
from numpy import tile, sqrt, array

def createDataSet():
    arr = [[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]]
    group = array(arr)
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()

def classify0(inData, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inData to every sample in dataSet
    diffMat = tile(inData, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqrt(sqDistances)
    sortedDistIndicies = distances.argsort()
    # Vote among the k nearest neighbors
    classCount = {}
    for i in range(k):
        votelabel = labels[sortedDistIndicies[i]]
        classCount[votelabel] = classCount.get(votelabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

# The input must have two features, matching the training data
r = classify0([0.0, 0.0], group, labels, 3)
print(r)
4. Normalize the features so that dimensions measured in different units do not carry different weights
from numpy import tile

def autoNorm(dataSet):
    # Column-wise minimum and maximum
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # Scale every feature into [0, 1]: (value - min) / (max - min)
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
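As a quick check, normalizing the sample `group` from section 3 maps each column onto [0, 1]. The snippet repeats the normalization inline so it runs on its own:

```python
from numpy import array, tile

group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
minVals = group.min(0)           # column mins: [0.0, 0.0]
ranges = group.max(0) - minVals  # column ranges: [1.0, 1.1]
m = group.shape[0]
normGroup = (group - tile(minVals, (m, 1))) / tile(ranges, (m, 1))
print(normGroup)  # first column unchanged (range 1.0), second divided by 1.1
```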
5. Read the training sample set from a local file; each line holds four values separated by \t
from numpy import zeros

def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)
    # The first three columns are features, the last column is the class label
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
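To make the expected input format concrete, the snippet below writes a two-line tab-separated file and parses it with the same logic as `file2matrix` (the file name and values are made up for illustration):

```python
import os
import tempfile
from numpy import zeros

# Write a tiny tab-separated file: three feature columns, then an integer label
lines = ["1.0\t2.0\t3.0\t1\n", "4.0\t5.0\t6.0\t2\n"]
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.writelines(lines)

# Parse it the same way file2matrix does
mat = zeros((len(lines), 3))
labels = []
for i, line in enumerate(open(path)):
    parts = line.strip().split('\t')
    mat[i, :] = parts[0:3]
    labels.append(int(parts[-1]))
print(labels)  # [1, 2]
```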
6. Test the KNN error rate, using 90% of the data as training samples and 10% as test data
from demo.归一化特征值 import autoNorm
from demo.读取文件 import file2matrix
from demo.KNN import classify0

def datingClassTest():
    hoRatio = 0.1  # fraction of the data held out for testing
    datamat, labels = file2matrix('demo.txt')
    normMat, ranges, minVals = autoNorm(datamat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first 10% are test vectors; the remaining 90% are the training set
        classResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], labels[numTestVecs:m], 3)
        print('predicted: {}, actual: {}'.format(classResult, labels[i]))
        if classResult != labels[i]:
            errorCount += 1
    print('error rate: {}'.format(errorCount / float(numTestVecs)))
7. Use KNN in practice: enter any three numbers and get back the predicted class
from numpy import array
from demo.读取文件 import file2matrix
from demo.归一化特征值 import autoNorm
from demo.KNN import classify0

def classifyPerson():
    resultList = ['dislike', 'so-so', 'like']
    data1 = float(input('enter the first value: '))
    data2 = float(input('enter the second value: '))
    data3 = float(input('enter the third value: '))
    datamat, labels = file2matrix('demo.txt')
    normmat, ranges, minvals = autoNorm(datamat)
    # Normalize the input with the same ranges and minimums as the training data
    inarr = array([data1, data2, data3])
    classresult = classify0((inarr - minvals) / ranges, normmat, labels, 3)
    # Labels run from 1 to 3, so subtract 1 to index resultList
    print('the likely result is {}'.format(resultList[classresult - 1]))
8. Example 2: recognizing handwritten digits 0-9 from images. Add a helper that reads the pre-generated local txt files so the recognition error rate can be tested
from numpy import zeros

def img2vecotr(filename):
    # Flatten a 32x32 text image of 0s and 1s into a 1x1024 vector
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
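To see the input format without the book's data files, the snippet below generates a fake 32x32 text "image" (a diagonal of 1s, purely illustrative) and flattens it with the same logic as the helper above:

```python
import os
import tempfile
from numpy import zeros

# Build a fake 32x32 image: 0s everywhere except a diagonal of 1s
rows = []
for i in range(32):
    row = ['0'] * 32
    row[i] = '1'
    rows.append(''.join(row) + '\n')
path = os.path.join(tempfile.mkdtemp(), '0_0.txt')
with open(path, 'w') as f:
    f.writelines(rows)

# Flatten it into a 1x1024 vector, character by character
vect = zeros((1, 1024))
with open(path) as fr:
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            vect[0, 32 * i + j] = int(lineStr[j])
print(int(vect.sum()))  # 32 ones, one per row
```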
9. Measure the KNN error rate with the author's training and test sets
from os import listdir
from numpy import zeros

def handTest():
    hwLabels = []
    # Each file is named like 0_13.txt: the digit before the underscore is its class
    trainList = listdir(r'E:\machinelearninginaction\Ch02\digits\trainingDigits')
    m = len(trainList)
    trainMat = zeros((m, 1024))
    for i in range(m):
        filenameStr = trainList[i]
        fileStr = filenameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainMat[i, :] = img2vecotr(rf'E:\machinelearninginaction\Ch02\digits\trainingDigits\{filenameStr}')
    # The test files live in the separate testDigits directory
    testList = listdir(r'E:\machinelearninginaction\Ch02\digits\testDigits')
    errorcount = 0.0
    mTest = len(testList)
    for i in range(mTest):
        filenameStr = testList[i]
        fileStr = filenameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorTest = img2vecotr(rf'E:\machinelearninginaction\Ch02\digits\testDigits\{filenameStr}')
        classifierResult = classify0(vectorTest, trainMat, hwLabels, 3)
        print(f'predicted: {classifierResult}, actual: {classNumStr}')
        if classifierResult != classNumStr:
            errorcount += 1
    print('error rate: {}'.format(errorcount / float(mTest)))
10. Limitations of KNN
- Low execution efficiency: in the digit-recognition example, each of the roughly 900 test vectors needs about 2,000 Euclidean distance computations against the training set, each over 1,024 floating-point dimensions
- KNN is instance-based learning, so it requires training samples that are close to the real data
- The entire data set must be kept in memory, which costs a lot of storage
- It reveals nothing about the underlying structure of the data and cannot characterize the typical features of a class