This article walks you through the k-nearest neighbor algorithm (kNN)

One: Description of the k-Nearest Neighbor Algorithm

        The k-nearest neighbor method (k-NN) is a basic classification and regression method proposed by Cover T and Hart P in 1967. It is probably the easiest algorithm in machine learning to understand, largely because it does essentially no explicit learning: it simply memorizes the training data. Its working principle is as follows. There is a sample data set, also called the training set, in which every sample carries a label, so we know which category each sample belongs to. When a new, unlabeled sample arrives, its features are compared with those of every sample in the training set, and the algorithm looks at the labels of the most similar samples (the nearest neighbors). In general only the top k most similar samples are considered, which is where the k in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the category that appears most often among those k neighbors is returned as the prediction for the new sample. Note that with many features a model can fit the training set almost perfectly (a cost of nearly zero) yet still fail to generalize to new data, so good performance on the training set alone is not enough. To summarize, the steps are:

  1. Calculate the distance between each point in the data set of known classes and the current point;
  2. Sort the points in ascending order of distance;
  3. Select the k points with the smallest distance to the current point;
  4. Count how often each category occurs among these k points;
  5. Return the category with the highest count as the predicted category of the current point.

        One remaining issue is illustrated by a simple two-class example. Suppose we choose k=3 and the three nearest neighbors contain two samples of class 2 and one sample of class 1. By simple voting (the minority obeys the majority), the new sample is judged to be class 2. Note, however, that although k=3 brings in three samples, their distances to the new sample are not the same, and a closer sample is generally more similar, so we can also give samples at different distances different weights. For example, we can use the reciprocal of the distance as the weight, so that the closer a neighbor is, the more it contributes to the decision. In practice, k itself is a hyperparameter that has to be chosen.
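As a rough sketch of the weighted-voting idea above (an illustration added here, not part of the original examples), the snippet below tallies the k nearest labels using the reciprocal of the distance as the weight. With a plain majority vote the toy data would be classed as 2, but reciprocal weighting lets the much closer class-1 neighbor win.

import numpy as np

def weighted_vote(distances, labels, k, eps=1e-8):
    # indices of the k closest samples
    order = np.argsort(distances)[:k]
    weights = {}
    for idx in order:
        w = 1.0 / (distances[idx] + eps)      # closer neighbors contribute more
        weights[labels[idx]] = weights.get(labels[idx], 0.0) + w
    return max(weights, key=weights.get)      # label with the largest total weight

# Toy data: within k=3 there are two class-2 neighbors and one class-1 neighbor,
# but the class-1 neighbor is far closer, so the weighted vote returns 1.
dists = np.array([0.5, 3.0, 3.2, 9.0])
labels = [1, 2, 2, 1]
print(weighted_vote(dists, labels, k=3))      # prints 1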

Two: Example 1 - Simple kNN - classifying a movie by genre

Scenario description: the data set (plotted below) gives the number of fight scenes and kissing scenes for 4 movies. Given a new movie (10 fight scenes, 101 kissing scenes), determine what type of movie it is.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import operator   # used later by classify_2 to sort the vote counts
data01 = pd.read_csv('knn_data1.txt', names=['kiss','fight','type'])
data01
lovetype = data01[data01.type=='love']
actiontype = data01[data01.type=='action']
fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(lovetype['kiss'], lovetype['fight'], s=50, c='b', marker='o', label='love')
ax.scatter(actiontype['kiss'], actiontype['fight'], s=50, c='r', marker='x', label='action')
ax.legend()
ax.set_xlabel('kiss number')
ax.set_ylabel('fight number')

test_point = [101, 10]   # kiss = 101, fight = 10
ax.scatter(test_point[0], test_point[1], s=50, c='g', marker='.', label='test')
plt.show()


def classify_1(input, data, K):    # e.g. input = [101, 20]
    datax = data.iloc[:, :-1].to_numpy()   # take the first two columns (the features)
    dataSize = datax.shape[0]              # dataSize = 4
    #### compute the Euclidean distance
    diff = np.tile(input, (dataSize, 1)) - datax  # diff = array([[11, 7], [13, 5], [94, -91], [92, -78]])
    sqdiff = diff ** 2                     # sqdiff = array([[121, 49], [169, 25], [8836, 8281], [8464, 6084]])
    squareDist = np.sum(sqdiff, axis=1)    # sum each row: [  170,   194, 17117, 14548]
    dist = squareDist ** 0.5               # [ 13.03840481,  13.92838828, 130.83195328, 120.61509027]

    #### sort by distance
    sortedDistIndex = np.argsort(dist)     # argsort() returns the indices that sort the values in ascending order: [0, 1, 3, 2]

    #### count the votes
    classCount = {}
    for i in range(K):
        voteLabel = data.type[sortedDistIndex[i]]
        # count how many of the K selected samples fall in each category
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
        # classCount = {'action': 1, 'love': 2}

    # take the category with the largest count
    maxCount = 0
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            classes = key
    return classes
test01 = [101,20]
test_class = classify_1(test01, data01, 3)
print(test_class)              #love
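For comparison only (not part of the original article, and assuming scikit-learn is available), the same prediction could also be obtained with sklearn's KNeighborsClassifier on the data01 frame loaded above:

from sklearn.neighbors import KNeighborsClassifier

X = data01[['kiss', 'fight']].to_numpy()
y = data01['type']
knn = KNeighborsClassifier(n_neighbors=3)   # same K as the classify_1 call above
knn.fit(X, y)
print(knn.predict([[101, 20]]))             # expected to agree with classify_1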

Three: Example 2 - Complex kNN (more than two dimensions + normalization) - dating-site match classification

The previous example was relatively simple, so its workflow was not spelled out. This time the more general steps are given:

  1. Collect data: you can use crawlers to collect data, or use free or paid data provided by third parties. Generally, the data is placed in a txt file and stored in a fixed format so it is easy to parse and process.
  2. Prepare data: parse and preprocess the data with Python.
  3. Analyze data: the data can be analyzed in many ways, for example visualized with Matplotlib.
  4. Test the algorithm: calculate the error rate.
  5. Use the algorithm: if the error rate is within an acceptable range, the k-nearest neighbor classifier can be run on new data.

After that is a more complex example:

Scenario description:

        Ms. Helen has been using online dating sites to find a date who suits her. Although the sites suggest different candidates, she does not like everyone they recommend. After some reflection, she found that the people she has dated can be divided into three categories: dislike, a little like, and very like. The features used are:

        Frequent flyer miles earned per year, percentage of time spent playing video games, and liters of ice cream consumed per week. (Continuing from the code above.)

#(1) input
fr = open('ex3data2.txt', 'r')
arrayOLines = fr.readlines()        # read all lines of the file
numberOfLines = len(arrayOLines)    # number of lines in the file
returnMat = np.zeros((numberOfLines, 3))   # parsed feature matrix: numberOfLines rows, 3 columns
classLabelVector = []               # class label vector to return
index = 0                           # row index

for line in arrayOLines:
    line = line.strip()             # strip() with no argument removes surrounding whitespace ('\n', '\r', '\t', ' ')
    listFromLine = line.split('\t') # split the line on the '\t' separator
    returnMat[index, :] = listFromLine[0:3]   # the first three columns go into the feature matrix returnMat
    # map the text label to a number: 1 = didntLike, 2 = smallDoses, 3 = largeDoses
    if listFromLine[-1] == 'didntLike':
        classLabelVector.append(1)
    elif listFromLine[-1] == 'smallDoses':
        classLabelVector.append(2)
    elif listFromLine[-1] == 'largeDoses':
        classLabelVector.append(3)
    index += 1
fr.close()
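# Alternative sketch (added here, not the original author's code): assuming 'ex3data2.txt'
# really is tab-separated with the text label in the last column, the same matrices could
# likely be built with pandas instead of the manual loop above; the column names are
# made up for illustration:
#   df = pd.read_csv('ex3data2.txt', sep='\t', header=None,
#                    names=['ffMiles', 'gameTime', 'iceCream', 'label'])
#   returnMat = df[['ffMiles', 'gameTime', 'iceCream']].to_numpy()
#   classLabelVector = df['label'].map(
#       {'didntLike': 1, 'smallDoses': 2, 'largeDoses': 3}).tolist()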
#(2)picture
fig, axs = plt.subplots(nrows=2, ncols=2,sharex=False, sharey=False, figsize=(13,8))
numberOfLabels = len(classLabelVector)
LabelsColors = []
for i in classLabelVector:
	if i == 1: 
		LabelsColors.append('black') #didntLike
	if i == 2:
		LabelsColors.append('orange') #smallDoses
	if i == 3:
		LabelsColors.append('red') #largeDoses

# Scatter plot of column 1 (frequent flyer miles) vs. column 2 (game time); point size 15, alpha 0.5
axs[0][0].scatter(x=returnMat[:,0], y=returnMat[:,1], color=LabelsColors, s=15, alpha=.5)
# set the x-axis and y-axis labels
axs0_xlabel_text = axs[0][0].set_xlabel(u'fly distance')
axs0_ylabel_text = axs[0][0].set_ylabel(u'game time')

# Scatter plot of column 1 (frequent flyer miles) vs. column 3 (ice cream); point size 15, alpha 0.5
axs[0][1].scatter(x=returnMat[:,0], y=returnMat[:,2], color=LabelsColors, s=15, alpha=.5)
# set the x-axis and y-axis labels
axs1_xlabel_text = axs[0][1].set_xlabel(u'fly distance')
axs1_ylabel_text = axs[0][1].set_ylabel(u'icecream amount')

# Scatter plot of column 2 (game time) vs. column 3 (ice cream); point size 15, alpha 0.5
axs[1][0].scatter(x=returnMat[:,1], y=returnMat[:,2], color=LabelsColors, s=15, alpha=.5)
# set the x-axis and y-axis labels
axs2_xlabel_text = axs[1][0].set_xlabel(u'game time')
axs2_ylabel_text = axs[1][0].set_ylabel(u'icecream amount')

plt.show()


#(3) build the kNN classifier
def classify_2(inX, dataSet, labels, k):
    # shape[0] returns the number of rows of dataSet
    dataSetSize = dataSet.shape[0]
    # repeat inX dataSetSize times along the rows (once along the columns), then subtract dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    # square the element-wise differences
    sqDiffMat = diffMat**2
    # sum() adds all elements, sum(0) sums columns, sum(1) sums rows
    sqDistances = sqDiffMat.sum(axis=1)
    # take the square root to get the Euclidean distances
    distances = sqDistances**0.5
    # indices that sort distances in ascending order
    sortedDistIndices = distances.argsort()
    # dictionary that counts how many times each class appears
    classCount = {}
    for i in range(k):
        # class of the i-th nearest neighbor
        voteIlabel = labels[sortedDistIndices[i]]
        # dict.get(key, default=None) returns the value for the key, or the default if the key is missing
        # count the votes per class
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # in Python 3, items() replaces the Python 2 iteritems()
    # key=operator.itemgetter(1) sorts by the dictionary values
    # (itemgetter(0) would sort by the keys); reverse=True sorts in descending order
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    print(sortedClassCount)
    # return the most frequent class, i.e. the predicted class
    return sortedClassCount[0][0]

After that, the data is normalized. If we computed the distance directly with the formula used so far, for example:

distance = √((0 − 67)² + (20000 − 32000)² + (1.1 − 0.1)²)

        it is easy to see that the attribute with the largest numeric range dominates the result; that is, the annual frequent flyer miles influence the distance far more than the other two features, the percentage of time spent playing video games and the liters of ice cream consumed per week. The only reason for this is that frequent flyer miles are numerically much larger than the other feature values. But Helen believes the three features are equally important, so as one of three equally weighted features, frequent flyer miles should not dominate the calculation so heavily.

       When dealing with features whose values have very different ranges, the usual approach is to normalize them, for example to the range 0 to 1 or -1 to 1. The following formula converts a feature value from an arbitrary range into a value in the interval 0 to 1:

newValue = (oldValue - min) / (max - min)

# normalization
def autoNorm(dataSet):
    # minimum and maximum of each column
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # range between the maximum and the minimum
    ranges = maxVals - minVals
    # np.shape(dataSet) returns the matrix dimensions of dataSet
    normDataSet = np.zeros(np.shape(dataSet))
    # number of rows of dataSet
    m = dataSet.shape[0]
    # subtract the minimum from the original values
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    # divide by the range (max - min) to get the normalized data
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    # return the normalized data, the ranges, and the minimum values
    return normDataSet, ranges, minVals
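# Quick sanity check (an added sketch, not from the original): after autoNorm every
# feature column should lie in [0, 1], which is easy to verify:
#   normCheck, _, _ = autoNorm(returnMat)
#   print(normCheck.min(axis=0), normCheck.max(axis=0))   # expect roughly [0 0 0] and [1 1 1]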
# test the accuracy
def datingClassTest():
    # use ten percent of all data as the hold-out test set
    hoRatio = 0.10
    # normalize the data; returns the normalized matrix, the ranges and the minimum values
    normMat, ranges, minVals = autoNorm(returnMat)
    # number of rows of normMat
    m = normMat.shape[0]
    # number of test samples (ten percent)
    numTestVecs = int(m * hoRatio)
    # counter for misclassified samples
    errorCount = 0.0

    for i in range(numTestVecs):
        # the first numTestVecs samples are the test set, the remaining m - numTestVecs samples are the training set
        classifierResult = classify_2(normMat[i,:], normMat[numTestVecs:m,:], classLabelVector[numTestVecs:m], 5)
        if classifierResult != classLabelVector[i]:
            errorCount += 1.0
    print("correct: %d / %d, error rate: %.2f%%" % (numTestVecs - errorCount, numTestVecs, errorCount / numTestVecs * 100))
# test
resultList = ['dislike', 'a little like', 'very like']
# user input for the three features
precentTats = 15   # percentage of time spent playing video games
ffMiles = 100      # frequent flyer miles earned per year
iceCream = 1       # liters of ice cream consumed per week
# normalize the training set
normMat, ranges, minVals = autoNorm(returnMat)
# build the NumPy array for the test sample (same column order as returnMat)
inArr = np.array([ffMiles, precentTats, iceCream])
# normalize the test sample with the training-set ranges and minimums
norminArr = (inArr - minVals) / ranges
# get the classification result
classifierResult = classify_2(norminArr, normMat, classLabelVector, 3)
# print the result
print("You may %s this man." % (resultList[classifierResult - 1]))
datingClassTest()

Input data: 15, 100, 1 (game-time percentage, frequent flyer miles, liters of ice cream).

Running the code gives the classification of the new sample and the accuracy on the hold-out set. The results are good: K=5 was used in the accuracy test, and with K=4 the accuracy can be improved to 97%!
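Since the error rate clearly depends on K, a small sweep makes that comparison easy to reproduce. The sketch below is an assumed variant of datingClassTest with K passed in as a parameter rather than hard-coded (it reuses returnMat, classLabelVector, autoNorm and classify_2 defined above; note that the print inside classify_2 makes the output fairly verbose):

def datingClassTestK(k):
    # same hold-out test as datingClassTest, but with K as a parameter
    hoRatio = 0.10
    normMat, ranges, minVals = autoNorm(returnMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        result = classify_2(normMat[i, :], normMat[numTestVecs:m, :],
                            classLabelVector[numTestVecs:m], k)
        if result != classLabelVector[i]:
            errorCount += 1.0
    return errorCount / float(numTestVecs)

for k in range(3, 8):
    print("K = %d, error rate = %.2f%%" % (k, datingClassTestK(k) * 100))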

Summary 1: Advantages of kNN: high accuracy, insensitivity to outliers, no assumptions about the input data. Disadvantages: high computational and space complexity.

Summary 2: kNN works with both numeric and nominal values; k is generally an integer no greater than 20.

The data in this article comes from classwork; if there is any unintended overlap, please contact me and I will change it!


Origin: blog.csdn.net/yyfloveqcw/article/details/123964223