Chapter 2 k-Nearest Neighbor Algorithm
Overview of KNN
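The k-nearest neighbor (kNN, k-NearestNeighbor) algorithm is a basic method for classification and regression; here we discuss only its use for classification.
To sum it up in one sentence: one who stays near vermilion turns red, and one who stays near ink turns black. An instance is judged by the company it keeps.
The input to kNN is the feature vector of an instance, which corresponds to a point in feature space; the output is the class of the instance, which can be one of several classes. kNN assumes a training data set in which the class of every instance is already known. To classify a new instance, it predicts from the classes of the k nearest training instances, for example by majority vote. kNN therefore has no explicit learning process.
In effect, kNN uses the training data set to partition the feature vector space, and this partition serves as its classification "model". The choice of k, the distance metric, and the classification decision rule are the three basic elements of kNN.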
KNN scenario
Movies can be classified by genre, so how do we distinguish between action movies and romance movies?
- Action movies: more fights
- Romance movies: more kisses
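Using the number of kisses and fights in each movie as features, a program built on the k-nearest neighbor algorithm can classify the genre of a movie automatically.
Now, given the distances between the unknown movie and every movie in the sample set, sort the movies by increasing distance to find the k nearest ones.
Assume k=3. The three closest movies are then, in order, He's Not Really into Dudes, Beautiful Woman, and California Man.
kNN decides the genre of the unknown movie from the genres of these three nearest movies. All three are romance movies, so we classify the unknown movie as a romance movie.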
KNN principle
How KNN works
- Suppose there is a labeled sample data set (training sample set), which contains the correspondence between each piece of data and its category.
- After inputting new data without labels, compare each feature of the new data with the corresponding feature of the data in the sample set.
- Calculate the distance between the new data and each piece of data in the sample data set.
- Sort all the distances obtained (from small to large, smaller means more similar).
- Take the class labels of the first k samples (k is usually an integer no greater than 20).
- Find the classification label that appears most frequently among the k pieces of data as the classification of the new data.
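The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in this chapter: the function name `knn_classify` and the (fights, kisses) counts are assumptions made up for the example.

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k):
    """Classify `query` by majority vote among its k nearest samples."""
    # Distance from the query to every labeled sample (Euclidean)
    distances = [math.dist(query, s) for s in samples]
    # Sample indices sorted by increasing distance
    order = sorted(range(len(samples)), key=lambda i: distances[i])
    # Labels of the k closest samples
    nearest_labels = [labels[i] for i in order[:k]]
    # The most frequent label among them is the prediction
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: (fights, kisses) per movie
samples = [(3, 104), (2, 100), (1, 81), (101, 10), (99, 5), (98, 2)]
labels = ["romance", "romance", "romance", "action", "action", "action"]

print(knn_classify((18, 90), samples, labels, k=3))
```

With these made-up counts, a movie with few fights and many kisses has only romance movies among its three nearest neighbors.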
KNN in plain terms
Given a training data set, for a new input instance, find the k instances closest to the instance in the training data set. If most of these k instances belong to a certain class, the input instance is classified into this class.
KNN development process
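Collect data: any method
Prepare data: numeric values needed for the distance computation, preferably in a structured data format
Analyze data: any method
Train the algorithm: this step does not apply to the k-nearest neighbor algorithm
Test the algorithm: compute the error rate
Use the algorithm: supply the sample data as input and obtain structured output, run the k-nearest neighbor algorithm to determine which class the input belongs to, and then apply any follow-up processing to the computed class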
KNN algorithm characteristics
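Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
Disadvantages: high computational complexity, high space complexity
Applicable data: numeric and nominal values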
KNN project case
Project Case 1: Optimizing the Matching Effect of Dating Websites
Project Overview
Helen has been using a dating website to find dates. Over time, she found that she had dated three types of people:
- People she didn't like
- People of average charm
- Very charming people
She would like to:
- Date people of average charm on weekdays
- Date very charming people on weekends
- Rule out people she doesn't like right away
She has now collected data that the dating site does not record, which can help classify her matches.
Development Process
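Collect data: a text file is provided
Prepare data: parse the text file with Python
Analyze data: draw two-dimensional scatter plots with Matplotlib
Train the algorithm: this step does not apply to the k-nearest neighbor algorithm
Test the algorithm: use part of the data Helen provided as test samples.
The difference between test samples and non-test samples is:
test samples are data that have already been classified; if the predicted class differs from the actual class, it is marked as an error.
Use the algorithm: build a simple command-line program into which Helen can enter feature data to determine whether the other person is her type.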
Collect data: Provide text file
Helen stores the data on these dating partners in the text file datingTestSet2.txt, 1,000 lines in total. Each sample has the following three features:
- Frequent flyer miles earned per year
- Percent of time spent playing video games
- Liters of ice cream consumed per week
The text file data format is as follows:
40920 8.326976 0.953952 3
14488 7.153469 1.673904 2
26052 1.441871 0.805124 1
75136 13.147394 0.428964 1
38344 1.669788 0.134296 1
Preparing the data: Parsing text files using Python
Parser for converting text records to NumPy
from numpy import zeros

def file2matrix(filename):
    """
    Desc:
        Load the training data.
    Parameters:
        filename: path to the data file
    Returns:
        The data matrix returnMat and the corresponding class labels classLabelVector
    """
    # Read all lines once so the file is opened (and closed) a single time
    with open(filename) as fr:
        lines = fr.readlines()
    # Number of data rows in the file
    numberOfLines = len(lines)
    # Create an empty matrix of matching size
    # e.g. zeros((2, 3)) creates a 2x3 matrix filled with 0
    returnMat = zeros((numberOfLines, 3))  # prepare matrix to return
    classLabelVector = []  # prepare labels to return
    index = 0
    for line in lines:
        # str.strip([chars]) returns a copy with the given leading/trailing characters removed
        line = line.strip()
        # Split the string on '\t'
        listFromLine = line.split('\t')
        # The first three columns are the feature values
        returnMat[index, :] = listFromLine[0:3]
        # The last column is the class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    # Return the data matrix and the corresponding class labels
    return returnMat, classLabelVector
Analyze data: use Matplotlib to draw a two-dimensional scatter plot
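import matplotlib.pyplot as plt
from numpy import array

# datingDataMat and datingLabels are the values returned by file2matrix()
fig = plt.figure()
ax = fig.add_subplot(111)
# Plot columns 0 and 1; marker size and color both scale with the class label
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1], 15.0*array(datingLabels), 15.0*array(datingLabels))
plt.show()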
In the figure below, the first and second columns of the matrix are plotted, which gives a clear picture: the three classes occupy visibly distinct regions, so people with different habits fall into different areas.
- Normalize the data (normalization puts all features on a comparable scale so that they carry equal weight; see https://www.zhihu.com/question/19951858 for more details)
| Sample | Percent of time spent playing video games | Frequent flyer miles earned per year | Liters of ice cream consumed per week | Class |
|---|---|---|---|---|
| 1 | 0.8 | 400 | 0.5 | 1 |
| 2 | 12 | 134,000 | 0.9 | 3 |
| 3 | 0 | 20,000 | 1.1 | 2 |
| 4 | 67 | 32,000 | 0.1 | 2 |
The distance between sample 3 and sample 4:
$$\sqrt{(0-67)^2 + (20000-32000)^2 + (1.1-0.1)^2}$$
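To see why this matters, the short sketch below evaluates the distance and shows how much each feature contributes to it. The variable names are illustrative; the numbers are the values of samples 3 and 4 from the table.

```python
import math

# Samples 3 and 4: (video-game %, flyer miles, ice cream liters)
sample3 = (0, 20000, 1.1)
sample4 = (67, 32000, 0.1)

# Squared difference contributed by each feature
squared_diffs = [(a - b) ** 2 for a, b in zip(sample3, sample4)]
distance = math.sqrt(sum(squared_diffs))

# Share of the squared distance that comes from each feature
shares = [d / sum(squared_diffs) for d in squared_diffs]

print(round(distance, 3))
print([round(s, 6) for s in shares])
```

The distance comes out at roughly 12,000, and almost all of it comes from the flyer-miles column, simply because that feature has the largest numeric range.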
Normalize feature values to eliminate the influence caused by different magnitudes between features
Definition of normalization: normalization restricts the data to be processed (through some algorithm) to a required range. It is done first to make subsequent data processing easier, and second to speed up convergence when the program runs. Common methods are as follows:
- Linear function conversion, with the expression:
  y = (x - MinValue) / (MaxValue - MinValue)
  Note: x and y are the values before and after conversion; MaxValue and MinValue are the maximum and minimum values of the sample.
- Logarithmic function conversion, with the expression:
  y = log10(x)
  Note: this is the logarithm with base 10.
  As shown in the picture:
- Arctangent (inverse tangent) function conversion, with the expression:
  y = arctan(x) * 2 / PI
  As shown in the picture:
- When such a transform is applied to model inputs, it converts each input value into a fixed interval such as [-1, 1], and the inverse transform is applied at the output layer to recover the original scale; the maximum and minimum used are those of the load in the training sample set.
In statistics, normalization serves to bring samples onto a unified statistical distribution: normalizing to the range 0 to 1 gives a probability-style distribution, and normalizing to the range -1 to +1 gives a coordinate-style distribution.
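As a small worked example of the linear conversion, the sketch below rescales the four samples from the table and recomputes the distance between samples 3 and 4. The helper name `min_max_normalize` is an assumption for illustration, and the code assumes no column is constant.

```python
import math

# The four samples from the table: (video-game %, flyer miles, ice cream liters)
samples = [
    (0.8, 400, 0.5),
    (12, 134000, 0.9),
    (0, 20000, 1.1),
    (67, 32000, 0.1),
]

def min_max_normalize(rows):
    """Rescale every column to [0, 1] with y = (x - min) / (max - min)."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    ranges = [max(c) - min(c) for c in columns]
    # Assumes max > min in every column; a constant column would divide by zero
    return [
        tuple((x - lo) / rng for x, lo, rng in zip(row, mins, ranges))
        for row in rows
    ]

normalized = min_max_normalize(samples)
# Distance between samples 3 and 4 after normalization
distance = math.dist(normalized[2], normalized[3])
print(normalized[2], normalized[3])
print(round(distance, 3))
```

After rescaling, the distance is about 1.417 and is driven mainly by the video-game and ice-cream columns, so flyer miles no longer swamp the other two features.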
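from numpy import shape, tile, zeros

def autoNorm(dataSet):
    """
    Desc:
        Normalize feature values to remove the effect of features having different magnitudes.
    Parameters:
        dataSet: the data set
    Returns:
        The normalized data set normDataSet, plus ranges and minVals (the per-feature
        range and minimum), which are needed to normalize new inputs the same way.

    Normalization formula:
        Y = (X - Xmin) / (Xmax - Xmin)
    where min and max are the smallest and largest values of each feature in the data set.
    This function maps numeric feature values into the interval 0 to 1.
    """
    # Per-feature minimum, maximum, and range
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # Range (max - min)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # Matrix of differences from the minimum
    normDataSet = dataSet - tile(minVals, (m, 1))
    # Divide the differences by the range
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals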
Training Algorithm: This step does not apply to the k-nearest neighbor algorithm
Because the test data must be compared with the full training data every time, this process is not necessary.
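kNN algorithm pseudocode:
For each data point in the data set:
    Compute the distance between the target point (the point to classify) and that data point
Sort the distances in ascending order
Select the K shortest distances
Find the most frequent class among these K
Return that class as the prediction for the target point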
import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Distance metric: Euclidean distance
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # Sort the distances in ascending order
    sortedDistIndicies = distances.argsort()
    # Take the k shortest distances and count the class labels among them
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # items() replaces the Python 2-only iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
Test algorithm: Use some of the data provided by Helen as test samples. If the predicted class is different from the actual class, it is marked as an error.
kNN classifier test code for dating website
def datingClassTest():
    """
    Desc:
        Test the classifier on the dating-site data.
    Parameters:
        none
    Returns:
        none; the error rate and the error count are printed
    """
    # Fraction of the data held out for testing (training fraction = 1 - hoRatio)
    hoRatio = 0.1
    # Load the data from the file
    datingDataMat, datingLabels = file2matrix('db/2.KNN/datingTestSet2.txt')
    # Normalize the data
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # m is the number of rows, i.e. the first dimension of the matrix
    m = normMat.shape[0]
    # Number of test samples; rows numTestVecs:m serve as the training samples
    numTestVecs = int(m * hoRatio)
    print('numTestVecs=', numTestVecs)
    errorCount = 0.0
    for i in range(numTestVecs):
        # Classify each test sample against the training rows
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
Use algorithm: Generate a simple command line program, and then Helen can input some characteristic data to determine whether the other party is the type she likes.
Dating website prediction function
from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    # input() replaces the Python 2-only raw_input()
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Feature order must match the data file: miles, game time, ice cream
    inArr = array([ffMiles, percentTats, iceCream])
    # Normalize the new input with the training minimums and ranges before classifying
    classifierResult = classify0((inArr-minVals)/ranges, normMat, datingLabels, 3)
    print("You will probably like this person: ", resultList[classifierResult - 1])
A sample run looks like this:
>>> classifyPerson()
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: in small doses
Project Case 2: Handwritten Digit Recognition System
Project Overview
Build a handwritten digit recognition system based on a KNN classifier that can recognize the digits 0 to 9.
The digits to be recognized are black-and-white images of the same color depth and size, 32 pixels wide by 32 pixels high, stored as text files.
Development Process
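Collect data: text files are provided.
Prepare data: write the function img2vector() to convert the image format into the vector format used by the classifier
Analyze data: inspect the data at the Python command prompt to make sure it meets the requirements
Train the algorithm: this step does not apply to KNN
Test the algorithm: write a function that uses part of the provided data set as test samples. Test samples differ from non-test samples in that they have already been classified; if the predicted class differs from the actual class, it is marked as an error
Use the algorithm: this step is not completed in this example. If you are interested, you can build a complete application that extracts digits from images and recognizes them; the US mail sorting system is a real, running system of this kind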
Collect data: Provide text file
The directory trainingDigits contains about 2,000 examples, roughly 200 samples for each digit. The content of each example is as shown in the figure below. The directory testDigits contains about 900 test samples.
Prepare data: Write function img2vector() to convert image text data into vectors used by the classifier
Convert image text data to vector
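from numpy import zeros

def img2vector(filename):
    # Flatten a 32x32 text image into a 1x1024 row vector
    returnVect = zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                # The pixel at row i, column j goes to position 32*i + j
                returnVect[0, 32*i+j] = int(lineStr[j])
    return returnVect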
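The row-major flattening used by img2vector can be checked without the digit data set. The sketch below is a plain-Python variant (the name `img_to_vector` and the generated test image are assumptions for illustration): it writes a 32x32 text image containing a single 1 and confirms where that pixel lands in the flat vector.

```python
import os
import tempfile

def img_to_vector(path, size=32):
    """Flatten a size x size text image of 0/1 characters into one flat list."""
    vector = [0] * (size * size)
    with open(path) as f:
        for i in range(size):
            line = f.readline()
            for j in range(size):
                # The pixel at row i, column j lands at index size*i + j
                vector[size * i + j] = int(line[j])
    return vector

# Hypothetical image: all zeros except a single 1 at row 2, column 5
rows = [["0"] * 32 for _ in range(32)]
rows[2][5] = "1"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join("".join(r) for r in rows) + "\n")
    path = tmp.name

vector = img_to_vector(path)
os.remove(path)
print(len(vector), vector.index(1))
```

The single set pixel ends up at index 32*2 + 5 = 69 of a 1,024-element vector.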
Analyze the data: Check the data in the Python command prompt to make sure it meets the requirements
Test the img2vector function by entering the following command on the Python command line and compare it to a file opened in a text editor:
>>> testVector = kNN.img2vector('testDigits/0_13.txt')
>>> testVector[0,0:32]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> testVector[0,32:64]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Training Algorithm: This step does not apply to KNN
Because the test data must be compared with the full training data every time, this process is not necessary.
Test the algorithm: Write a function that uses a portion of the provided dataset as a test sample and flags it as an error if the predicted class is different from the actual class
from os import listdir
from numpy import zeros

def handwritingClassTest():
    # 1. Load the training data
    hwLabels = []
    trainingFileList = listdir('db/2.KNN/trainingDigits')  # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    # hwLabels holds the digit (0-9) for each file; row i of trainingMat holds the matching image vector
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        # Convert the 32x32 matrix into a 1x1024 row
        trainingMat[i, :] = img2vector('db/2.KNN/trainingDigits/%s' % fileNameStr)
    # 2. Load the test data
    testFileList = listdir('db/2.KNN/testDigits')  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('db/2.KNN/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
Using the algorithm: This step is not completed in this example. If you are interested, you can build a complete application to extract numbers from images and complete number recognition. The mail sorting system in the United States is a similar system that actually operates.
KNN summary
What is KNN? Definition: Supervised learning? Unsupervised learning?
KNN is a simple supervised learning model with no explicit learning process. It is a non-generalizing method: it builds no general internal model and simply stores the training data. It can be used for both classification and regression.
Basic principle
Put simply: use a distance metric to compute the distance between the query point and every training data point, select the K nearest neighbors of the query point, and apply the classification decision rule to choose the label for the query point.
The three elements of KNN
K: the value of K
The value of K has a significant effect on the label assigned to the query point. When k is small, the approximation error is small and the estimation error is large. When k is large, the approximation error is large and the estimation error is small.
Choosing a smaller k is equivalent to predicting from training instances in a smaller neighborhood. The approximation error of "learning" decreases, because only training instances close (similar) to the input instance affect the prediction. The drawback is that the estimation error of "learning" increases: the prediction becomes very sensitive to the nearby instances, and if those happen to be noise, the prediction will be wrong. In other words, decreasing k makes the overall model more complex and prone to overfitting.
Choosing a larger k is equivalent to predicting from training instances in a larger neighborhood. The advantage is a smaller estimation error. The drawback is a larger approximation error: training instances far from (dissimilar to) the input instance also affect the prediction and can make it wrong. Increasing k makes the overall model simpler.
A k that is too large or too small is undesirable. Cross-validation can be used to select a suitable k.
For more on approximation error and estimation error, see: https://www.zhihu.com/question/60793482
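Selecting k by cross-validation can be sketched with leave-one-out validation in plain Python. Everything here is illustrative: the helper names, the candidate k values, and the toy data (two clusters plus one deliberately mislabeled point) are assumptions, not part of the original projects.

```python
import math
from collections import Counter

def knn_predict(query, samples, labels, k):
    """Majority vote among the k samples closest to `query` (Euclidean distance)."""
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def loo_error_rate(samples, labels, k):
    """Leave-one-out error: classify each sample using all the others."""
    errors = 0
    for i in range(len(samples)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if knn_predict(samples[i], rest_x, rest_y, k) != labels[i]:
            errors += 1
    return errors / len(samples)

# Hypothetical 2-D data: two clusters plus one mislabeled ("noisy") point
samples = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1), (1.3, 1.2),
           (3.0, 3.0), (3.2, 2.9), (2.9, 3.3), (3.1, 3.1), (3.3, 3.2),
           (1.15, 1.05)]
labels = ["A"] * 5 + ["B"] * 5 + ["B"]  # the last point sits inside cluster A

candidates = (1, 3, 5, 7)
for k in candidates:
    print(k, round(loo_error_rate(samples, labels, k), 3))
# Pick the candidate with the lowest leave-one-out error (first one on ties)
best_k = min(candidates, key=lambda k: loo_error_rate(samples, labels, k))
print("best k:", best_k)
```

On this toy data the noisy point misleads several of its neighbors when k=1, while k=3 outvotes it, which mirrors the overfitting argument above.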
Metric/Distance Measure
The distance metric is usually Euclidean distance, but Minkowski or Manhattan distance can also be used, as can distance formulas defined on geographic space. (For more details, see the valid_metric section in sklearn.)
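The relationship between these metrics can be shown with a short sketch: Minkowski distance of order p reduces to Manhattan distance for p=1 and to Euclidean distance for p=2. The function name `minkowski_distance` is illustrative.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski_distance(a, b, p=1))  # Manhattan distance: |3| + |4|
print(minkowski_distance(a, b, p=2))  # Euclidean distance: sqrt(9 + 16)
```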
classification decision (decision rule)
In classification problems, the classification decision is usually to select the label with the most votes through majority rule. In regression problems, it is usually the average of the labels of the K nearest neighbors.
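For the regression case, a minimal sketch replaces the vote with an average over the K nearest targets. The function name `knn_regress` and the one-dimensional data are assumptions made up for the example.

```python
import math

def knn_regress(query, samples, targets, k):
    """Predict a numeric target as the mean over the k nearest samples."""
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    nearest = [targets[i] for i in order[:k]]
    return sum(nearest) / k

# Hypothetical 1-D data where the target is roughly 2 * x
samples = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
targets = [2.1, 3.9, 6.2, 8.0, 9.8]

prediction = knn_regress((2.6,), samples, targets, k=2)
print(round(prediction, 2))
```

Here the two nearest samples are x=3.0 and x=2.0, so the prediction is the mean of 6.2 and 3.9.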
Algorithms (sklearn provides three):
Brute Force: brute-force computation / linear scan
KD Tree: uses a binary tree to split the parameter space along the data dimensions.
Ball Tree: uses a series of hyperspheres to partition the training data set.
The tree-based algorithms have two phases, tree construction and query. Brute Force has no construction phase.
Algorithm features:
Advantages: high accuracy, no assumptions about the data, not sensitive to outliers
Disadvantages: high time and space complexity
Scope of application: continuous values and nominal values
Related methods:
Radius neighbors: find neighbors within a specified radius
Factors affecting the algorithm:
N is the number of samples in the data set and D is the data dimension (number of features).
Total cost:
Brute Force: O[DN^2]
This is the most naive approach: computing the distances between all pairs of training points. Faster implementations exist, such as O(ND + kN) and O(NDK), and the fastest is O[DN]. If you are interested, see: k-NN computational complexity
KD Tree: O[DN log(N)]
Ball Tree: O[DN log(N)], the same order of magnitude as KD Tree. Although building the tree takes longer than for a KD Tree, queries are much faster on highly structured data, even in high dimensions.
Cost per query:
Brute Force: O[DN]
KD Tree: O[D log(N)] when the dimension is relatively small, for example D < 20; otherwise it tends toward O[DN]
Ball Tree: O[D log(N)]
When the data set is relatively small, for example N < 30, Brute Force is the better choice.
Intrinsic dimensionality and sparsity
The intrinsic dimensionality of the data is the dimension d < D of the manifold on which the data lies, which can be embedded linearly or nonlinearly in the parameter space. Sparsity is the degree to which the data fills the parameter space (this differs from the concept used for "sparse" matrices: a data matrix may have no zero entries and still be "sparse" in this sense).
Brute Force query time is not affected.
KD Tree and Ball Tree query faster on data sets with smaller intrinsic dimensionality and greater sparsity. KD Tree benefits less than Ball Tree, because it splits the parameter space along the coordinate axes.
The value of k (number of neighbors)
Brute Force query time is basically unaffected.
For KD Tree and Ball Tree, the larger k is, the slower the query.
When k is a large fraction of N, it is better to use Brute Force.
Number of query points (that is, the number of test samples)
Use Brute Force when there are few query points. When there are many, the tree-based algorithms become worthwhile.
Some additional information about the models in sklearn:
If it is unclear which of KD Tree, Ball Tree, and Brute Force suits your case, set algorithm='auto' and sklearn will try to select the most suitable algorithm for you.
Both a regressor and a classifier are available.
The metric (distance measure) is configurable. In addition, neighbors can be weighted through the weights parameter.
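A minimal sketch of these options with sklearn's KNeighborsClassifier is shown below. The toy data is made up for illustration, and the parameter values are one possible configuration rather than a recommendation.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated 2-D clusters
X = [[1.0, 1.1], [1.0, 1.0], [0.9, 1.2], [0.0, 0.0], [0.0, 0.1], [0.1, 0.0]]
y = ["A", "A", "A", "B", "B", "B"]

clf = KNeighborsClassifier(
    n_neighbors=3,
    algorithm="auto",    # let sklearn choose brute force, KD tree, or ball tree
    weights="distance",  # closer neighbors get a larger vote
    metric="minkowski",
    p=2,                 # p=2 makes Minkowski equal to Euclidean distance
    leaf_size=30,        # only used by the tree-based algorithms
)
clf.fit(X, y)

predictions = clf.predict([[0.1, 0.1], [0.9, 0.9]])
print(predictions)
```

For regression, KNeighborsRegressor in the same module accepts the same neighbor-search parameters.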
The effect of leaf size on KD Tree and Ball Tree
Tree construction time: the larger the leaf size, the faster the tree is built.
Query time: a leaf size that is too large or too small hurts performance. As the leaf size approaches N (the number of training samples), the algorithm is effectively brute force. As the leaf size approaches 1, the time spent traversing the tree during a query grows considerably. The recommended leaf size is 30, which is the default.
Memory: the larger the leaf size, the less memory is needed to store the tree structure.
Nearest Centroid Classifier
The test point is assigned the label whose centroid is closest to it.
The model assumes equal variance in all dimensions. It is a good baseline.
Advanced version: Nearest Shrunken Centroid
It is enabled by setting shrink_threshold.
Effect: removes features that hurt classification, for example reducing the influence of noisy features.
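The basic nearest-centroid rule can be sketched in plain Python. This illustration covers only the unshrunken classifier (no shrink_threshold behavior), and the helper names and toy data are assumptions.

```python
import math

def fit_centroids(samples, labels):
    """Return {label: centroid}, the per-class mean of the training samples."""
    grouped = {}
    for sample, label in zip(samples, labels):
        grouped.setdefault(label, []).append(sample)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def predict_nearest_centroid(query, centroids):
    """Pick the label whose centroid is closest to the query point."""
    return min(centroids, key=lambda label: math.dist(query, centroids[label]))

# Hypothetical 2-D data with two classes
samples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
labels = ["A", "A", "A", "B", "B", "B"]

centroids = fit_centroids(samples, labels)
print(predict_nearest_centroid((2.0, 2.0), centroids))
```

Unlike kNN, only one centroid per class is stored, so prediction cost depends on the number of classes rather than on the number of training samples.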
- Reference information comes from ApacheCN