Machine Learning in Action, Chapter 2: k-Nearest Neighbor Algorithm

Overview of KNN

The k-nearest neighbor (kNN, k-NearestNeighbor) algorithm is a basic classification and regression method; here we discuss only its use for classification problems.

To sum it up in one sentence: he who stays near vermilion turns red, and he who stays near ink turns black; in other words, a sample takes the class of its nearest neighbors!

The input to the k-nearest neighbor algorithm is an instance's feature vector, corresponding to a point in feature space; the output is the instance's class, which may take several values. The algorithm assumes a training data set in which the class of every instance is already known. To classify a new instance, it predicts the class by majority vote (or a similar rule) among the classes of the k nearest training instances. Consequently, the k-nearest neighbor algorithm has no explicit learning process.

In effect, the k-nearest neighbor algorithm uses the training data set to partition the feature vector space, and this partition serves as its classification "model". The choice of k, the distance metric, and the classification decision rule are the three basic elements of the algorithm.

KNN scenario

Movies can be classified by genre, so how do we distinguish action movies from romance movies?

  1. Action movies: more fights
  2. Romance movies: more kisses

Based on the number of kisses and fights in a movie, a program built with the k-nearest neighbor algorithm can automatically determine the movie's genre.

Figure: movie example data

Now, given the distances computed above between every movie in the sample set and the unknown movie, sort them in increasing order to find the k closest movies.
Suppose k = 3; the three closest movies are He's Not Really into Dudes, Beautiful Woman, and California Man.
The kNN algorithm decides the type of the unknown movie from the types of these three nearest movies; since all three are romance movies, we conclude that the unknown movie is a romance movie.
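The same computation can be sketched in a few lines of Python; the fight/kiss counts below are illustrative values rather than the exact table from the book:

from collections import Counter
import numpy as np

# each row is (number of fights, number of kisses); labels give the known genres
train = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])
labels = ['Romance', 'Romance', 'Romance', 'Action', 'Action', 'Action']
unknown = np.array([18, 90])          # the movie whose genre we want to predict

# Euclidean distance from the unknown movie to every movie in the sample set
dists = np.sqrt(((train - unknown) ** 2).sum(axis=1))
k = 3
nearest = dists.argsort()[:k]          # indices of the k closest movies
vote = Counter(labels[i] for i in nearest)
print(vote.most_common(1)[0][0])       # prints 'Romance' for this data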

KNN principle

How KNN works

  1. Suppose there is a labeled sample data set (training sample set), which contains the correspondence between each piece of data and its category.
  2. After inputting new data without labels, compare each feature of the new data with the corresponding feature of the data in the sample set.
    1. Calculate the distance between the new data and each piece of data in the sample data set.
    2. Sort all the distances obtained (from small to large, smaller means more similar).
    3. Get the classification labels corresponding to the first k (k is generally less than or equal to 20) sample data.
  3. Find the classification label that appears most frequently among the k pieces of data as the classification of the new data.

KNN popular understanding

Given a training data set, for a new input instance, find the k instances closest to the instance in the training data set. If most of these k instances belong to a certain class, the input instance is classified into this class.

KNN development process

Collect data: any method
Prepare data: the numeric values needed for distance calculation, preferably in a structured data format
Analyze data: any method
Train the algorithm: this step does not apply to the k-nearest neighbor algorithm
Test the algorithm: compute the error rate
Use the algorithm: given input sample data and a structured output, run the k-nearest neighbor algorithm to decide which class the input belongs to, then perform follow-up processing on the computed class

KNN algorithm characteristics

Advantages: high accuracy, not sensitive to outliers, no assumptions about the input data
Disadvantages: high computational complexity, high space complexity
Applicable data types: numeric and nominal values

KNN project case

Project Case 1: Optimizing the Matching Effect of Dating Websites

Project Overview

Helen uses a dating website to find a date. After a period of time, she discovered that she had dated three types of people:

  • People she does not like
  • People of average charm
  • Very charming people

She hopes:

  1. Date people of average charm on weekdays
  2. Date very charming people on weekends
  3. Filter out the people she does not like

She has now collected some data that the dating website does not record, and this data can help her categorize her matches.

Development Process
收集数据: a text file is provided
Prepare data: parse the text file with Python
Analyze data: draw a two-dimensional scatter plot with Matplotlib
Train the algorithm: this step does not apply to the k-nearest neighbor algorithm
Test the algorithm: use part of the data Helen provided as test samples.
        The difference between test samples and non-test samples is that
            test samples are data that have already been classified; if the predicted class differs from the actual class, it is counted as an error.
Use the algorithm: build a simple command-line program into which Helen can enter some feature data to determine whether a person is a type she likes.

Collect data: Provide text file

Helen stores the data about these dating partners in the text file datingTestSet2.txt, which contains 1,000 lines in total. Each of Helen's dating partners is described mainly by the following three features:

  • Frequent flyer miles earned per year
  • Percent of time spent playing video games
  • Liters of ice cream consumed per week

The text file data format is as follows:

40920	8.326976	0.953952	3
14488	7.153469	1.673904	2
26052	1.441871	0.805124	1
75136	13.147394	0.428964	1
38344	1.669788	0.134296	1

Preparing the data: Parsing text files using Python

Parser for converting text records to NumPy

from numpy import zeros

def file2matrix(filename):
   """
   Desc:
       Load the training data.
   parameters:
       filename: path to the data file
   return:
       the data matrix returnMat and the corresponding class labels classLabelVector
   """
   fr = open(filename)
   # number of data lines in the file
   numberOfLines = len(fr.readlines())
   # create an empty matrix of the right size
   # e.g. zeros((2, 3)) creates a 2*3 matrix filled with 0
   returnMat = zeros((numberOfLines, 3))  # prepare matrix to return
   classLabelVector = []  # prepare labels to return
   fr = open(filename)
   index = 0
   for line in fr.readlines():
       # str.strip([chars]) -- return a copy of the string with leading/trailing characters removed
       line = line.strip()
       # split the line on '\t'
       listFromLine = line.split('\t')
       # the first three columns are the feature values
       returnMat[index, :] = listFromLine[0:3]
       # the last column is the class label
       classLabelVector.append(int(listFromLine[-1]))
       index += 1
   # return the data matrix returnMat and the corresponding labels classLabelVector
   return returnMat, classLabelVector

Analyze data: use Matplotlib to draw a two-dimensional scatter plot

import matplotlib.pyplot as plt
from numpy import array

# load the data set first, then plot columns 0 and 1
datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
# scale marker size and color by the class label so the three classes stand apart
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1], 15.0*array(datingLabels), 15.0*array(datingLabels))
plt.show()

In the figure below, the first and second column attributes of the matrix are plotted, which gives a good display: three distinct sample-classification regions can be identified, and people with different interests fall into different regions.

Matplotlib scatter plot

Serial number | Percent of time playing video games | Frequent flyer miles per year | Liters of ice cream per week | Sample class
1             | 0.8                                 | 400                           | 0.5                          | 1
2             | 12                                  | 134,000                       | 0.9                          | 3
3             | 0                                   | 20,000                        | 1.1                          | 2
4             | 67                                  | 32,000                        | 0.1                          | 2

The distance between sample 3 and sample 4:
\sqrt{(0-67)^2 + (20000-32000)^2 + (1.1-0.1)^2}

Obviously the frequent flyer miles term dominates this distance simply because its values are so much larger than those of the other two features, which is why the feature values are normalized below.

Normalize feature values to eliminate the influence caused by different magnitudes between features

Normalization definition: normalization means mapping the data you need to process into a certain required range (via some algorithm). It is done first of all for the convenience of subsequent data processing, and secondly to speed up convergence when the program runs. The methods are as follows:

  1. Linear function conversion, the expression is as follows:

    y=(x-MinValue)/(MaxValue-MinValue)

    Note: x and y are the values before and after conversion respectively; MaxValue and MinValue are the maximum and minimum values of the sample respectively.

  2. Logarithmic function conversion, the expression is as follows:

    y=log10(x)

    Description: Logarithmic function conversion with base 10.

    As shown in the picture:

Logarithmic function image

  3. Arctangent (inverse tangent) function conversion, the expression is as follows:

    y=arctan(x)*2/PI

    As shown in the picture:

Arctangent function graph

  4. A two-way linear transformation can also be used: one formula maps the input values into the interval [-1, 1], and a second formula is applied at the output layer to map the results back to the original scale, using the maximum and minimum values of the training sample set in both directions.

In statistics, the purpose of normalization is to summarize the statistical distribution of samples on a uniform scale: normalization to the interval [0, 1] reflects a statistical probability distribution, while normalization to [-1, +1] reflects a statistical coordinate distribution.
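A minimal sketch (not the book's code) of the three transformations above, applied column by column with NumPy; it assumes every feature value is positive so that the logarithm is defined:

import numpy as np

def normalization_examples(x):
    """Illustrative column-wise normalizations; x is a 2-D NumPy array of positive values."""
    min_max = (x - x.min(0)) / (x.max(0) - x.min(0))  # linear scaling into [0, 1]
    log10 = np.log10(x)                               # base-10 logarithmic scaling
    arctan = np.arctan(x) * 2 / np.pi                 # arctangent scaling into (0, 1) for x > 0
    return min_max, log10, arctan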

from numpy import zeros, shape, tile

def autoNorm(dataSet):
    """
    Desc:
        Normalize the feature values to eliminate the influence of features being on different scales.
    parameter:
        dataSet: the data set
    return:
        the normalized data set normDataSet, plus ranges and minVals (the per-feature ranges and
        minimum values), which are not used here but are returned for later use by the caller.

    Normalization formula:
        Y = (X - Xmin) / (Xmax - Xmin)
        where min and max are the smallest and largest values of each feature in the data set.
        The function automatically maps the numeric feature values into the interval [0, 1].
    """
    # compute the minimum, maximum and range of each feature (column)
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    # per-feature range
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # matrix of differences from the minimum values
    normDataSet = dataSet - tile(minVals, (m, 1))
    # divide the differences by the ranges, element-wise
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals

Training Algorithm: This step does not apply to the k-nearest neighbor algorithm

Because the test data must be compared with the full training data every time, this process is not necessary.

kNN algorithm pseudocode:

For every point in the data set:
    compute the distance between the target point (the point to be classified) and that point
    sort the distances from smallest to largest
    take the K smallest distances
    find the class that occurs most often among these K points
    return that class as the predicted value for the target point

import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # distance metric: Euclidean distance
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5

    # sort the distances from smallest to largest
    sortedDistIndicies = distances.argsort()
    # take the K smallest distances and count the class labels among them
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
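A quick interactive usage sketch of classify0 (the tiny data set below is illustrative, not part of Helen's data):

>>> from numpy import array
>>> group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
>>> labels = ['A', 'A', 'B', 'B']
>>> classify0(array([0.0, 0.0]), group, labels, 3)
'B'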

Test algorithm: Use some of the data provided by Helen as test samples. If the predicted class is different from the actual class, it is marked as an error.

kNN classifier test code for dating website

def datingClassTest():
    """
    Desc:
        Test method for the dating-website classifier.
    parameters:
        none
    return:
        number of errors
    """
    # fraction of the data used for testing (training fraction = 1 - hoRatio)
    hoRatio = 0.1  # hold-out ratio: one part for testing, the rest as training samples
    # load the data from the file
    datingDataMat, datingLabels = file2matrix('db/2.KNN/datingTestSet2.txt')  # load data set from file
    # normalize the data
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # m is the number of rows, i.e. the first dimension of the matrix
    m = normMat.shape[0]
    # number of test samples; rows numTestVecs..m are used as training samples
    numTestVecs = int(m * hoRatio)
    print('numTestVecs =', numTestVecs)
    errorCount = 0.0
    for i in range(numTestVecs):
        # classify each test sample
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)

Use algorithm: Generate a simple command line program, and then Helen can input some characteristic data to determine whether the other party is the type she likes.

Dating website prediction function

from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # normalize the input with the training minimums and ranges before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])

The actual operation effect is as follows:

>>> classifyPerson()
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: in small doses

Project Case 2: Handwritten Number Recognition System

Project Overview

Construct a handwritten digit recognition system based on KNN classifier that can recognize the digits 0 to 9.

The digits to be recognized are black-and-white images that have been stored as text files of identical format and size: 32 pixels wide by 32 pixels high.

Development Process
Collect data: text files are provided.
Prepare data: write a function img2vector() to convert the image format into the vector format used by the classifier
Analyze data: inspect the data at the Python prompt to make sure it meets the requirements
Train the algorithm: this step does not apply to KNN
Test the algorithm: write a function that uses part of the provided data set as test samples; the difference
         between test samples and non-test samples is that test samples are data that have already been
         classified, and if the predicted class differs from the actual class, it is counted as an error
Use the algorithm: this step is not completed in this example; if you are interested, you can build a complete
         application that extracts digits from images and recognizes them; the US mail-sorting system is a
         real, running system of this kind

Collect data: Provide text file

The directory trainingDigits contains about 2,000 examples; the content of each example is shown in the figure below, and each digit has about 200 samples. The directory testDigits contains about 900 test examples.

Example of handwritten digits dataset

Prepare data: Write function img2vector() to convert image text data into vectors used by the classifier

Convert image text data to vector

from numpy import zeros

def img2vector(filename):
    # each image file holds 32 lines of 32 characters; flatten it into a 1*1024 vector
    returnVect = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32*i + j] = int(lineStr[j])
    return returnVect

Analyze the data: Check the data in the Python command prompt to make sure it meets the requirements

Test the img2vector function by entering the following command on the Python command line and compare it to a file opened in a text editor:

>>> testVector = kNN.img2vector('testDigits/0_13.txt')
>>> testVector[0,0:31]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> testVector[0,32:63]
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Training Algorithm: This step does not apply to KNN

Because the test data must be compared with the full training data every time, this process is not necessary.

Test the algorithm: Write a function that uses a portion of the provided dataset as a test sample and flags it as an error if the predicted class is different from the actual class

from os import listdir
from numpy import zeros

def handwritingClassTest():
    # 1. load the training data
    hwLabels = []
    trainingFileList = listdir('db/2.KNN/trainingDigits')  # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    # hwLabels stores the digit label (0~9) of each file; trainingMat stores the corresponding image vectors
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        # convert the 32*32 image into a 1*1024 vector
        trainingMat[i, :] = img2vector('db/2.KNN/trainingDigits/%s' % fileNameStr)

    # 2. load the test data
    testFileList = listdir('db/2.KNN/testDigits')  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('db/2.KNN/testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

Using the algorithm: This step is not completed in this example. If you are interested, you can build a complete application to extract numbers from images and complete number recognition. The mail sorting system in the United States is a similar system that actually operates.

KNN summary

What is KNN? Definition: Supervised learning? Unsupervised learning?

KNN is a supervised learning model with a simple, non-explicit learning process; it is a non-generalizing learner. It has applications in both classification and regression.

Basic principle

To put it simply: compute the distance between the query point and every training data point using the distance metric, select the K nearest neighbors of the query point, and then apply the classification decision rule to choose one of their labels as the label of the query point.

KNN three elements

K, the value of K

The choice of K has a significant impact on the label predicted for the query point. When the k value is small, the approximation error is small and the estimation error is large; when the k value is large, the approximation error is large and the estimation error is small.

If you choose a smaller k value, it is equivalent to making predictions with the training instances in a smaller neighborhood. The approximation error of "learning" will be reduced, and only training instances close to (similar to) the input instance will influence the prediction. The disadvantage is that the estimation error of "learning" will increase: the prediction becomes very sensitive to nearby instance points, and if a nearby instance happens to be noise, the prediction will be wrong. In other words, decreasing k makes the overall model more complex and prone to overfitting.

If you choose a larger k value, it is equivalent to using training instances in a larger neighborhood to make predictions. The advantage is that it can reduce the estimation error of learning. But the disadvantage is that the learning approximation error will increase. At this time, training instances that are far away from the input instance (dissimilar) will also affect the prediction, causing the prediction to be wrong. An increase in the k value means that the overall model becomes simpler.

Neither too big nor too small is good. You can use cross validation to select a suitable k value.
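A minimal sketch of choosing k by cross-validation with scikit-learn, reusing the dating data from the project above (the file path and fold count are just examples):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
normMat, ranges, minVals = autoNorm(datingDataMat)

scores = {}
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    # mean 10-fold cross-validation accuracy for this value of k
    scores[k] = cross_val_score(clf, normMat, datingLabels, cv=10).mean()
best_k = max(scores, key=scores.get)
print('best k:', best_k)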

Approximation error and estimation error, please see here: https://www.zhihu.com/question/60793482

Metric/Distance Measure

The distance metric is usually Euclidean distance, but it can also be Minkowski distance or Manhattan distance. It can also be some distance formula in geographical space. (For more details, please refer to the valid_metric section in sklearn)
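As a small illustrative sketch (not from the book), the Minkowski family covers the common cases: p = 2 gives the Euclidean distance and p = 1 gives the Manhattan distance:

import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance between vectors a and b; p=2 is Euclidean, p=1 is Manhattan."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)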

classification decision (decision rule)

In classification problems, the classification decision is usually to select the label with the most votes through majority rule. In regression problems, it is usually the average of the labels of the K nearest neighbors.

Algorithm: (there are three types on sklearn)

Brute Force brute force calculation/linear scan

KD Tree uses a binary tree to bisect the parameter space according to the data dimensions.

Ball Tree uses a series of hyperspheres to bisect the training data set.

Tree structure algorithms have two processes: tree building and query. Brute Force has no building process.
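A minimal scikit-learn sketch of selecting one of these algorithms; the tiny data set and parameter values are placeholders, not recommendations:

from sklearn.neighbors import KNeighborsClassifier

X_train = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 1.1]]   # toy training data
y_train = ['B', 'B', 'A', 'A']

# algorithm can be 'brute', 'kd_tree', 'ball_tree', or 'auto' (let sklearn decide)
clf = KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree', leaf_size=30)
clf.fit(X_train, y_train)
print(clf.predict([[0.9, 1.0]]))   # prints ['A'] for this toy data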

Algorithm features:

Advantages: high accuracy, no assumptions about the data, not sensitive to outliers

Disadvantages: high time complexity and high space complexity

Scope of application: numeric (continuous) values and nominal values

Related sibling model:

radius neighbors: finds the neighbors within a specified radius (instead of the k nearest)

Factors affecting the algorithm:

N is the number of samples in the data set, D is the data dimension (number of features)

Total cost:

Brute Force: O[DN^2]

What is considered here is the most naive approach: computing the distances between all pairs of training points. There are of course faster implementations, such as O(ND + kN) and O(NDK), and the fastest is O[DN]. If you are interested, you can read this link: k-NN computational complexity

KD Tree: O[DN log(N)]

Ball Tree: O[DN log(N)], the same order of magnitude as KD Tree. Although building the tree takes longer than for a KD Tree, the query speed improves greatly on highly structured data, even high-dimensional data.

Query cost:

Brute Force: O[DN]

KD Tree: O[D log(N)] when the dimension is relatively small, e.g. D < 20; otherwise the cost tends toward O[DN]

Ball Tree: O[Dlog(N)]

When the data set is relatively small, such as N<30, Brute Force has more advantages.

Intrinsic Dimensionality and Sparsity

The intrinsic dimensionality of data refers to the dimension d < D of the manifold where the data is located, which can be linear or nonlinear in the parameter space. Sparsity refers to the degree to which the data fills the parameter space (this is different from the concept used in "sparse" matrices, a data matrix may not have zero entries, but in this sense its structure is still "sparse").

Query times for Brute Force are not affected.

For KD Tree and Ball Tree, queries are faster on data sets with smaller intrinsic dimensionality and greater sparsity. The improvement for KD Tree is less pronounced than for Ball Tree because of the way KD Tree bisects the parameter space along the coordinate axes.

The value of k (k neighboring points)

Brute Force's query time is basically unaffected.

But for KD Tree and Ball Tree, the larger k is, the slower the query time is.

When k becomes a large fraction of N, it is better to use Brute Force.

Number of query points (that is, the amount of test data)

Brute Force is used when there are few query points. When there are many query points, the tree structure algorithm can be used.

Some additional information about models in sklearn:

If it is unclear whether KD Tree, Ball Tree or Brute Force fits your scenario, you can simply pass algorithm='auto', which selects the optimal algorithm for you.
Both a regressor and a classifier are available.

metric/distance measure is optional. In addition, the distance can be weighted by weight.
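For example, a short sketch of setting the metric and the distance weighting in scikit-learn (the parameter values are only illustrative):

from sklearn.neighbors import KNeighborsClassifier

# weight neighbors by inverse distance and use the Manhattan metric instead of the default Euclidean
clf = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='manhattan')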

The impact of leaf size on KD Tree and Ball Tree

Tree construction time: the larger the leaf size, the faster the tree is built.

Query time: a leaf size that is either too large or too small is bad. If the leaf size approaches N (the number of training samples), the algorithm is effectively brute force; if it is too small (approaching 1), the time spent traversing the tree during queries increases greatly. The recommended value for leaf size is 30, which is also the default.

Memory: the larger the leaf size, the less memory is needed to store the tree structure.

Nearest Centroid Classifier

The classification decision: the test point is assigned the label whose class centroid is closest to it.

The model assumes equal variance in all dimensions. It is a good baseline.

Advanced version: Nearest Shrunken Centroid

Can be set via shrink_threshold.

Purpose: it can remove features that do not help the classification, for example eliminating the influence of noisy features.
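A minimal usage sketch with scikit-learn (the toy data and threshold value are placeholders):

from sklearn.neighbors import NearestCentroid

X = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 1.0]]   # toy training data
y = ['B', 'B', 'A', 'A']

clf = NearestCentroid()                             # plain nearest-centroid baseline
clf_shrunk = NearestCentroid(shrink_threshold=0.2)  # shrunken-centroid variant
clf_shrunk.fit(X, y)
print(clf_shrunk.predict([[0.9, 0.9]]))             # expected to print ['A'] for this toy data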


  • Reference information comes from ApacheCN

Origin blog.csdn.net/diaozhida/article/details/84957234