K-Nearest Neighbor Summary

The k-nearest neighbor (KNN) algorithm is a basic classification and regression method and one of the simplest machine learning methods. This post summarizes only the k-nearest neighbor classification algorithm.

The k-nearest neighbor algorithm is simple and intuitive. It works as follows: given a training data set and a new input instance, find the \(k\) instances in the training set nearest to the new instance; the new instance is then assigned to the class to which the majority of those \(k\) instances belong.

1 Basic elements of the k-nearest neighbor model

The k-nearest neighbor model has three basic elements: the distance metric, the choice of the value of \(k\), and the classification decision rule.

1.1 Distance metric

The distance between two points in the feature space reflects the degree of similarity of the two instances. The feature space of the k-nearest neighbor model is generally the \(n\)-dimensional real vector space \(\mathbb{R}^n\). The distance used is usually the Euclidean distance, but other distances are possible.
Take two instances \(x_i\), \(x_j\) in the feature space, each an \(n\)-dimensional vector. The \(L_p\) distance between \(x_i\) and \(x_j\) is defined as
\[ L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}} \]

where \(p \geq 1\). When \(p = 2\), it is called the Euclidean distance:
\[ L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}} \]
When \(p = 1\), it is called the Manhattan distance:
\[ L_1(x_i, x_j) = \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right| \]
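
As a quick illustration (a minimal NumPy sketch; the points and the helper lp_distance are made up for this example and are not from the original text), the \(L_p\) distance can be computed directly from the definition:

import numpy as np

def lp_distance(xi, xj, p):
    # L_p distance between two n-dimensional vectors, for p >= 1
    return np.sum(np.abs(xi - xj) ** p) ** (1.0 / p)

xi = np.array([1.0, 1.0])
xj = np.array([4.0, 5.0])
print(lp_distance(xi, xj, 1))  # Manhattan distance: |1-4| + |1-5| = 7.0
print(lp_distance(xi, xj, 2))  # Euclidean distance: sqrt(9 + 16) = 5.0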

1.2 Choosing the value of \(k\)

The choice of \(k\) has a large effect on the results: a small \(k\) makes the prediction sensitive to noise in the nearest points (overfitting), while a large \(k\) lets distant, less similar points influence the prediction (underfitting). In practice a relatively small \(k\) is usually chosen by cross-validation.
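
As a rough illustration of how values of \(k\) might be compared empirically (a sketch, not from the original post, reusing the classify0 function defined in section 3 with a simple hold-out split):

def errorRateForK(normMat, labels, k, hoRatio=0.10):
    # Estimate the hold-out error rate of classify0 for a given k
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errors = 0
    for i in range(numTestVecs):
        result = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                           labels[numTestVecs:m], k)
        if result != labels[i]:
            errors += 1
    return errors / float(numTestVecs)

# Hypothetical usage, once normMat and datingLabels exist (section 4.1):
# for k in (1, 3, 5, 11, 21):
#     print(k, errorRateForK(normMat, datingLabels, k))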

1.3 Classification decision rule

The classification decision rule is usually majority voting: the input instance is assigned the class held by the majority of its \(k\) nearest training instances.

2 The k-nearest neighbor algorithm, stated formally

Input: training data set
\[ T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \]
where \(x_i \in \mathcal{X} \subseteq \mathbb{R}^n\) is the feature vector of an instance, \(y_i \in \{c_1, c_2, \ldots, c_K\}\) is the class of the instance, and \(i = 1, 2, \ldots, N\); plus a new instance feature vector \(x\).
Output: the class \(y\) to which instance \(x\) belongs.
(1) According to the given distance metric, find the \(k\) points in the training set \(T\) nearest to \(x\); the neighborhood of \(x\) covering these \(k\) points is denoted \(N_k(x)\);
(2) In \(N_k(x)\), determine the class \(y\) of \(x\) according to the classification decision rule (majority voting):
\[ y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \ldots, N; \; j = 1, 2, \ldots, K \]
where \(I\) is the indicator function: \(I = 1\) when \(y_i = c_j\), and \(I = 0\) otherwise.
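
To make the voting rule concrete (a minimal sketch; the neighbor labels are invented for this example), the \(\arg\max\) above is simply a frequency count over the labels of the \(k\) nearest neighbors:

from collections import Counter

# Labels of the k = 5 nearest neighbors of x (hypothetical example)
neighbor_labels = ['c1', 'c1', 'c2', 'c1', 'c2']
# most_common(1) returns the (label, count) pair with the highest vote count
y = Counter(neighbor_labels).most_common(1)[0][0]
print(y)  # 'c1' -- three votes against two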

3 Implementing the k-nearest neighbor algorithm in code

3.1 Preparation: importing data with Python

from numpy import *  # NumPy scientific computing package
import operator      # operator module

# Create the data set and labels
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
print(group, labels)
# Output:
[[ 1.   1.1]
[ 1.   1. ]
[ 0.   0. ]
[ 0.   0.1]] ['A', 'A', 'B', 'B']

3.2 KNN algorithm implementation

For each point in the data set of unknown class, perform the following operations in sequence:

(1) compute the distance between the current point and every point in the data set of known class;

(2) sort the distances in ascending order;

(3) select the k points nearest to the current point;

(4) determine the frequency of each class among those k points;

(5) return the most frequent of those classes as the predicted class of the current point.

def classify0(inX, dataSet, labels, k):

    # Compute distances
    dataSetSize = dataSet.shape[0]  # number of rows in the training data
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # tile the test point into a matrix the same shape as the training data, then subtract to get difference vectors
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)  # sum the squared differences along each row
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()  # argsort() returns the indices that sort the distances in ascending order

    # Select the k nearest points and tally the class of each
    classCount = {}  # dictionary from class label to vote count, so the frequency of each class is easy to inspect
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]  # class label of the i-th nearest neighbor
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # count occurrences of each class
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # sort by vote count in descending order; itemgetter(1) sorts on the second element of each tuple
    return sortedClassCount[0][0]  # return the most frequent class

classify0([0, 0], group, labels, 3)   # returns 'B'

4 Examples of the k-nearest neighbor algorithm

4.1 Using k-nearest neighbor to improve matches from a dating site

4.1.1 Preparing data: parsing data from a text file

# Parser to convert text records into a NumPy matrix
def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)  # number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # create a NumPy matrix filled with zeros
    classLabelVector = []
    index = 0
    for line in arrayOLines:   # parse the file data into a list, processing one line at a time
        line = line.strip()     # strip the trailing newline characters
        listFromLine = line.split('\t')  # split the line on the tab character \t into a list of elements
        returnMat[index,:] = listFromLine[0:3]  # take the first 3 elements and store them in the feature matrix
        classLabelVector.append(int(listFromLine[-1]))  # store the last column in the vector classLabelVector, explicitly as integers
        index += 1
    return returnMat, classLabelVector


datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
datingDataMat

# Output:
[[  4.09200000e+04   8.32697600e+00   9.53952000e-01]
[  1.44880000e+04   7.15346900e+00   1.67390400e+00]
[  2.60520000e+04   1.44187100e+00   8.05124000e-01]
...,
[  2.65750000e+04   1.06501020e+01   8.66627000e-01]
[  4.81110000e+04   9.13452800e+00   7.28045000e-01]
[  4.37570000e+04   7.88260100e+00   1.33244600e+00]]


datingLabels[0:20]
# Output:
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]

4.1.2 Analyzing data: creating a scatter plot with Matplotlib

import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font (needed when axis labels contain CJK characters)
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly with that font

fig = plt.figure()
ax = fig.add_subplot(111)
#ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])  # second and third columns of datingDataMat
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2], 15.0*array(datingLabels), 15.0*array(datingLabels))  # size and color the points by class label

plt.xlabel('Percentage of time spent playing video games')
plt.ylabel('Liters of ice cream consumed per week')
plt.show()

4.1.3 Preparing data: normalizing numeric values
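
When features are measured on very different scales (frequent flyer miles in the tens of thousands versus liters of ice cream in single digits), the largest feature dominates the distance computation. The fix is to rescale each feature to the range \([0, 1]\):
\[ \text{newValue} = \frac{\text{oldValue} - \min}{\max - \min} \]
This is exactly what the autoNorm function below computes, column by column.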

# Normalize feature values
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # minimum of each column
    maxVals = dataSet.max(0)  # maximum of each column
    ranges = maxVals - minVals  # range of possible values per column
    normDataSet = zeros(shape(dataSet))  # create the return matrix
    m = dataSet.shape[0]   # note: the feature matrix is 1000x3 while minVals and ranges are 1x3, so NumPy's tile() replicates them into matrices the same size as the input
    normDataSet = dataSet - tile(minVals, (m, 1))  # subtract the minimum from the current values
    normDataSet = normDataSet/tile(ranges, (m, 1))  # divide by the range to get the normalized feature values
    return normDataSet, ranges, minVals

normMat, ranges, minVals = autoNorm(datingDataMat)
normMat, ranges, minVals

# Output:
normMat:
[[ 0.44832535  0.39805139  0.56233353]
[ 0.15873259  0.34195467  0.98724416]
[ 0.28542943  0.06892523  0.47449629]
...,
[ 0.29115949  0.50910294  0.51079493]
[ 0.52711097  0.43665451  0.4290048 ]
[ 0.47940793  0.3768091   0.78571804]] 
ranges:
[  9.12730000e+04   2.09193490e+01   1.69436100e+00] 
minVals:
[ 0.        0.        0.001156]

4.1.4 Testing: verifying the classifier as a complete program

# Classifier test for the dating site
def datingClassTest():
    hoRatio = 0.10  # hold out 10% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:], normMat[numTestVecs:m,:],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount/float(numTestVecs)))  # print the overall error rate once, after the loop

datingClassTest()

# Output:

the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 1
the total error rate is: 0.030000
# The classifier's error rate on the dating data set is 3%

4.1.5 Using the algorithm: building a complete usable system

# Dating site prediction function
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    classifierResult = classify0((inArr - minVals)/ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])  # labels are 1..3, so subtract 1 to index resultList

classifyPerson()
# Output:
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: in small doses

4.2 A handwriting recognition system using the k-nearest neighbor algorithm

The system constructed here can recognize the digits 0 through 9. The digits to be recognized have already been processed with image-processing software so that they all have the same color and size: monochrome images 32 pixels wide by 32 pixels high. Storing the images as text does not use memory efficiently, but the images are converted to text format here for ease of understanding.

4.2.1 Preparing data: converting images into test vectors

The directory trainingDigits contains about 2,000 examples, roughly 200 samples per digit; testDigits contains approximately 900 test examples. The data in trainingDigits is used to train the classifier, and the data in testDigits is used to test the classifier's performance. The two sets do not overlap.
To reuse the classifier from the previous two examples, the image data must be formatted as a single vector: each 32 × 32 binary image matrix is converted into a 1 × 1024 vector, which the earlier classifier can then process like any other instance.

Converting an image to a vector: this function creates a 1 × 1024 NumPy array, opens the given file, loops over the first 32 lines, stores the first 32 characters of each line in the NumPy array, and returns the array.

def img2vector(filename):
    f = open(filename)
    returnVect = zeros((1, 1024))  # 1x1024 NumPy array to hold the flattened image
    for i in range(32):  # read the first 32 lines of the file
        line = f.readline()
        for j in range(32):  # store the first 32 characters of each line
            returnVect[0, i*32+j] = int(line[j])
    return returnVect

testVector = img2vector('testDigits/0_13.txt')
testVector[0,0:31]
# Output:
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

testVector[0,32:63]
# Output:
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  1.
  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

4.2.2 Testing: using k-nearest neighbor to recognize handwritten digits

import os  # needed for os.listdir

def handwritingClassTest():
    fileList = os.listdir('trainingDigits')  # get the directory contents
    m = len(fileList)
    traingMat = zeros((m, 1024))
    hwlabels = []
    for i in range(m):  # parse the class digit from the file name, e.g. 0_13.txt is an instance of the digit 0
        fileName = fileList[i]
        prefix = fileName.split('.')[0]
        number = int(prefix.split('_')[0])
        hwlabels.append(number)
        traingMat[i,:] = img2vector('trainingDigits/%s' % fileName)
    testFileList = os.listdir('testDigits')
    m = len(testFileList)
    errorNum = 0.0
    for i in range(m):
        testFileName = testFileList[i]
        prefix = testFileList[i].split('.')[0]
        realNumber = int(prefix.split('_')[0])
        testMat = img2vector('testDigits/%s' % testFileName)
        testResult = classify0(testMat, traingMat, hwlabels, 3)
        if testResult != realNumber:
            errorNum += 1
        print('The classifier came back with: %d, the real answer is: %d' % (testResult, realNumber))
    print("\nthe total number of errors is: %d" % errorNum)
    print('\nthe total error rate is %f' % (errorNum/float(m)))

handwritingClassTest()
# Output:
The classifier came back with: 0, the real answer is: 0
The classifier came back with: 0, the real answer is: 0
.
.
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 7, the real answer is: 7
The classifier came back with: 8, the real answer is: 8
The classifier came back with: 8, the real answer is: 8
The classifier came back with: 8, the real answer is: 8
.
.
The classifier came back with: 9, the real answer is: 9
The classifier came back with: 9, the real answer is: 9
the total number of errors is: 10
the total error rate is 0.010571
# The k-nearest neighbor algorithm's error rate on the handwritten digit data set is about 1.06%

5 The kd-tree

Scanning every training point to find the nearest neighbors is expensive for large data sets; the kd-tree is a tree data structure that partitions \(k\)-dimensional space so that nearest-neighbor search can skip most of the points.
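
As a quick illustration of the idea (a minimal sketch assuming SciPy is available; the data points are invented for this example), SciPy's cKDTree builds a kd-tree and answers k-nearest-neighbor queries without a linear scan:

import numpy as np
from scipy.spatial import cKDTree

# Build a kd-tree over some 2-D training points (hypothetical data)
points = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
tree = cKDTree(points)

# Query the 3 nearest neighbors of a new point
dist, idx = tree.query([0.0, 0.1], k=3)
print(idx)   # indices of the 3 nearest training points
print(dist)  # the corresponding distances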
