Machine Learning Series 2 | k Nearest Neighbor Classification Algorithm

1 Overview

The k-nearest neighbor (kNN) algorithm classifies a sample by measuring the distance between its feature values and the feature values of samples whose class is already known.

2 Advantages and Disadvantages

Advantages: high accuracy, insensitive to outliers, no assumptions about the input data

Disadvantages: high computational complexity, high space complexity, and no trained model can be saved

Applicable data range: numeric and nominal

3 Data Preparation

3.1 Data Preparation

Vector to be tested: the data we want to predict

Feature data set: the training data set, not including the target vector

Label vector: made up of the target variable values

k: the number of most similar samples we look at, generally no more than 20

3.2 Data Normalization

In the last article we talked about normalization. Here is a simpler normalization formula:

new_value=(old_value-min)/(max-min)
 
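old_value: the original value
min: the smallest value of this feature in the data set
max: the largest value of this feature in the data set

Let's see how to implement this normalization in code.

import numpy


class Normalization(object):
    def auto_norm(self, matrix):
        """
        Data cleaning: min-max normalization
        new_value=(old_value-min)/(max-min)
        :param matrix: data matrix (one sample per row, one feature per column)
        :return: normalized matrix, per-column ranges, per-column minimums
        """
        # 0 means the value is taken along each column
        # the minimum of every column, as a vector
        min_value = matrix.min(0)
        # the maximum of every column, as a vector
        max_value = matrix.max(0)
        # the range of every column
        ranges = max_value - min_value

        m = matrix.shape[0]
        # numerator: subtract the column minimum from every row
        norm_matrix = matrix - numpy.tile(min_value, (m, 1))
        # element-wise division, not matrix division (matrix division is numpy.linalg.solve(matA, matB))
        norm_matrix = norm_matrix / numpy.tile(ranges, (m, 1))

        return norm_matrix, ranges, min_value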
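
As a small worked example of the formula new_value=(old_value-min)/(max-min), the sketch below normalizes a toy matrix using NumPy broadcasting instead of numpy.tile. The function name min_max_normalize, the sample values, and the handling of constant columns (mapped to 0 to avoid dividing by zero) are assumptions made for illustration only.

```python
import numpy as np


def min_max_normalize(matrix):
    """Scale every column to [0, 1] with (old_value - min) / (max - min)."""
    matrix = np.asarray(matrix, dtype=float)
    min_value = matrix.min(axis=0)            # per-column minimum
    ranges = matrix.max(axis=0) - min_value   # per-column (max - min)
    # Assumption: a constant column has range 0, so divide by 1 instead
    # and let the whole column become 0.
    safe_ranges = np.where(ranges == 0, 1.0, ranges)
    # Broadcasting applies the per-column vectors to every row.
    return (matrix - min_value) / safe_ranges, ranges, min_value


# Toy data: two features on very different scales.
data = np.array([[10.0, 200.0],
                 [20.0, 400.0],
                 [30.0, 1000.0]])
norm, ranges, min_value = min_max_normalize(data)
print(norm)
```

A new vector to be tested would need the same min_value and ranges applied to it before its distance to the normalized training data is measured.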

4 Principle

4.1 Algorithm

Prepare the data described above.
Enter the new data: the new vector is copied into a matrix with as many rows as the training data set, and the Euclidean distance to each training vector is calculated.
Sort the calculated Euclidean distances in ascending order (the smaller the Euclidean distance, the more similar), obtaining an array of index positions.
Take the first k index positions, use the labels (target variable) of those k samples as keys, and accumulate a count for each label.
Sort this count map in descending order by value.
The first entry after sorting is the most similar class.
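
The steps above can be sketched compactly as follows. This is only an illustrative sketch under stated assumptions: the function name knn_classify, the use of broadcasting, and collections.Counter for the vote count are choices made here, and the four sample points are toy data.

```python
from collections import Counter

import numpy as np


def knn_classify(in_x, data_set, labels, k):
    """Return the majority label among the k training vectors nearest to in_x."""
    data_set = np.asarray(data_set, dtype=float)
    in_x = np.asarray(in_x, dtype=float)
    # Euclidean distance from in_x to every training vector.
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    # Index positions sorted by ascending distance (most similar first).
    order = distances.argsort()
    # Count the labels of the first k index positions.
    votes = Counter(labels[i] for i in order[:k])
    # The label with the highest count wins.
    return votes.most_common(1)[0][0]


# Toy data: two points near (1, 1) labelled 'A', two near (0, 0) labelled 'B'.
train = [[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]]
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0.0, 0.0], train, labels, 3))  # B
```

With k = 3 the three nearest neighbors of [0, 0] carry the labels B, B and A, so the vote is 2 to 1 for B.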
4.2 Euclidean distance

The formula below is quite simple, so here we only explain what its variables mean.

d(x, y) = sqrt( Σ (xi - yi)^2 ), summed for i = 1 ~ n

For example, if each of our data points has 5 features, then
i: the running index; since n is 5, i takes the values 1, 2, 3, 4, 5 in order

n: the largest index, that is, the number of features, 5 here

Σ: the summation symbol; the expression after it is evaluated for each value of i from 1 to n and the results are added up

xi: since i takes the values 1, 2, 3, 4, 5, x1 is the first feature, x2 is the second feature, ...

yi: the same as xi, for the second vector
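
As a worked example, the sketch below computes the Euclidean distance between two vectors with n = 5 features by summing the squared differences of xi and yi for i from 1 to n and taking the square root. The function name euclidean_distance and the sample vectors are made up for illustration.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of (xi - yi)^2 over all features."""
    if len(x) != len(y):
        raise ValueError("both vectors must have the same number of features")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


x = [1, 2, 3, 4, 5]    # x1 ... x5
y = [2, 4, 6, 8, 10]   # y1 ... y5
# Squared differences: 1 + 4 + 9 + 16 + 25 = 55
d = euclidean_distance(x, y)
print(d)  # sqrt(55), about 7.416
```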

5 Code

import operator

import numpy


class kNN(object):

    def createDataSet(self):
        """
        Create a test data set
        :return: matrix, labels
        """
        group = numpy.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
        labels = ['A', 'A', 'B', 'B']
        return group, labels

    def classify0(self, inX, dataSet, labels, k):
        """
        k-nearest neighbor, using the Euclidean distance between two vectors
        :param inX: input vector
        :param dataSet: training sample set
        :param labels: label vector
        :param k: number of nearest neighbors
        :return: the most common label among the k nearest neighbors
        """

        # Compute the Euclidean distance

        # Number of rows in the training set
        dataSetSize = dataSet.shape[0]

        # Copy inX vertically into a matrix with as many rows as dataSet, then subtract the data set
        diffMat = numpy.tile(inX, (dataSetSize, 1)) - dataSet

        # Square every element
        sqDiffMat = diffMat ** 2

        # Sum each row
        sqDistances = sqDiffMat.sum(axis=1)

        # Square root of every value
        distances = sqDistances ** 0.5

        # Indices that sort the distances from smallest to largest
        sortedDistIndicies = distances.argsort()

        # Take the k smallest distances
        classCount = {}
        for i in range(k):
            # Look up the label through the index
            voteIlabel = labels[sortedDistIndicies[i]]
            # Accumulate the count
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

        # Sort by count from largest to smallest
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]


if __name__ == '__main__':
    knn = kNN()
    group, labels = knn.createDataSet()
    result = knn.classify0([0, 0], group, labels, 3)
    print(result)

The result of the code above is B. The code is simple, so you should be able to read it.

6 In Practice

After the simple calculation above, let's move on to a real project. Below we build a small project: a simple handwriting recognition system.

6.1 Data Preparation

First, our data can be downloaded from my github. The digits directory stored under data is the one we need.

6.2 Preparing the Algorithm

First, what steps does the program we write have to go through?

Look at the data: what does it look like, and how are its features discretized?

Write the feature conversion algorithm: turn each sample into a single vector, and stack the vectors into one data matrix.

Input the test vectors and test the accuracy of the algorithm model.

import numpy
import operator
import os
from numpy import *
 
class kNN(object):
    def img2vector(self, filename):
        """
        Convert an image stored as a txt file into a vector
        :param filename: file name
        :return: vector
        """
        # Create a 1024-dimensional vector
        return_vec = numpy.zeros((1, 1024))

        # Load the data into the vector
        with open(filename) as fr:
            for i in range(32):
                line = fr.readline()
                # Load one line of data (32 digits)
                for j in range(32):
                    # Load each digit in turn
                    return_vec[0, i * 32 + j] = int(line[j])
        return return_vec
        
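    def classify0(self, inX, dataSet, labels, k):
        """
        k-nearest neighbor, using the Euclidean distance between two vectors
        :param inX: input vector
        :param dataSet: training sample set
        :param labels: label vector
        :param k: number of nearest neighbors
        :return: the most common label among the k nearest neighbors
        """

        # Compute the Euclidean distance

        # Number of rows in the training set
        dataSetSize = dataSet.shape[0]

        # Copy inX vertically into a matrix with as many rows as dataSet, then subtract the data set
        diffMat = numpy.tile(inX, (dataSetSize, 1)) - dataSet

        # Square every element
        sqDiffMat = diffMat ** 2

        # Sum each row
        sqDistances = sqDiffMat.sum(axis=1)

        # Square root of every value
        distances = sqDistances ** 0.5

        # Indices that sort the distances from smallest to largest
        sortedDistIndicies = distances.argsort()

        # Take the k smallest distances
        classCount = {}
        for i in range(k):
            # Look up the label through the index
            voteIlabel = labels[sortedDistIndicies[i]]
            # Accumulate the count
            classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

        # Sort by count from largest to smallest
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]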
        
    def handle_write_class_test(self, train_data_dirname, test_data_dirname):

        # Load the training set
        labels = []
        train_file_list = os.listdir(train_data_dirname)
        train_data_count = len(train_file_list)
        matrix = numpy.zeros((train_data_count, 1024))
        for i in range(train_data_count):
            file_name_ext = train_file_list[i]
            file_name = file_name_ext.split(".")[0]
            # The part of the file name before "_" is the digit the image shows
            file_num = int(file_name.split("_")[0])
            labels.append(file_num)
            matrix[i, :] = self.img2vector("%s/%s" % (train_data_dirname, file_name_ext))

        # Load the test set
        test_file_list = os.listdir(test_data_dirname)
        err_count = 0.0
        test_data_count = len(test_file_list)
        for i in range(test_data_count):
            file_name_ext = test_file_list[i]
            file_name = file_name_ext.split(".")[0]
            file_num = int(file_name.split("_")[0])
            test_vec = self.img2vector("%s/%s" % (test_data_dirname, file_name_ext))

            # Test
            result = self.classify0(test_vec, matrix, labels, 3)
            bool_result = result == file_num
            if not bool_result:
                err_count = err_count + 1.0
            print("result:%s, real:%d, bool:%s" % (result, file_num, bool_result))

        # The printed value is the error rate: wrong predictions / number of test samples
        print("error count:%f" % (err_count / float(test_data_count)))
 
 
if __name__ == '__main__':
    train_dir = "../data/digits/trainingDigits"
    test_dir = "../data/digits/testDigits"
    knn = kNN()
    knn.handle_write_class_test(train_dir, test_dir)

Finally, we get the following result. The error rate is about 1.2%, which is a good result.

result:0, real:0, bool:True
result:0, real:0, bool:True
result:4, real:4, bool:True
result:9, real:9, bool:True
result:7, real:7, bool:True
result:7, real:7, bool:True
result:1, real:1, bool:True
result:5, real:5, bool:True
result:4, real:4, bool:True
result:3, real:3, bool:True
result:3, real:3, bool:True
error count:0.011628

Origin blog.csdn.net/wulishinian/article/details/104711726