[Machine Learning] KNN (principle analysis + code implementation)

Let me start with an example:
an unknown animal arrives at the zoo. By comparing its similarity with every animal in the zoo, we pick out the 5 animals (k=5) that look most like it. If 3 of them are horses, one is a donkey, and one is a cow, we conclude that the new animal is a horse.

1. Overview of KNN (K Nearest Neighbors)

  • Machine learning can be divided into: supervised learning, unsupervised learning, weakly supervised learning, and reinforcement learning.
  • Supervised learning is divided into classification problems and regression problems.
  • KNN mainly solves classification problems.
  • It is superficially similar to K-means, but K-means is an unsupervised learning (clustering) algorithm.

2. KNN principle

  • Given a training set in which every sample's label is known, take a new unlabeled input, compare its features against every sample in the training set, and collect the labels of the k most similar (closest) samples. The label that appears most often among those k neighbors is the prediction. (Look back at the example at the beginning of the article; it is very simple.)
  • There are many ways to measure distance: Euclidean distance, Manhattan distance, and so on.
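Both metrics are easy to compute with NumPy; a minimal sketch on two made-up points:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(((a - b) ** 2).sum())

# Manhattan distance: sum of absolute differences
manhattan = np.abs(a - b).sum()

print(euclidean)  # 5.0 (a 3-4-5 right triangle)
print(manhattan)  # 7.0
```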

3. Code implementation:

Core code: KNN implementation

import numpy as np
import operator

def kNN(in_x, x_train, y_train, k):
    x_train_size = x_train.shape[0]  # number of training samples (rows)
    # np.tile() repeats in_x down the rows, then subtract element-wise and square
    distances = (np.tile(in_x, (x_train_size, 1)) - x_train) ** 2
    sum_distances = distances.sum(axis=1)  # axis=1: sum across each row
    sq_distances = sum_distances ** 0.5    # Euclidean distance to each training sample
    # argsort() returns the indices that would sort the distances ascending
    sort_distances_index = sq_distances.argsort()
    classdict = {}  # label -> vote count among the k nearest samples
    for i in range(k):
        vote_label = y_train[sort_distances_index[i]]  # label of the i-th nearest sample
        classdict[vote_label] = classdict.get(vote_label, 0) + 1  # count the vote
    # sort labels by vote count, descending, and return the winner
    sort_classdict = sorted(classdict.items(), key=operator.itemgetter(1), reverse=True)
    return sort_classdict[0][0]
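To sanity-check the function, here is a tiny usage sketch on hand-made 2-D data; the two clusters and their labels are invented for illustration, and the kNN definition is repeated so the snippet runs on its own:

```python
import numpy as np
import operator

def kNN(in_x, x_train, y_train, k):  # same function as above
    x_train_size = x_train.shape[0]
    distances = (np.tile(in_x, (x_train_size, 1)) - x_train) ** 2
    sum_distances = distances.sum(axis=1)
    sq_distances = sum_distances ** 0.5
    sort_distances_index = sq_distances.argsort()
    classdict = {}
    for i in range(k):
        vote_label = y_train[sort_distances_index[i]]
        classdict[vote_label] = classdict.get(vote_label, 0) + 1
    sort_classdict = sorted(classdict.items(), key=operator.itemgetter(1), reverse=True)
    return sort_classdict[0][0]

# Two tiny clusters: class 'A' near (1, 1), class 'B' near (0, 0)
x_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = ['A', 'A', 'B', 'B']

print(kNN(np.array([0.9, 1.0]), x_train, y_train, 3))  # 'A'
print(kNN(np.array([0.1, 0.0]), x_train, y_train, 3))  # 'B'
```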

3.1 Case: Handwritten Digit Recognition

Compare a handwritten-digit image with every image in the data set, where each image is stored as a 32×32 grid of 0s and 1s, and take the label of the most similar image as the class of the image to be classified.
Step 1: convert each 32×32 image into a one-dimensional array, i.e. one sample with 1024 features.
Data source: https://www.manning.com/books/machine-learning-in-action (corresponding to 02/digits)

def img2vector(filename):
    ret_vec = np.zeros((1, 1024))  # one sample with 1024 features
    fr = open(filename)
    for i in range(32):  # process the file line by line
        line_str = fr.readline()
        for j in range(32):
            ret_vec[0, i*32 + j] = int(line_str[j])
    fr.close()
    return ret_vec

test_vec = img2vector('trainingDigits/0_0.txt')
print(test_vec[0, 0:32])   # first row of the image
print(test_vec[0, 32:64])  # second row of the image
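If the Manning data files are not at hand, `img2vector` can still be exercised on a synthetic file; the file name `synthetic_digit.txt` below is made up for illustration, and the function definition is repeated so the snippet runs standalone:

```python
import numpy as np

def img2vector(filename):  # same function as above
    ret_vec = np.zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        line_str = fr.readline()
        for j in range(32):
            ret_vec[0, i*32 + j] = int(line_str[j])
    fr.close()
    return ret_vec

# Write a synthetic 32x32 "image": all zeros except one pixel in the first row
rows = []
for i in range(32):
    row = ['0'] * 32
    if i == 0:
        row[5] = '1'
    rows.append(''.join(row))
with open('synthetic_digit.txt', 'w') as f:
    f.write('\n'.join(rows))

vec = img2vector('synthetic_digit.txt')
print(vec.shape)       # (1, 1024)
print(int(vec.sum()))  # 1 -- only the single '1' pixel is set
```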

Step 2: feed the training set to KNN and report the predictions and error rate on the test set.
Both the training set and the test set are processed with img2vector first; each test image is then classified against the training set, and classification errors are counted.

from os import listdir

def hw_classify():
    # Training set first
    training_file_list = listdir('trainingDigits')  # training-set file list
    m = len(training_file_list)  # number of training samples
    
    hw_x_train = np.zeros((m, 1024))  # processed training features
    hw_y_train = []  # training labels
    
    for i in range(m):
        file_name_str = training_file_list[i]
        # The digit class is encoded in the file name, e.g. '0_0.txt' -> '0'
        file_name_str0 = file_name_str.split('.')[0]
        class_num = file_name_str0.split('_')[0]
        
        hw_y_train.append(class_num)  # add to training labels y_train
        
        hw_x_train[i, :] = img2vector('trainingDigits/%s' % file_name_str)  # add to training features x_train
    
    # Then the test set
    test_file_list = listdir('testDigits')  # test-set file list
    m_test = len(test_file_list)  # number of test files
    
    error_count = 0.0
    
    for i in range(m_test):
        file_name_str = test_file_list[i]
        file_name_str0 = file_name_str.split('.')[0]
        class_num = file_name_str0.split('_')[0]  # true label from the file name
        
        in_x = img2vector('testDigits/%s' % file_name_str)
        
        # Predict the class of each test sample with KNN (k=3)
        classifier_result = kNN(in_x, hw_x_train, hw_y_train, 3)
        
        print('classifier_result:{}, real_answer:{}'.format(classifier_result, class_num))
        
        if classifier_result != class_num:
            error_count += 1.0
            
    print('total number of errors : {}'.format(error_count))
    print('error rate : {}'.format(error_count / m_test))

hw_classify()

Partial running results (screenshot omitted): the script prints the predicted and true label for each test image, followed by the total number of errors and the error rate, where error rate is the fraction of misclassified test images.

4. Defects of KNN

  • Time: classifying a single test sample requires computing its distance to every sample in the training set, which is slow for large training sets.
  • Space: the entire training set must be kept in memory at prediction time.
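The time cost can be reduced somewhat within NumPy itself: instead of tiling one test sample at a time, broadcasting computes all test-versus-train distances in a single vectorized pass. A minimal sketch, with arbitrary made-up array sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.random((100, 4))  # 100 training samples, 4 features
x_test = rng.random((10, 4))    # 10 test samples

# (10, 1, 4) - (1, 100, 4) broadcasts to (10, 100, 4);
# summing over the feature axis gives every pairwise distance at once.
diff = x_test[:, None, :] - x_train[None, :, :]
dists = np.sqrt((diff ** 2).sum(axis=2))  # shape (10, 100)

# Indices of the k nearest training samples for every test sample
k = 3
nearest = np.argsort(dists, axis=1)[:, :k]
print(dists.shape, nearest.shape)  # (10, 100) (10, 3)
```

This trades the per-sample loop for one large intermediate array, so it speeds things up only while that array still fits in memory.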

Origin blog.csdn.net/weixin_44820505/article/details/125810140