Machine Learning - KNN Algorithm


1. Overview of the K-Nearest Neighbor Algorithm

The KNN algorithm is a classification algorithm in supervised learning. Skipping the fancy jargon and putting it in kindergarten terms: it classifies unknown things according to known ones.

Suppose we have the exam scores of a group of students together with their evaluations (excellent or good), and we want to predict the evaluation of a new student, Xiao X. How would we do it?

Going by experience, wouldn't we look for the students whose scores are closest to his? If that was your thought, congratulations: you have already grasped the basic idea of the KNN algorithm.

That's right: we reach our conclusion by comparing distances between samples. First, we compute the Euclidean distance between Xiao X and every other student.

Of course, there are other distance measures, such as the Manhattan distance and the Chebyshev distance, but the Euclidean distance is quite simple, and this article uses it as the example.


For each pair of samples, take the per-column differences, square them, sum them, and take the square root; then sort the resulting distances from smallest to largest. Next we pick a value k (that is, how many of the nearest samples to consider) and let those samples vote. Because each selected sample's label (its evaluation) is already known, we can simply count the labels.
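As a minimal sketch of the distance step (the score vectors below are made up for illustration, not taken from the article's table):

import numpy as np

# hypothetical score vectors: [math, language, English]
xiao_x = np.array([90, 85, 88])
xiao_tang = np.array([92, 80, 90])

diff = xiao_x - xiao_tang
euclidean = np.sqrt((diff ** 2).sum())   # sqrt of the sum of squared differences
manhattan = np.abs(diff).sum()           # for comparison: Manhattan distance
chebyshev = np.abs(diff).max()           # for comparison: Chebyshev distance
print(euclidean, manhattan, chebyshev)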

We first set k = 3 (after computing the Euclidean distances, we take the first three values after sorting).

Here it is clear that Xiao Tang, Xiao Huang, and Xiao Bin are selected, because their Euclidean distances to Xiao X are the smallest.

Counting the votes: excellent gets 2 votes, good gets 1 vote.

So Xiao X is excellent!
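The vote count itself is one line with collections.Counter (the labels below are hypothetical, matching the story above):

from collections import Counter

# labels of the k = 3 nearest neighbors: Xiao Tang, Xiao Huang, Xiao Bin
nearest_labels = ['excellent', 'excellent', 'good']
print(Counter(nearest_labels).most_common(1)[0][0])  # -> excellent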

Careful readers will have noticed that when one feature's scale is much larger than the others', it dominates the distance. To give an extreme example: suppose one subject is scored out of 1000, and Xiao X scores 1000 while Xiao Tang scores 800. The term (1000 - 800)^2 is far larger than the contributions of the other subjects, so the distance between Xiao X and Xiao Tang is huge for that reason alone. That is why our data needs to be normalized.


2. Three Elements of K-Nearest Neighbors

Distance measure

It is the Euclidean distance we just used! But here we normalize the data first, so that every value falls between 0 and 1: find the maximum and minimum of each column, use their difference (max - min) as the denominator, and the current value minus the column minimum as the numerator.
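A minimal sketch of that min-max normalization on a made-up feature matrix:

import numpy as np

scores = np.array([[1000.0, 80.0],
                   [ 800.0, 90.0],
                   [ 900.0, 85.0]])
min_val = scores.min(axis=0)                        # per-column minimum
max_val = scores.max(axis=0)                        # per-column maximum
normed = (scores - min_val) / (max_val - min_val)   # every value now in [0, 1]
print(normed)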

K value selection

The choice of the k value (that is, how many of the distance-sorted neighbors to take) is also something of an art.

In figure (a), k is too small: the prediction depends on too few samples and is easily swayed by noise.

Figure (b) is just right.

In figure (c), k is too large: distant, unrelated samples also get a vote, and the error grows.
[Figure: three panels (a), (b), (c) illustrating k too small, k just right, and k too large]
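One common way to pick k is to hold out part of the labeled data and compare error counts across candidate k values. A toy sketch of that idea (random blob data, not the article's dataset; knn_predict is a hypothetical helper):

import numpy as np
from collections import Counter

def knn_predict(x, train_x, train_y, k):
    d = np.sqrt(((train_x - x) ** 2).sum(axis=1))  # distances to every training sample
    nearest = d.argsort()[:k]                      # indices of the k closest
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
idx = rng.permutation(len(X))                      # shuffle before splitting
X, y = X[idx], y[idx]
test_x, test_y = X[:20], y[:20]
train_x, train_y = X[20:], y[20:]

for k in (1, 5, 15, 45):
    errors = sum(knn_predict(x, train_x, train_y, k) != t
                 for x, t in zip(test_x, test_y))
    print("k =", k, "errors =", errors)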

Classification decision rule

The most commonly used rule is majority voting: whichever class gets the most votes wins!

3. Code

'''
Created on 3/9/2020
@author: ywz
'''


"""
1、距离计算
2、k值
3、决策机制
"""

import numpy as np  # 'import' loads a package/module; 'as' gives it an alias
from numpy import *  # '*' imports all of numpy's functions (tile, zeros, shape, ...)


def classify_knn(inx, data_set, labels, k):
    """
    :param inx: vec need to predict classify
    :param data_set: samples
    :param labels: classes
    :param k: the k of knn
    :return: the class of predict
    """
    data_set_size = data_set.shape[0]
    # numpy's tile(x, (rows, cols)): repeat inx so it matches data_set's shape
    diff_mat = tile(inx, (data_set_size, 1)) - data_set
    sq_diff_mat = diff_mat**2
    # axis gives the direction of the sum: axis=0 sums down columns, axis=1 across rows
    sq_distances = sq_diff_mat.sum(axis=1)
    distances = sq_distances**0.5
    # argsort() returns the indices that would sort the distances ascending
    sorted_indices = distances.argsort()
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_indices[i]]
        if vote_label not in class_count:
            class_count[vote_label] = 0
        class_count[vote_label] += 1
    # the predicted class is the label with the most votes
    class_predict = max(class_count.items(), key=lambda x: x[1])[0]
    return class_predict


def file2matrix(filename):
    """
    Read the sample file and return its feature matrix and label vector.
    :param filename: path to a tab-separated file, 3 features then a label per line
    :return: features_matrix, label_vec
    """
    with open(filename) as fr:
        lines = fr.readlines()
    num_samples = len(lines)
    mat = zeros((num_samples, 3))  # 3 features per sample
    class_label_vec = []
    index = 0
    for line in lines:
        line = line.strip()
        list_line = line.split('\t')
        mat[index, :] = list_line[0:3]
        class_label_vec.append(int(list_line[-1]))
        index += 1
    return mat, class_label_vec



def auto_norm(data_set):
    """
    Min-max normalize each column of data_set into [0, 1].
    :param data_set: feature matrix
    :return: norm_data_set, ranges, min_val
    """
    min_val = data_set.min(0)  # column-wise minimums
    max_val = data_set.max(0)  # column-wise maximums
    ranges = max_val - min_val
    num_samples = data_set.shape[0]
    # normalization: newValue = (oldValue - minValue) / (maxValue - minValue)
    norm_data_set = data_set - tile(min_val, (num_samples, 1))
    norm_data_set = norm_data_set/tile(ranges, (num_samples, 1))
    return norm_data_set, ranges, min_val


def test_knn(file_name):
    """
    Hold out the first 10% of samples as a test set and count kNN errors.
    :return: (k, error_count) for the best k tried
    """
    ratio = 0.1
    data_matrix, labels = file2matrix(file_name)
    nor_matrix, ranges, minval = auto_norm(data_matrix)
    num_samples = data_matrix.shape[0]
    num_test = int(ratio*num_samples)
    k_error = {}
    k = 3
    error_count = 0
    predict_class = []

    # treat the first num_test samples as unlabeled and predict each of them
    # from the remaining samples
    for i in range(num_test):
        classifier_result = classify_knn(nor_matrix[i, :], nor_matrix[num_test:, :],
                                          labels[num_test:], k)
        predict_class.append(classifier_result)
        if classifier_result != labels[i]:
            error_count += 1
    print("k={},error_count={}".format(k, error_count))
    k_error[k] = error_count
    print("errors per k:", k_error)
    print("predicted:", predict_class[:10])
    print("actual:", labels[:10])

    # to search k from 1 to 30, replace the single-k block above with:
    # for k in range(1, 31):
    #     error_count = 0
    #     for i in range(num_test):
    #         classifier_result = classify_knn(nor_matrix[i, :], nor_matrix[num_test:, :],
    #                                           labels[num_test:], k)
    #         if classifier_result != labels[i]:
    #             error_count += 1
    #     print("k={},error_count={}".format(k, error_count))
    #     k_error[k] = error_count

    best = min(k_error.items(), key=lambda x: x[1])  # (k, error_count) with fewest errors
    return best


if __name__ == '__main__':
    samples_file = "datingTestSet2.txt"
    best = test_knn(samples_file)
    print(best)

Origin blog.csdn.net/weixin_52521533/article/details/123801985