Machine Learning Notes 01 -- K-Nearest Neighbor Algorithm: Principle and Practice

Table of contents

1. What is the K-nearest neighbor algorithm?

2. Movie classification practice


        Introduction: My earlier study of machine learning was scattered, with no systematic notes. I now plan to work through machine learning again; my references are "Machine Learning in Action" and the "watermelon book" (Zhou Zhihua's "Machine Learning"). Here I summarize the material in my own words to deepen my understanding.

These notes take effort to write; if you repost this article, please credit the source with a link. Thank you.

Let's get to the point:

1. What is the K-nearest neighbor algorithm?

        Let me start with an example. Suppose there are three people, call them A, B, and C, standing at different positions on the same plane. A fourth person, D, arrives on the same plane, and we want to know which of A, B, and C is closest to D. We compute the distance from D to each of the three, sort the distances, and pick the smallest; whoever produced that smallest distance is the one closest to D. To compute the distances, represent the positions as coordinates: A(x1, y1), B(x2, y2), C(x3, y3), D(x4, y4). The distance between two points is the square root of the squared difference of the x coordinates plus the squared difference of the y coordinates; this is the familiar Euclidean distance formula. If the points live in three-dimensional space rather than a plane, each point gains one more coordinate: A(x1, y1, z1), B(x2, y2, z2), C(x3, y3, z3), D(x4, y4, z4). The formula extends to any number of dimensions in the same way: subtract the corresponding coordinates, square the differences, sum them, and take the square root. Concretely:

On a two-dimensional plane, the Euclidean distance between, say, A(x1, y1) and D(x4, y4) is:

$d(A, D) = \sqrt{(x_1 - x_4)^2 + (y_1 - y_4)^2}$

In three-dimensional space it becomes:

$d(A, D) = \sqrt{(x_1 - x_4)^2 + (y_1 - y_4)^2 + (z_1 - z_4)^2}$
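As a minimal sketch (the coordinates below are made up for illustration), the same formula can be written in a few lines of NumPy, for points of any dimension:

import numpy as np

def euclidean_distance(p, q):
    # Subtract corresponding coordinates, square, sum, then take the square root
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(((p - q) ** 2).sum())

# Distance between two made-up points in the plane
print(euclidean_distance([1, 2], [4, 6]))  # 5.0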

        Some readers may feel that all of the above is high-school material with no obvious connection to the K-nearest neighbor algorithm. In fact, the K-nearest neighbor algorithm is built directly on this distance calculation.

Informal definition: classification is performed by measuring the distances between the corresponding feature values of different samples.

        In the example above, the coordinates of A, B, C, and D are their feature values. Assuming everyone is still on the same plane, each person's position is given by an x and a y; we treat x and y as that person's two feature values, and each person is a separate sample (from here on we say "sample" instead of "person"). Suppose A, B, and C remain at their different positions and represent three categories a, b, and c respectively.

        Now suppose new samples (D, E, F, ...) arrive anywhere on this plane. Our goal is to assign each new sample to one of the categories a, b, or c, that is, to infer which category a newly added sample belongs to. We compute the distance from the new sample to each of the samples A, B, and C, sort the distances from smallest to largest, and take the k smallest. This k is where the K in "K-nearest neighbor" comes from.

        Each of those distances belongs to a known sample with a category. We look at the categories of the k nearest samples and assign the new sample to the category that occurs most often among them. For example, suppose sample D has distances d1, d2, d3 to A, B, and C, and sorting in ascending order gives d2, d1, d3. If k = 1, we take only the smallest distance, d2, which is the distance from D to B; B's category is b, so we infer that D belongs to category b.
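The whole procedure fits in a few lines of code. Here is a minimal sketch; the coordinates and categories of A, B, and C are invented purely for illustration:

import numpy as np
from collections import Counter

# Known samples A, B, C with made-up coordinates, representing categories a, b, c
known = np.array([[1.0, 2.0], [5.0, 5.0], [9.0, 1.0]])
categories = ['a', 'b', 'c']

def knn_predict(new_point, k=1):
    # Euclidean distance from the new sample to every known sample
    distances = np.sqrt(((known - new_point) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the categories of the k nearest samples
    votes = Counter(categories[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Like the D example above: with k=1 the single nearest sample decides the category
print(knn_predict(np.array([4.0, 4.5])))  # -> 'b'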

2. Movie classification practice

        Let's work through a simple example, adapted from "Machine Learning in Action". I think it is a good one, and it illustrates the discussion above very well.

        Setup: we want to classify a movie as a romance movie or an action movie based on how many fight scenes and kissing scenes appear in it. The data are below: each row is one sample with two features and a label (category), just as each person's coordinates above were features with an associated label.

Movie name   Feature 1 (fight scenes)   Feature 2 (kissing scenes)   Label (category)
Movie 1      3                          87                           Romance
Movie 2      2                          94                           Romance
Movie 3      4                          102                          Romance
Movie 4      100                        3                            Action movie
Movie 5      99                         4                            Action movie
Movie 6      89                         5                            Action movie

        Note: these numbers are made up by me and are not a real basis for movie classification. The point is only to illustrate the K-nearest neighbor algorithm.

        From the data above, each movie has two feature values and a corresponding label, which is the category it belongs to. Now suppose we are given a movie 7 with only its two feature values, and we want to predict its category. We can write a program to do this; the idea is as follows:

1. Build the dataset.

2. Compute the distances between movie 7's feature values and those of the other movies.

3. Sort the distances and take the k smallest.

4. Count which category occurs most often among the k nearest samples.

5. From steps 1-4, infer the category of movie 7.

Let's implement this below; the code is commented in detail:

# Import the library we need
import numpy as np

# Step 1: build our dataset from the data above
def createDataset():
    # One sample per row, three columns per row; Romance is encoded as 0, Action as 1
    dataset = np.array([[3, 87, 0],
                        [2, 94, 0],
                        [4, 102, 0],
                        [100, 3, 1],
                        [99, 4, 1],
                        [89, 5, 1]])
    # Split the dataset into features and labels
    features = dataset[:, :2]
    labels = dataset[:, -1]
    # Print the features and labels for inspection
    print('features=', features)
    print('labels=', labels)
    return features, labels

# A dictionary for recovering the real category name from the numeric label
labels_true = {'0': 'Romance', '1': 'Action movie'}

# A function that builds the sample we want to test
def testData():
    # Create an array for the two feature values; it has as many rows as the known
    # samples above, and every row holds the same values
    testdata = np.zeros((6, 2), dtype=np.int64)
    # Read the first feature value, converting the input string to an integer
    feature1 = int(input('Number of fight scenes: '))
    # Fill the first column with it
    testdata[:, 0] = feature1
    # Read the second feature value
    feature2 = int(input('Number of kissing scenes: '))
    # Fill the second column with it
    testdata[:, 1] = feature2
    return testdata

# A function that computes the distance between the test sample and the known samples
def calculateDistance(features, testdata):
    # Subtract the corresponding feature values
    distance_mat = testdata - features
    print('after subtraction =', distance_mat)
    # Square the differences
    distance_mat = distance_mat ** 2
    print('after squaring =', distance_mat)
    # Sum each row and take the square root to get the Euclidean distance
    # from the test sample to each known sample
    distance_mat = np.sqrt(distance_mat.sum(axis=1))
    print('distance to each sample =', distance_mat)
    return distance_mat

# Sort the distances, take the k smallest,
# then count which category occurs most often among them
def get_class(distance, features, labels, k=3):
    # argsort gives the indices that would sort the distances in ascending order;
    # unlike sorting the values and searching for them again, this also handles
    # duplicate distances correctly
    sorted_indices = np.argsort(distance)
    print(sorted_indices)
    # The samples carry two possible labels: Romance -> 0, Action -> 1
    label1 = 0
    label2 = 1
    # Occurrence counters for the two categories
    cn1 = 0
    cn2 = 0
    # Look at the categories of the k nearest samples
    for i in sorted_indices[:k]:
        if labels[i] == label1:
            cn1 += 1
        else:
            cn2 += 1
    print(cn1, cn2)
    # Pick the category that occurs more often
    if cn2 > cn1:
        # labels_true is the dictionary defined above, used to
        # recover the real category name
        print(labels_true[str(label2)])
    else:
        print(labels_true[str(label1)])

if __name__ == '__main__':
    # Build the handmade dataset and get its sample features and labels
    features, labels = createDataset()
    # Build the test sample's features
    testdata = testData()
    # Compute the distance from the test sample to each known sample
    distance = calculateDistance(features, testdata)
    # Infer the category of the test sample
    get_class(distance, features, labels)

Run the program and enter the two feature values when prompted; it prints the intermediate results of each step and, finally, the predicted category.

        Although this dataset is not very meaningful in itself, it demonstrates the use of the K-nearest neighbor algorithm well.

        The K-nearest neighbor algorithm is simple, even crude: compute the distance from the test sample to every known sample, take the k smallest distances, count how often each category appears among those k nearest samples, and assign the test sample the category that appears most often.
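For comparison, scikit-learn ships a K-nearest neighbor classifier built on the same idea. Here is a minimal sketch on the movie dataset above, assuming scikit-learn is installed (the test movie's feature values are made up):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

features = np.array([[3, 87], [2, 94], [4, 102], [100, 3], [99, 4], [89, 5]])
labels = np.array([0, 0, 0, 1, 1, 1])  # 0 = Romance, 1 = Action

# Euclidean distance and majority vote over k=3 neighbors, as in the code above
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(features, labels)
print(clf.predict([[18, 90]]))  # few fights, many kisses -> [0], i.e. Romance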

        Shortcomings of the K-nearest neighbor algorithm: if a sample has many features, and the feature values are large, the computation becomes expensive. As the example shows, the algorithm keeps the entire dataset around, so when the dataset is huge this simple, brute-force approach does not perform well. This motivates feature normalization (sketched below), as well as other algorithms.
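As a minimal sketch of the normalization idea, min-max scaling (one common choice) rescales each feature to [0, 1] so that no single feature dominates the distance:

import numpy as np

def min_max_normalize(features):
    # For each column: (value - column min) / (column max - column min)
    mins = features.min(axis=0)
    maxs = features.max(axis=0)
    return (features - mins) / (maxs - mins)

features = np.array([[3, 87], [2, 94], [4, 102],
                     [100, 3], [99, 4], [89, 5]], dtype=float)
print(min_max_normalize(features))  # every value now lies in [0, 1]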

Original post: blog.csdn.net/BaoITcore/article/details/125112475