k-NN algorithm

Table of Contents

1>k-NN algorithm

2>Use OpenCV to implement k-NN algorithm


1>k-NN algorithm

Understand the k-NN algorithm:

The k-nearest neighbor (k-NN) algorithm is a basic method for classification and regression. Here we discuss only its use for classification: given a training data set and a new input instance, find the k training instances closest to the new instance. If most of these k instances belong to a certain class, the input instance is assigned to that class. This is the familiar idea of majority rule.

As shown in the figure above, there are two classes of sample data, drawn as blue squares and red triangles, and the green dot is the point to be classified. Deciding which class the green dot belongs to is the problem the k-NN algorithm solves.

Below we classify the green dot according to the idea of the k-nearest neighbor algorithm:

  • If k=3, the three points nearest to the green dot are 2 red triangles and 1 blue square. By majority vote, the green dot is classified as a red triangle.
  • If k=5, the five points nearest to the green dot are 2 red triangles and 3 blue squares. By majority vote, the green dot is classified as a blue square.

As the example shows, the idea behind the k-nearest neighbor algorithm is very simple: find the k training instances nearest to the new point and assign it the class that occurs most often among them. The example also shows that the result can depend on the choice of k.
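The voting procedure can be sketched in a few lines of NumPy. This is only an illustration, not the OpenCV implementation used later: the function name knn_classify and the coordinates are hypothetical, chosen so that the green dot has 2 red and 1 blue neighbor for k=3, and 2 red and 3 blue neighbors for k=5.

```python
import numpy as np

def knn_classify(train_points, train_labels, query, k):
    """Return the majority label among the k training points nearest to query."""
    # Squared Euclidean distance from the query to every training point
    sq_dists = np.sum((train_points - query) ** 2, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(sq_dists)[:k]
    # Count the votes per label and return the label with the most votes
    votes = np.bincount(train_labels[nearest])
    return int(np.argmax(votes))

# Made-up coordinates (assumption): 1 = red triangle, 0 = blue square
points = np.array([[1.0, 0.0], [0.0, 1.5], [-2.0, 0.0], [0.0, -3.0], [3.5, 0.0]])
labels = np.array([1, 1, 0, 0, 0])
green_dot = np.array([0.0, 0.0])

print(knn_classify(points, labels, green_dot, k=3))  # 1: 2 red vs 1 blue
print(knn_classify(points, labels, green_dot, k=5))  # 0: 2 red vs 3 blue
```

With two classes and an odd k there can be no tie; in this sketch a tie would go to the smaller label, because np.argmax returns the first maximum.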

2>Use OpenCV to implement k-NN algorithm

To make the example concrete, we first give the training data a setting:

The data points are houses on a town map, and each data point has two attributes:

  • Its coordinates (x,y) on the map.
  • A category label: the category of a red triangle is 1, and the category of a blue square is 0.

Open a new IPython session:

ipython

Import all the necessary modules:

import numpy as np
import cv2
import matplotlib.pyplot as plt
%matplotlib
plt.style.use('ggplot')

Fix the seed of the random number generator so that the results are reproducible:

np.random.seed(42)

Assuming that the range of the town map is 0≤x<100 and 0≤y<100, randomly select a location on the map:

single_data_point=np.random.randint(0, 100, 2)
single_data_point
# Result: array([51, 92])

This code draws 2 random integers from 0 to 99 (the upper bound 100 is exclusive). The first integer is used as the x-coordinate of the data point on the map and the second as its y-coordinate.
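np.random.randint treats its upper bound as exclusive, which is why the coordinates satisfy 0≤x<100. A quick check, using an arbitrary seed and sample count chosen only for this illustration:

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used only for this check
samples = np.random.randint(0, 100, size=10_000)

# The smallest possible value is 0 and the largest is 99; 100 never appears
print(samples.min(), samples.max())
print((samples == 100).any())  # False
```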

Choose a label for this data point:

single_label=np.random.randint(0, 2)
single_label
# Result: 0

The result indicates that the category of this data point is 0, and we treat it as a blue square.

We wrap this process in a function whose inputs are the number of data points to generate (num_samples) and the number of features of each data point (num_features):

def generate_data(num_samples, num_features=2):
    # The data matrix has num_samples rows and num_features columns;
    # each element is a random integer in the range [0, 100):
    data_size=(num_samples, num_features)
    train_data=np.random.randint(0, 100, size=data_size)

    # Create a vector of random integer labels in the range [0, 2), one per sample:
    labels_size=(num_samples, 1)
    labels=np.random.randint(0, 2, size=labels_size)

    # Return the generated data (OpenCV expects the features as float32):
    return train_data.astype(np.float32), labels

To test the function, first generate any number of data points, for example, 11 data points, and randomly select their coordinates:

train_data, labels=generate_data(11)
train_data
'''
Result:
array([[71., 60.],
       [20., 82.],
       [86., 74.],
       [74., 87.],
       [99., 23.],
       [ 2., 21.],
       [52.,  1.],
       [87., 29.],
       [37.,  1.],
       [63., 59.],
       [20., 32.]], dtype=float32)
'''

From the above results, we can see that the train_data variable is an 11×2 array in which each row represents a single data point. You can use array indexing to get the first data point and its corresponding label:

train_data[0], labels[0]
# Result: (array([71., 60.], dtype=float32), array([1]))

This result tells us: the first data point is a red triangle (because its category is 1), and its coordinate position on the town map is (x,y)=(71,60).

You can use Matplotlib to plot this data point on the town map:

plt.plot(train_data[0, 0], train_data[0, 1], '^r')  # red triangle, because its label is 1
plt.xlabel('x coordinate')
plt.ylabel('y coordinate')

This produces the following plot:

If you want to display the whole training data set at once, you can write a function:

def plot_data(all_blue, all_red):
    plt.figure(figsize=(10, 6))

    # Draw all blue data points as blue squares (color 'b', marker 's').
    # all_blue is an N×2 array, where N is the number of samples:
    # all_blue[:, 0] holds the x-coordinates and all_blue[:, 1] the y-coordinates.
    plt.scatter(all_blue[:, 0], all_blue[:, 1], c='b', marker='s', s=180)

    # The red data points are handled the same way:
    plt.scatter(all_red[:, 0], all_red[:, 1], c='r', marker='^', s=180)

    # Set the axis labels:
    plt.xlabel('x coordinate (feature 1)')
    plt.ylabel('y coordinate (feature 2)')

The function takes an array of all blue square data points (all_blue) and an array of all red triangle data points (all_red).

To test this function, you first need to split the data points into red ones and blue ones. The following expression builds a boolean mask that is True wherever the labels array equals 0:

labels.ravel()==0
'''
Result:
array([False, False, False,  True, False,  True,  True,  True,  True,
        True, False])
'''

The rows with label 0 in the train_data created earlier are the blue data points:

blue=train_data[labels.ravel()==0]

The same can be done for all red data points:

red=train_data[labels.ravel()==1]
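The same masking pattern can be tried on a tiny data set. The sketch below uses made-up values and hypothetical names (demo_data, demo_labels) so that it runs on its own; the labels have shape (N, 1), as returned by generate_data, which is why ravel() is needed before indexing rows.

```python
import numpy as np

# Made-up data (assumption): 4 points with labels of shape (N, 1)
demo_data = np.array([[10., 20.], [30., 40.], [50., 60.], [70., 80.]], dtype=np.float32)
demo_labels = np.array([[0], [1], [0], [1]])

mask = demo_labels.ravel() == 0   # flatten (N, 1) to (N,) so it can select rows
demo_blue = demo_data[mask]       # rows whose label is 0
demo_red = demo_data[~mask]       # remaining rows; valid only with two classes

print(demo_blue)
print(demo_red)
```

Negating the mask is a shortcut that works only because there are exactly two classes; comparing against 1, as in the text, is the more general form.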

Finally, draw all the data points:

plot_data(blue, red)

This produces the following plot:

Create a new classifier:

knn=cv2.ml.KNearest_create()

Pass the training data into the train method:

knn.train(train_data, cv2.ml.ROW_SAMPLE, labels)
# Result: True

Because our data is an N×2 array in which each row is a data point, we pass cv2.ml.ROW_SAMPLE. The train method returns True on success.

A very useful method provided by knn is findNearest, which predicts the label of a new data point from the labels of its nearest neighbors. To try it, first generate a new data point:

newcomer, _=generate_data(1)
newcomer
# Result: array([[91., 59.]], dtype=float32)

generate_data also returns a random label, which we do not need here. Assigning it to an underscore (_) is the usual Python convention for a value that is deliberately ignored.

Back on our town map, we plot the training data set as before and add the new data point as a green circle:

plot_data(blue, red)
plt.plot(newcomer[0, 0], newcomer[0, 1], 'go', markersize=14);

Ending the plt.plot call with a semicolon suppresses the printed return value in IPython, much as it does in MATLAB.

This produces the following plot:

With k=1, the classifier predicts:

ret, results, neighbor, dist=knn.findNearest(newcomer, 1)
print("Predicted label:\t", results)
print("Neighbor's label:\t", neighbor)
print("Distance to neighbor:\t", dist)
'''
Result:
Predicted label:         [[1.]]
Neighbor's label:        [[1.]]
Distance to neighbor:    [[250.]]
'''

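From the above results, knn reports that the nearest neighbor has category 1 (red triangle), so the new data point is also assigned category 1. Note that the value 250 is the squared Euclidean distance to the nearest training point, (86, 74); the actual distance is about 15.8 units.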
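As a quick check on the distance value, the sketch below recomputes it with NumPy from the coordinates printed earlier; the variable names are chosen only for this illustration. The value 250 agrees with the squared Euclidean distance between the newcomer and the training point (86, 74), which suggests that findNearest reports squared distances here.

```python
import numpy as np

newcomer_xy = np.array([91.0, 59.0])  # the newcomer printed earlier
nearest_xy = np.array([86.0, 74.0])   # third row of train_data

# (91 - 86)^2 + (59 - 74)^2 = 25 + 225
squared = np.sum((newcomer_xy - nearest_xy) ** 2)

print(squared)           # 250.0
print(np.sqrt(squared))  # roughly 15.81
```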

If we greatly widen the search and classify the new data point by its k=7 nearest neighbors, the classifier predicts:

ret, results, neighbor, dist=knn.findNearest(newcomer, 7)
print("Predicted label:\t", results)
print("Neighbor's label:\t", neighbor)
print("Distance to neighbor:\t", dist)
'''
Result:
Predicted label:         [[0.]]
Neighbor's label:        [[1. 1. 0. 0. 0. 1. 0.]]
Distance to neighbor:    [[ 250.  401.  784.  916. 1073. 1360. 4885.]]
'''

From the above results, we can see that the predicted label becomes 0 (blue square). This is because only 3 of the 7 nearest neighbors have a label of 1 (red triangle), while the other 4 have a label of 0 (blue square), so the newcomer is predicted to be a blue square.
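The k=7 vote can be reproduced by hand with NumPy, without OpenCV. In this sketch the training points and the newcomer are copied from the outputs printed earlier, and the labels are the ones implied by the boolean mask shown earlier (True there means label 0); the variable names are hypothetical.

```python
import numpy as np

# Training points copied from the train_data output
train_xy = np.array([[71, 60], [20, 82], [86, 74], [74, 87], [99, 23], [2, 21],
                     [52, 1], [87, 29], [37, 1], [63, 59], [20, 32]], dtype=np.float32)
# Labels implied by the labels.ravel()==0 mask (assumption: labels are only 0 or 1)
train_lab = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1])
new_xy = np.array([91, 59], dtype=np.float32)

# Squared Euclidean distance from the newcomer to every training point
sq_dists = np.sum((train_xy - new_xy) ** 2, axis=1)
order = np.argsort(sq_dists)[:7]      # indices of the 7 nearest points

print(train_lab[order])               # [1 1 0 0 0 1 0]
print(sq_dists[order])                # the same seven values as dist above
print(np.bincount(train_lab[order]))  # [4 3]: four votes for 0, three for 1
```

The neighbor labels and squared distances match what findNearest printed, and the 4-to-3 vote explains the predicted label 0.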


Origin blog.csdn.net/Kannyi/article/details/113270364