Machine Learning Fundamentals: k-Nearest Neighbor (KNN) Algorithm

Basic concept

The k-nearest neighbor (k-NN) method is a basic classification and regression method proposed by T. Cover and P. Hart in 1967.

The basic idea is as follows: we have a sample data set in which every feature attribute is known and every object carries a known class label. For a new object whose class is unknown, we compare each of its feature attributes with the corresponding attributes of the objects in the sample set and extract the class labels of the most similar objects (the nearest neighbors). In practice we only consider the top k most similar objects in the sample data set, which is where the k in k-nearest neighbor comes from; k is usually an integer no greater than 20. Finally, the class of the new object is decided from the labels of these k neighbors.

In other words: if a sample is most similar to k samples in the data set, and the majority of those k samples belong to a certain class, then the sample is assigned to that class as well.
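
To make this concrete, here is a minimal from-scratch sketch of the majority-vote rule described above; the toy data, the use of Euclidean distance, and k=3 are illustrative assumptions rather than part of the original text.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote among its k nearest training samples."""
    # Euclidean distance from x_new to every training sample
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Majority vote over their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two 2-D classes, "A" around (1, 1) and "B" around (0, 0)
X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([0.9, 0.8])))  # -> A
```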


Distance functions

Manhattan distance (L1 norm)

Manhattan distance is also called "city block distance". Imagine you are on the streets of Manhattan, driving from one intersection to another: the distance you drive along the street grid is the "Manhattan distance".

  • For two points $a(x_1, y_1)$ and $b(x_2, y_2)$ on a two-dimensional plane, the Manhattan distance formula is:
    $d_{ab} = |x_1 - x_2| + |y_1 - y_2|$
  • For two points $a(x_1, x_2, \ldots, x_n)$ and $b(y_1, y_2, \ldots, y_n)$ in n-dimensional space, the Manhattan distance formula is:
    $d_{ab} = |x_1 - y_1| + |x_2 - y_2| + \ldots + |x_n - y_n|$

Euclidean distance (L2 norm)

  • For two points $a(x_1, y_1)$ and $b(x_2, y_2)$ on a two-dimensional plane, the Euclidean distance formula is:
    $d_{ab} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
  • For two points $a(x_1, x_2, \ldots, x_n)$ and $b(y_1, y_2, \ldots, y_n)$ in n-dimensional space, the Euclidean distance formula is:
    $d_{ab} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2}$

Chebyshev distance ($\infty$ norm)

  • For two points $a(x_1, y_1)$ and $b(x_2, y_2)$ on a two-dimensional plane, the Chebyshev distance formula is:
    $d_{ab} = \max\left(|x_1 - x_2|, |y_1 - y_2|\right)$
  • For two points $a(x_1, x_2, \ldots, x_n)$ and $b(y_1, y_2, \ldots, y_n)$ in n-dimensional space, the Chebyshev distance formula is:
    $d_{ab} = \max\left(|x_1 - y_1|, |x_2 - y_2|, \ldots, |x_n - y_n|\right)$

Cosine similarity (angle cosine)

  • For two points $a(x_1, y_1)$ and $b(x_2, y_2)$ on a two-dimensional plane, the angle cosine formula is:
    $\cos\Theta = \dfrac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2} \cdot \sqrt{x_2^2 + y_2^2}}$
  • For two points $a(x_1, x_2, \ldots, x_n)$ and $b(y_1, y_2, \ldots, y_n)$ in n-dimensional space, the angle cosine formula is:
    $\cos\Theta = \dfrac{x_1 y_1 + x_2 y_2 + \ldots + x_n y_n}{\sqrt{x_1^2 + x_2^2 + \ldots + x_n^2} \cdot \sqrt{y_1^2 + y_2^2 + \ldots + y_n^2}}$
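
As a quick sketch, all four of the measures above can be computed with NumPy; the example vectors a and b below are arbitrary.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

manhattan = np.abs(a - b).sum()             # L1 norm of the difference
euclidean = np.sqrt(((a - b) ** 2).sum())   # L2 norm of the difference
chebyshev = np.abs(a - b).max()             # infinity norm of the difference
cos_theta = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle cosine

print(manhattan, euclidean, chebyshev, cos_theta)
```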

KD-Tree

Since the KNN algorithm has to find the K training samples nearest to each test sample, a brute-force linear search over the whole training set is too expensive. A commonly used structure to speed up this search is the KD-Tree, whose basic principle is similar to a binary search tree (BST), except that it partitions data along multiple feature dimensions. A schematic diagram is shown below:

[Figure: KD-Tree schematic]

The details are not covered here; see references [1] and [2].

The KD-Tree algorithm is provided in sklearn.
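
As a rough usage sketch (the random training data and labels below are made up for illustration), the tree can either be queried directly or used implicitly through the KNN classifier:

```python
import numpy as np
from sklearn.neighbors import KDTree, KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 2)              # 100 random 2-D training points
y = (X[:, 0] > 0.5).astype(int)   # simple synthetic class labels

# Query the KD-Tree directly for the 3 nearest neighbors of the first point
tree = KDTree(X)
dist, ind = tree.query(X[:1], k=3)

# Or let the classifier build and search the KD-Tree internally
clf = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree")
clf.fit(X, y)
print(clf.predict([[0.8, 0.2]]))  # predicted class of a new point
```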

Advantages and disadvantages

Advantages

  • Simple and easy to implement, easy to understand; the theory is mature, and it can be used for both classification and regression;
  • Can be used for non-linear classification;
  • Works with both numerical and discrete data;
  • Training time complexity is O(n), and the algorithm makes no assumptions about the input data distribution;
  • High accuracy and insensitive to outliers.

Disadvantages

  • High computational complexity and high space complexity; the entire training set must be kept in memory;
  • Sensitive to class imbalance (i.e. some classes have many samples while others have very few), since the majority vote can be dominated by the larger classes;
  • Generally not suitable when the data set is very large, because the amount of computation is too high; on the other hand, the sample size should not be too small either, otherwise misclassification is likely;
  • The biggest disadvantage is that it gives no insight into the intrinsic meaning of the data, since no explicit model is learned.

Thoughts

References

  1. The principle of KD Tree and its implementation in Python
  2. Explain KDTree in detail
  3. Detailed Explanation of KNN (k Nearest Neighbor) Algorithm for Machine Learning

Reprinted from: blog.csdn.net/xuyangcao123/article/details/115190491