Machine learning basics self-study notes - the k-nearest neighbors algorithm

The k-nearest neighbors algorithm; its full English name is K Nearest Neighbors, abbreviated KNN.

Concept introduction

The basic principle of the k-nearest neighbors algorithm is actually very simple. Let's look at an example:

[Figure]

There is a small black dot between an orange circle and a light yellow-green (chartreuse) circle. The little black dot says it wants to join one side, but it does not know whether to follow the orange circle or the light yellow-green circle, and asks you for help. Your first reaction is probably to see which circle the little black dot is closer to: whichever it is closer to is the one it belongs to. After measuring, the distance from the small black dot to the orange circle is a, the distance to the light yellow-green circle is b, and a > b. Now you know which circle the little black dot should belong to. This is the basic principle of the k-nearest neighbors algorithm, and the picture above is also the simplest possible case.

But the orange circle was not happy. Why should the small black dot belong to the light yellow-green circle? Why should it belong to whichever is closer? So it called in many of its brothers, and the situation became the one shown below.

[Figure]

At this time, the closest distance between an orange circle and the small black dot becomes c, with c < b. The orange circles now say the little black dot should belong to them! The light yellow-green circle is not having it; it thinks the average distance from the black dot to all circles of each color should be computed and the averages compared instead. But the orange circles have too many members, so the light yellow-green circle also calls in its brothers.

[Figure]

These light yellow-green circles went even further and ran right up to the orange circles to provoke them. After measuring, the shortest distance from the small black dot to an orange circle and to a light yellow-green circle turns out to be the same value c. This time a quarrel broke out. To add fuel to the fire, the purple circles also ran over, wanting to grab the little black dot for themselves.

[Figure]

So who should the little black dot belong to at this time?

That's a good question.

There are many possible judgment criteria. The KNN algorithm's approach is to draw a circle around the small black dot, look at which color appears most often inside that circle, and assign the black dot to that color.

[Figure]

We adopt a rounding rule: a colored circle counts as inside the blue circle if more than half of its area lies inside. Let us count how many circles of each of the three colors there are:

  • Orange: 2

  • Light yellow-green: 6

  • Purple: 9

This counting process is a voting process, and the vote is decided by the majority, so in this case the little black dot belongs to the purple circles. There are 17 small circles inside the big blue circle in total, so here k = 17 (in practice k will usually not be as large as 17).
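
To make the voting step concrete, here is a minimal sketch of majority voting with Python's `collections.Counter`, assuming the neighbor counts 2 / 6 / 9 from the example above:

```python
from collections import Counter

# Labels of the k = 17 neighbors found inside the blue circle
# (2 orange, 6 light yellow-green, 9 purple, as counted above).
neighbor_labels = ["orange"] * 2 + ["chartreuse"] * 6 + ["purple"] * 9

votes = Counter(neighbor_labels)
predicted_label, count = votes.most_common(1)[0]
print(predicted_label, count)  # purple 9
```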

However, the KNN algorithm can also fail, for example in the following situation:

[Figure]

The number of purple circles inside the blue circle is the same as the number of light yellow-green circles, so there is no way to decide which class the small black dot belongs to.

Compared with other supervised learning methods, the KNN algorithm can be regarded as extremely simple.

KNN algorithm features

(1) Lazy. Being lazy means there is no need for a heavy training phase: the labeled data is simply stored as-is. By contrast, algorithms such as linear regression must train all of their weight parameters before they can be used for classification or prediction.
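
As an illustrative sketch of this "laziness" (using scikit-learn's `KNeighborsClassifier` on hypothetical toy data, not something these notes require): fitting a KNN model amounts to handing it the labeled samples, and the real work happens only at prediction time.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy 2D samples: feature vectors and their class labels.
X_train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]]
y_train = ["orange", "orange", "purple", "purple"]

# "Fitting" a lazy learner essentially just stores the training data;
# distances and voting are computed when predict() is called.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[4.9, 5.1]]))  # expected: ['purple']
```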

KNN algorithm implementation steps

1) Samples with known categories

Like the circles of various colors in the first example. If there were no colored circles (each color is a category), there would be no way to judge which category the "small black dot" belongs to.

2) Choose a way of measuring distance, and measure the distance between the unknown sample and all known, labeled samples

Some people may wonder: isn't distance just the length of the shortest path between two points? Are there other ways to measure it? In fact, the distance here refers to distance in a generalized sense. Let me introduce several distance measures.

① Euclidean distance

The Euclidean distance is the most common distance measure: the straight-line distance between two points.

Suppose the coordinates of two points are $x_1(x_{11}, x_{12}, x_{13}, \cdots, x_{1n})$ and $x_2(x_{21}, x_{22}, x_{23}, \cdots, x_{2n})$ (we also call $x_{11}, x_{12}, x_{13}, \cdots, x_{1n}$ the features of $x_1$). Then the Euclidean distance between these two points is:

$$L(x_1, x_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$
② Manhattan distance

Suppose the coordinates of the two points are $x_1(x_{11}, x_{12}, x_{13}, \cdots, x_{1n})$ and $x_2(x_{21}, x_{22}, x_{23}, \cdots, x_{2n})$. Then the Manhattan distance between these two points is:

$$L(x_1, x_2) = \sum_{i=1}^{n} |x_{1i} - x_{2i}|$$
③ Chebyshev distance

Suppose the coordinates of the two points are $x_1(x_{11}, x_{12}, x_{13}, \cdots, x_{1n})$ and $x_2(x_{21}, x_{22}, x_{23}, \cdots, x_{2n})$. Then the Chebyshev distance between these two points is:

$$L(x_1, x_2) = \left( \sum_{i=1}^{n} |x_{1i} - x_{2i}|^p \right)^{\frac{1}{p}}$$

where $p$ tends to positive infinity; in the limit this equals $\max_i |x_{1i} - x_{2i}|$.

In fact, the three distance measures above can be unified into one formula (the Minkowski distance):

$$L(x_1, x_2) = \left( \sum_{i=1}^{n} |x_{1i} - x_{2i}|^p \right)^{\frac{1}{p}}$$

The Euclidean distance is the case $p = 2$, the Manhattan distance is the case $p = 1$, and the Chebyshev distance is the limit $p \to +\infty$.
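
A short sketch of these metrics in Python (a minimal illustration using NumPy; the function names are my own, not from the notes):

```python
import numpy as np

def minkowski_distance(x1, x2, p=2):
    """Generalized L_p distance; p=2 gives Euclidean, p=1 gives Manhattan."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

def chebyshev_distance(x1, x2):
    """Limit of the Minkowski distance as p tends to +infinity."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return np.max(np.abs(x1 - x2))

a, b = [1.0, 1.0], [4.0, 5.0]
print(minkowski_distance(a, b, p=2))  # Euclidean: 5.0
print(minkowski_distance(a, b, p=1))  # Manhattan: 7.0
print(chebyshev_distance(a, b))       # Chebyshev: 4.0
```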

3) Select the k known samples closest to the unknown sample

How to choose the value of k is something of an art. If k is chosen too large, the model tends to underfit; if k is chosen too small, it tends to overfit.

  • If we take k to be 1 (that is, the nearest-neighbor algorithm), the error rate (the chance of misclassification) is relatively high, because the result is easily swayed by a few special cases (as shown in the figure below).

[Figure]

In the picture above, the point closest to the black dot is a light yellow-green circle, but that circle is clearly far from the other light yellow-green circles. It is an extreme case, so the black dot should really be assigned to the orange circles.

  • If instead k is N, i.e. we judge by all the points, the black dot is simply assigned to whichever color has the most circles overall, which is obviously unreasonable. The following figure is a counterexample:

[Figure]

  • When the value of k is between 1 and N, the error rate typically first decreases and then increases as k grows, and the minimum marks the most suitable value of k. No ready-made algorithm can tell you the best k, but a few rules of thumb help narrow the range (a model-selection sketch follows this list).
    • k is usually chosen to be odd (especially when there are only two categories), to avoid ties in which the two classes receive equal shares of the vote.
    • k is usually kept fairly small. You can think of it this way: the larger k is, the more circles far away from the black dot have to be considered, yet those distant circles have little real influence on the black dot, so k should not be too large.
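
As a hedged sketch of how one might pick k in practice, here is cross-validation over a few small odd k values with scikit-learn (the Iris dataset is only a stand-in for illustration, not data from these notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration

# Try small odd values of k and keep the one with the best cross-validated accuracy.
candidate_ks = [1, 3, 5, 7, 9, 11]
scores = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in candidate_ks
}
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```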

4) "Voting" according to the minority obeying the majority

That is, compare how many of the k neighbors fall into each category.

5) Determine which category the unknown sample is classified into
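
Putting steps 1) to 5) together, here is a minimal brute-force KNN classifier sketch in plain NumPy (an illustration of the procedure above, with hypothetical toy data):

```python
import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=3):
    """Classify `query` by a majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)

    # Step 2): Euclidean distance from the query to every labeled sample.
    distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))

    # Step 3): indices of the k closest samples.
    nearest = np.argsort(distances)[:k]

    # Steps 4) and 5): majority vote among their labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters labeled "orange" and "purple".
X = [[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]]
y = ["orange", "orange", "orange", "purple", "purple", "purple"]
print(knn_classify([5.5, 6.0], X, y, k=3))  # expected: purple
```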

KNN algorithm key points

(1) All features of the sample should be quantified in a comparable manner.

If a sample contains a non-numeric feature, it must somehow be converted into a numeric value. For instance, if a sample contains a color, the color can be converted into a grayscale value before comparison.

(2) Sample features need to be normalized

A sample has multiple features, and each feature has its own domain and value range. To ensure a "fair competition" among features, all feature values are normalized.

As shown below:

[Figure]

The circles in the left picture have different sizes, just as each feature has its own domain and value range; to remove this effect, they are all rescaled to a common size, as in the right picture (a normalization sketch follows).
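
A minimal normalization sketch (min-max scaling of each feature to [0, 1]; an illustration, not code from the notes):

```python
import numpy as np

def min_max_normalize(X):
    """Scale every feature column to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return (X - col_min) / col_range

# Hypothetical samples where the second feature has a much larger value range.
X = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
print(min_max_normalize(X))
```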

(3) A distance function is required to calculate the distance between two samples

Commonly used distance functions include the Euclidean distance, cosine distance, Hamming distance, and Manhattan distance. The Euclidean distance is generally chosen, but it only applies to continuous variables. For discrete variables, such as in text classification, the Hamming distance can be used instead (a small sketch follows). Different metrics suit different kinds of samples.
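
A small Hamming-distance sketch for discrete features (a minimal illustration assuming equal-length sequences):

```python
def hamming_distance(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s1, s2))

print(hamming_distance("karolin", "kathrin"))        # 3
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```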

Advantages and disadvantages of KNN algorithm

Advantages of the KNN algorithm:

(1) Simple, easy to understand, easy to implement, no need to estimate parameters, no need to train;

This is easy to understand; as introduced earlier, this is what "laziness" refers to.

(2) Suitable for classifying rare events;

(3) Especially suitable for multi-classification problems (multi-modal, objects have multiple category labels)

Disadvantages of KNN algorithm:

The main disadvantage of KNN for classification is that when the classes are imbalanced, for example when one class has a large number of samples while the other classes have few, the large class may dominate the k neighbors of a new input sample simply because of its size, as shown in the figure below.

[Figure]

In this figure, intuition says the small black dot should belong to the orange circles, yet if you draw the blue circle as shown, the light yellow-green circles inside it outnumber the orange ones.

Improvement:
You can use distance weighting (neighbors at smaller distances from the sample get larger weights) to mitigate this, as sketched below.
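
A hedged sketch of this distance-weighted vote (using the common 1/distance weighting; this is one possible scheme, not the only one):

```python
import numpy as np
from collections import defaultdict

def weighted_knn_classify(query, X_train, y_train, k=3, eps=1e-12):
    """Vote among the k nearest neighbors, weighting each vote by 1 / distance."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)

    distances = np.sqrt(np.sum((X_train - query) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]

    weights = defaultdict(float)
    for i in nearest:
        weights[y_train[i]] += 1.0 / (distances[i] + eps)  # closer -> larger weight
    return max(weights, key=weights.get)

# Hypothetical imbalanced toy data: many "chartreuse" points, few "orange" points.
X = [[0, 0], [0.2, 0.1], [3, 3], [3.1, 3.2], [2.9, 3.1], [3.2, 2.8]]
y = ["orange", "orange", "chartreuse", "chartreuse", "chartreuse", "chartreuse"]
print(weighted_knn_classify([0.4, 0.3], X, y, k=3))  # expected: orange
```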

KNN algorithm optimization - kd tree

As mentioned earlier, the KNN algorithm needs to measure the distance between the point to be classified (the black dot) and every already-classified point (the colored circles), sort those distances, select the k closest circles, and then vote. The algorithm is very easy to understand, but if the data set is very large, say one million data points each with dozens of features, the computation becomes very expensive. Is there a way to reduce this cost?

The answer is yes.

To address this, people combined the idea with the binary tree data structure and proposed the kd tree (k-dimensional tree; note that this k refers to the number of dimensions, not the k in KNN).

The kd tree helps us quickly find the k training points closest to a test point without computing the distance from the test point to every known point. To make this easier to understand, we again use an example.

Construction of kd tree

Worked example

Now we have the following 2D data points: $(1,1), (1,5), (2,3), (3,7), (4,2), (5,4), (6,2), (7,1), (7,5)$. We want to use these data points to construct a kd tree.

[Figure]

In the first step, we take the $x_1$ coordinates of all the points and find their median. In the figure above the median is 4, so the corresponding point is $(4,2)$. We draw a line through $(4,2)$ perpendicular to the $x_1$ axis; this vertical line (the red dashed line) divides all the data points into a left part and a right part.

[Figure]

Next, we repeat the above operation on the left and right parts separately, but note that this time the median is taken over the $x_2$ coordinates of the points, and the dividing line drawn is perpendicular to the $x_2$ axis.

Note that in this case the four points on the left have no single middle point; here we simply pick either of the two middle points.

[Figure]

Continue dividing in the same way until each small region contains only one point or no points at all.

[Figure]

In this way, we have completed the division of a kd tree.

So why is it called a "tree"?

In fact, as we carry out the division we are also building a binary tree. We first drew a splitting line through $(4,2)$, so $(4,2)$ becomes the root node. We then drew splitting lines through the two points $(2,3)$ and $(5,4)$, so they become the left child node and the right child node respectively.

Note: whether a point becomes the left child or the right child is determined by a condition.

For $(4,2)$, we divided the data points by their $x_1$ values, so the left child's $x_1$ coordinate is smaller than the root's (for example $2 < 4$) and the right child's $x_1$ coordinate is larger (for example $5 > 4$). For $(2,3)$, since we divided the data points along the $x_2$ axis, the left child's $x_2$ coordinate is smaller than its parent's and the right child's $x_2$ coordinate is larger.

And so on. If there is no point left through which to draw a splitting line, we treat the corresponding child as an empty node. In this way we obtain the binary tree shown below.

[Figure]

Next, we use the kd tree to perform the k-nearest-neighbor search. Suppose we now have a point $(3,4)$, and we want to find its nearest neighbors.

[Figure]

The process of finding the nearest neighbors is really a search of the binary tree. We can also follow it intuitively on the coordinate plot.

For convenience, I label all the points A through I in turn, call the point to be classified p, and use S to store the k nearest neighbor points found so far (here we set k = 3).

[Figure]

Algorithm Description
  1. Take $x_1$ as the splitting axis and use the median of the $x_1$ coordinates of all training points as the split point, cutting the hyper-rectangular region into two sub-regions. The split point becomes the root node, which generates left and right child nodes of depth 1; the left child holds the points whose $x_1$ coordinate is smaller than the split point's, the right child the points whose $x_1$ coordinate is larger.
  2. For a node of depth $j$, choose $x_i$ as the splitting axis, where $i = (j \bmod k) + 1$ and $k$ is the number of dimensions. Take the median of the $x_i$ coordinates of the training points in the node's region as the split point, divide the region into two sub-regions, and generate left and right child nodes of depth $j+1$; the left child holds the points whose $x_i$ coordinate is smaller than the split point's, the right child the points whose $x_i$ coordinate is larger.
  3. Repeat step 2 until a sub-region contains no data, or only one data point, then stop. (A code sketch of this construction follows the list.)
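
A hedged sketch of this construction in Python (a minimal recursive build over the nine example points; the node layout and names are my own, not from the notes):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class KDNode:
    point: Tuple[float, ...]
    axis: int                      # splitting axis used at this node
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

def build_kd_tree(points, depth=0):
    """Recursively build a kd tree by splitting on the median along each axis in turn."""
    if not points:
        return None
    k = len(points[0])             # number of dimensions
    axis = depth % k               # cycle through the axes: x1, x2, x1, ...
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2         # median index (for an even count, either middle point works)
    return KDNode(
        point=points[mid],
        axis=axis,
        left=build_kd_tree(points[:mid], depth + 1),
        right=build_kd_tree(points[mid + 1:], depth + 1),
    )

# The nine 2D points from the example above.
data = [(1, 1), (1, 5), (2, 3), (3, 7), (4, 2), (5, 4), (6, 2), (7, 1), (7, 5)]
tree = build_kd_tree(data)
print(tree.point)  # root: (4, 2)
```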

kd tree search

Worked example (continued)

The following starts the kd tree search.

  • First step:
  • First we compare $(3,4)$ with the root node $(4,2)$. Since the splitting line through $(4,2)$ is perpendicular to the $x_1$ axis, we compare the $x_1$ coordinates of the two points. Since $3 < 4$, we move to the left child of the root node (because when we constructed the kd tree we stipulated that the left child is the smaller side and the right child the larger side).

[Figure]

In the figure above, the shaded region on the left of the coordinate plot is where the next search will be performed.

  • Second step:
  • Next we compare the $x_2$ coordinate of $(3,4)$ with the $x_2$ coordinate of $(2,3)$, the root of the left subtree. Since $4 > 3$, we search to the right.

[Figure]

In the figure above, the shaded region in the upper-left corner of the coordinate plot is where the next search will be performed.

  • Third step:
  • Since node E has only one child, we can go straight to node H without comparing coordinates at E. H is a leaf node, so this branch cannot be traversed any further; we mark node H as visited so that we do not need to visit it again. At this point S is empty, so we tentatively store H in S as a candidate k-nearest neighbor (this does not yet mean it really is a nearest neighbor; it may be replaced later).

[Figure]

In the figure above, the shaded region in the upper-left corner of the coordinate plot is where the next search takes place. In addition, node H has been stored in S.

  • Fourth step:
  • Since H is a leaf node we cannot go any further, so we return to the previous node E. E has only this one branch and it has already been searched, so we mark node E as visited and store E in S.

[Figure]

In the figure above, the shaded region in the upper-left corner of the coordinate plot has grown, indicating that the search range is being widened to cover areas not yet searched. Node E has been stored in S; node H is crossed off the kd tree and will not be searched again.

  • Fifth step:
  • Since node E has no other branch, we continue back up to the previous node B, mark node B as visited, and add it to S.

[Figure]

In the figure above, the shaded region in the upper-left corner has grown again, widening the search to areas not yet searched. Node B has been stored in S; node E is crossed off the kd tree and will not be searched again.

  • Sixth step:
  • Next, compute the distance from point p to node B's splitting line. Since this distance is $|4 - 3| = 1$, which is smaller than the largest of the distances from p to the three points H, E, and B currently in S, we must also search B's other child, node D.

Here is an explanation of why the splitting-line distance (the perpendicular distance from a point to a line) is computed:

[Figure]

As shown in the figure above, the point in S farthest from p is H, so draw a circle centered at p with radius $pH = r$. Let the distance from p to B's splitting line be $d$. By the relationship between a line and a circle, when $d < r$, the line through B parallel to the $x_1$ axis cuts the circle. That means part of the circle lies below B's splitting line, so there may be points inside the circle below that line, and we must go below point B to search. In the kd tree, this corresponds to searching the other branch of node B.

  • Seventh step:
  • Repeat the comparison of the second step and search node D. Since D is a leaf node, we cannot search any further, so we compute the distance $pD = \sqrt{(3-1)^2 + (4-1)^2} = \sqrt{13}$ and compare it with the distances already in S. Since $pD = \sqrt{13} > 3 = pH > pE > pB$, D does not replace any point in S (that is, D cannot be one of the k nearest neighbors).

[Figure]

In the figure above, the shaded region in the lower-left corner of the coordinate plot is the area searched in this step.

  • Eighth step:
  • Mark point D as visited and return to the previous node. Node B has already been marked, so we continue back up. We now repeat the operation of the sixth step and compute the distance from p to node A's splitting line: $|3 - 4| = 1$, which is smaller than the distances $pH$, $pE$, and $pB$, so we still need to search the other branch of the root node A.

[Figure]

  • Ninth step:
  • Mark A as visited and compare the distance $pA$ with the distances in S. Since $pA < pH$, A replaces H in S. Repeating the previous steps to finish the traversal, we finally obtain the following result:

[Figure]

Because of the special situation in this example, $pE = pA$, and choosing between them arbitrarily may affect the final classification result. In practice the probability of such a tie is small; if it does happen, it may be necessary to reset the value of k and search again.

With this, we have completed both the construction and the search of the kd tree and obtained the k nearest neighbor points.

It may not seem that this method simplifies the calculation much, but for a huge data set, being able to skip even one subtree in the traversal is a great convenience; whether a subtree can be skipped, however, depends on the data, so the saving is probabilistic.

[Figure]

As the figure above shows, if the kd-tree algorithm can prune a subtree, then none of the nodes below that subtree need to be traversed. This pruning ability comes from the operation in the sixth step: comparing the splitting-line distance with the current worst neighbor distance decides whether the subtree has to be traversed.

Algorithm Description
  1. Walk down the kd tree according to p's coordinates: at each node, if p's coordinate on that node's splitting axis is smaller than the node's, go to the left child, otherwise go to the right child [first step].

  2. When a leaf node is reached, mark it as visited. If S holds fewer than k points, add the node to S [third step]; if S already holds k points and the distance from the current node to p is smaller than the longest distance in S, replace the point in S farthest from p with the current node.

  3. If the current node is not the root node, execute (a); otherwise, end the algorithm.

(a) Roll back to the parent of the current node; that parent becomes the current node (the node after the rollback). If it has not been visited, mark it as visited and execute (b) and (c) [fourth step]; if it has already been visited, execute (a) again.

(b) If S holds fewer than k points, add the current node to S [fifth step]; if S already holds k points and the distance from the current node to p is smaller than the longest distance in S, replace the farthest point in S with the current node.

(c) Compute the distance from point p to the current node's splitting line. If that distance is greater than or equal to the farthest distance in S and S already holds k points, execute 3 [seventh step]; if the distance is smaller than the farthest distance in S, or S does not yet hold k points, start executing 1 from the current node's other child [eighth step]; if the current node has no other child, execute 3. (A code sketch of this search follows.)
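
A hedged sketch of this search in Python, reusing the `KDNode` / `build_kd_tree` sketch from the construction section. It is written recursively with a bounded max-heap instead of the explicit back-tracking loop above, but the pruning test is the same splitting-line comparison:

```python
import heapq
import math

def kd_knn_search(root, query, k=3):
    """Return the k points in the kd tree closest to `query` (Euclidean distance)."""
    heap = []  # max-heap of (negative distance, point); heap[0] is the current worst neighbor

    def visit(node):
        if node is None:
            return
        dist = math.dist(node.point, query)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, node.point))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, node.point))

        # Descend first into the side of the splitting line that contains the query.
        diff = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
        visit(near)

        # Search the far side only if the splitting line is closer than the worst
        # neighbor found so far (or S is not yet full) -- the pruning step.
        if len(heap) < k or abs(diff) < -heap[0][0]:
            visit(far)

    visit(root)
    return [point for _, point in sorted(heap, reverse=True)]

data = [(1, 1), (1, 5), (2, 3), (3, 7), (4, 2), (5, 4), (6, 2), (7, 1), (7, 5)]
tree = build_kd_tree(data)          # from the construction sketch above
print(kd_knn_search(tree, (3, 4)))  # the 3 points nearest to (3, 4)
```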

Source: blog.csdn.net/m0_61787307/article/details/129504138