Machine Learning (1) KNN Algorithm


Foreword

The $k$-nearest neighbor method is a basic classification method. Its input is the feature vector of an instance, corresponding to a point in the feature space; its output is the category of the instance. The $k$-nearest neighbor method actually uses the training data set to partition the feature vector space, and this partition serves as the "model" for classification, so there is no explicit learning process. The choice of the $k$ value, the distance measure, and the classification decision rule are the three basic elements of the $k$-nearest neighbor method.


1. The $k$-nearest neighbor algorithm

Input: training data set
$T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$,
where $x_i \in \mathcal{X} \subseteq \mathbf{R}^n$ is the feature vector of an instance, $y_i \in \mathcal{Y} = \{c_1, c_2, \cdots, c_K\}$ is the category of the instance, and $i = 1, 2, \cdots, N$.
Output: the class $y$ to which instance $x$ belongs.

  1. According to the given distance measure, find in the training set $T$ the $k$ points nearest to $x$; the neighborhood of $x$ covering these $k$ points is denoted $N_k(x)$;
  2. In $N_k(x)$, determine the category $y$ of $x$ according to a classification decision rule (such as majority voting):
    $$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \cdots, N, \; j = 1, 2, \cdots, K$$
    where $I$ is the indicator function, i.e., $I$ equals 1 when $y_i = c_j$ and 0 otherwise.

A special case of the $k$-nearest neighbor method is $k = 1$, which is called the nearest neighbor algorithm.
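A minimal brute-force sketch of the algorithm above in Python (pure Python, no external libraries; the function and variable names are illustrative, not taken from the book or its code repository):

```python
from collections import Counter


def knn_classify(x, training_set, k, p=2):
    """Classify point x by majority vote among its k nearest training points.

    training_set is a list of (feature_vector, label) pairs; p selects the
    L_p distance (p=2 gives the Euclidean distance).
    """
    # Compute the L_p distance from x to every training point.
    distances = [
        (sum(abs(a - b) ** p for a, b in zip(x, features)) ** (1 / p), label)
        for features, label in training_set
    ]
    # Take the k nearest neighbors, i.e. N_k(x).
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # Majority vote over the labels of the neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Tiny two-class example.
train = [((1.0, 1.1), "A"), ((1.2, 0.9), "A"), ((3.0, 3.2), "B"), ((2.9, 3.1), "B")]
print(knn_classify((1.1, 1.0), train, k=3))  # -> "A"
```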

2. Three basic elements

1. Distance measure

The feature space of the $k$-nearest neighbor model is generally the $n$-dimensional real vector space $\mathbf{R}^n$. The Euclidean distance is usually used, but other distances are possible, such as the more general $L_p$ distance. For two points $x_1 = (x_1^1, x_1^2, \cdots, x_1^n)^T$ and $x_2 = (x_2^1, x_2^2, \cdots, x_2^n)^T$ in the space, the $L_p$ distance is defined as
$$L_p(x_1, x_2) = \left(\sum_{l=1}^{n} |x_1^l - x_2^l|^p\right)^{\frac{1}{p}}$$
When $p = 2$ it is called the Euclidean distance, and when $p = 1$ it is called the Manhattan distance.
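A small sketch of the $L_p$ distance formula (illustrative, pure Python):

```python
def lp_distance(x1, x2, p=2):
    """L_p distance between two points given as equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1 / p)


x1, x2 = (1, 1), (4, 5)
print(lp_distance(x1, x2, p=2))  # Euclidean distance: 5.0
print(lp_distance(x1, x2, p=1))  # Manhattan distance: 7.0
```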

2. Selection of the $k$ value

  The choice of the $k$ value has a significant impact on the results of the $k$-nearest neighbor method.
  If a smaller $k$ value is chosen, it is equivalent to making predictions with the training instances in a smaller neighborhood, and the prediction becomes very sensitive to the neighboring instance points: if a nearby instance point happens to be noise, the prediction will be wrong. In other words, a smaller $k$ value means the overall model becomes more complex and prone to overfitting.
  If a larger $k$ value is chosen, it is equivalent to making predictions with the training instances in a larger neighborhood. In this case training instances that are far from (dissimilar to) the input instance also affect the prediction and can make it wrong. A larger $k$ value means the overall model becomes simpler.
  In practice, the $k$ value is generally set to a relatively small value, and cross-validation is usually used to select the optimal $k$ value.
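For instance, one common way to pick $k$ by cross-validation is to score several candidate values and keep the best one. A sketch using scikit-learn (assuming scikit-learn is installed; the iris data set here is only a placeholder for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best one.
scores = {}
for k in range(1, 16, 2):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```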

3. Classification decision rules

The classification decision rule here is majority voting, that is, the class of the input instance is determined by the majority class among its $k$ nearest training instances.
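The majority-voting rule itself is a one-liner; a minimal illustrative sketch:

```python
from collections import Counter

# Labels of the k nearest training instances (here k = 5).
neighbor_labels = ["cat", "dog", "cat", "cat", "dog"]

# The class of the input instance is the most frequent label among its neighbors.
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # -> "cat"
```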


3. kd tree

  Through the study above we already understand the specific steps of the $k$-nearest neighbor algorithm, and at this point it can be implemented directly in a programming language such as Python.
  However, when the instances in the training set have many dimensions and the training data are large, computing the distance between the prediction point and every instance point one by one in order to find the $k$ nearest neighbors is very time-consuming, and this approach becomes infeasible. To improve the efficiency of the $k$-nearest neighbor search, a special data structure can be used to store the training data and reduce the number of distance computations. There are many such methods; here we introduce one of them, the kd tree.

1. Constructing a balanced kd tree

  The algorithm for constructing a balanced kd tree is as follows.

  1. Start: construct the root node, which corresponds to the hyperrectangular region of the $k$-dimensional space containing all of $T$. Choose $x^1$ as the splitting axis and take the median of the $x^1$ coordinates of all instances in $T$ as the splitting point, dividing the hyperrectangular region corresponding to the root node into two subregions.
    Generate left and right child nodes of depth 1 from the root node: the left child node corresponds to the subregion whose $x^1$ coordinates are smaller than the splitting point, and the right child node corresponds to the subregion whose $x^1$ coordinates are larger than the splitting point.
    Store the instance point falling on the splitting hyperplane in the root node.
  2. Repeat: for a node of depth $j$, choose $x^l$ as the splitting axis, where $l = j\,(\text{mod } k) + 1$, take the median of the $x^l$ coordinates of all instances in the node's region as the splitting point, and divide the hyperrectangular region corresponding to the node into two subregions.
    Generate the left and right child nodes of depth $j+1$ from this node: the left child node corresponds to the subregion whose $x^l$ coordinates are smaller than the splitting point, and the right child node corresponds to the subregion whose $x^l$ coordinates are larger than the splitting point.
    Store the instance point falling on the splitting hyperplane in this node.
  3. Stop when there are no instances left in the two subregions; this yields the region partition of the kd tree. A code sketch of this procedure is given below.
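A Python sketch of this recursive median-splitting construction (the `Node` class and function name are illustrative, not the book's reference implementation):

```python
class Node:
    """A kd-tree node storing one instance point and its splitting axis."""

    def __init__(self, point, axis, left=None, right=None):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build_kdtree(points, depth=0):
    """Recursively build a balanced kd-tree from a list of k-dimensional points."""
    if not points:
        return None
    k = len(points[0])
    # The book's l = j (mod k) + 1 with 1-based axes is depth % k with 0-based axes.
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2  # the median point becomes the splitting point
    return Node(
        point=points[median],
        axis=axis,
        left=build_kdtree(points[:median], depth + 1),
        right=build_kdtree(points[median + 1:], depth + 1),
    )
```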

We use an example to show how to construct a kd tree.
Given the two-dimensional training data set
$T = \{(2,3)^T, (5,4)^T, (9,6)^T, (4,7)^T, (8,1)^T, (7,2)^T\}$,
construct a balanced kd tree.

  1. First step: take $x^1$ as the splitting axis. The median of the $x^1$ coordinates is taken as 7 (so that the splitting point coincides with a data point), and we split the data at $x^1 = 7$. As shown in the figure below:
    (Figure: the plane split by the line $x^1 = 7$)
    At this point the point $(7,2)$ is stored at the root node (depth 0); there are three points in the left region and two points in the right region.
  2. Second step: the nodes now have depth 1, and $l = 1\,(\text{mod } 2) + 1 = 2$, so we choose $x^2$ as the splitting axis. The median of the $x^2$ coordinates in the left region is 4, and the median of the $x^2$ coordinates in the right region is 6; the left and right regions are split accordingly. As shown in the figure below:
    (Figure: the plane after the second split, at $x^2 = 4$ in the left region and $x^2 = 6$ in the right region)
    At this point the kd tree looks like this:
    (Figure: the kd tree so far, with root $(7,2)$ and children $(5,4)$ and $(9,6)$)
  3. Third step: the nodes now have depth 2, and $l = 2\,(\text{mod } 2) + 1 = 1$, so we choose $x^1$ as the splitting axis. After splitting, the result is shown in the figure below:
    (Figure: the final partition of the plane)
    The kd tree is shown in the figure below:
    (Figure: the completed kd tree)

At this point there are no instances left in any subregion (that is, there are no more instance points to divide), so the region partition of the kd tree stops.
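As a quick check of the hand construction, the hypothetical `build_kdtree` sketch from above can be run on this data set:

```python
T = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(T)


def print_tree(node, depth=0):
    """Print the kd-tree one node per line, indented by depth."""
    if node is None:
        return
    print("  " * depth + f"axis={node.axis}, point={node.point}")
    print_tree(node.left, depth + 1)
    print_tree(node.right, depth + 1)


print_tree(tree)
# Root (7, 2); depth-1 nodes (5, 4) and (9, 6); leaves (2, 3), (4, 7), (8, 1),
# which matches the tree constructed by hand above.
```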

2. Nearest neighbor search with a kd tree

In the previous subsection we introduced how to construct a kd tree; in this subsection we introduce how to use a kd tree for nearest neighbor search.
The algorithm is as follows:
Input: a constructed kd tree and a target point $x$.
Output: the nearest neighbor of $x$.

  1. Starting from the root node, recursively descend the kd tree: if the coordinate of the target point $x$ in the current splitting dimension is smaller than that of the splitting point, move to the left child node; otherwise move to the right child node, until a leaf node is reached.
  2. Take this leaf node as the "current closest point" (computing the distance from $x$ to this leaf node).
  3. Recursively back up, and at each node (denoted $o$) perform the following operations:
    • (a) If the instance point stored at node $o$ is closer to the target point $x$ than the current closest point, take node $o$ as the "current closest point".
    • (b) The current closest point must lie in the region corresponding to one child node of node $o$. Check whether the region corresponding to the other child node of node $o$ contains a closer point. Concretely, check whether that region intersects the hypersphere centered at the target point $x$ with radius equal to the distance between the target point and the "current closest point". If they intersect, there may be a point closer to the target point in the region corresponding to the other child node, so move to that child node and continue the nearest neighbor search recursively; if they do not intersect, keep backing up.
  4. When we back up to the root node, the search ends, and the final "current closest point" is the nearest neighbor of $x$.
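A sketch of this search in Python, reusing the hypothetical `Node`/`build_kdtree` definitions from the construction sketch above. It is written as a single recursive descent that checks nodes on the way back up, which is equivalent to the leaf-first-then-backtrack description:

```python
import math


def nearest_neighbor(node, target, best=None):
    """Return the point stored in the kd-tree that is closest to target."""
    if node is None:
        return best

    def dist(p):
        return math.dist(p, target)  # Euclidean distance (Python 3.8+)

    # (a) Update the current closest point if this node's point is closer.
    if best is None or dist(node.point) < dist(best):
        best = node.point

    axis = node.axis
    # Descend first into the side of the splitting plane that contains the target.
    if target[axis] < node.point[axis]:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    best = nearest_neighbor(near, target, best)

    # (b) Search the other side only if the hypersphere around the target with
    # radius "distance to the current closest point" crosses the splitting plane.
    if abs(target[axis] - node.point[axis]) < dist(best):
        best = nearest_neighbor(far, target, best)
    return best
```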

We now use an example to demonstrate the process of nearest neighbor search on the kd tree. The target point is $x = (2, 4.5)$, and we want to find the nearest neighbor of $x$.
Following the algorithm, after the first step we arrive at the point $(4,7)$; the path of the descent is $(7,2) \to (5,4) \to (4,7)$.
Second step: take the leaf node $(4,7)$ as the current closest point $o$. The distance from the target point to $o$ is 3.202.
Third step: recursively back up to the point $(5,4)$. (a) The distance from the target point to $(5,4)$ is 3.041, so $(5,4)$ is closer to the target point $x$ than the current closest point $o$, and we update the current closest point $o$ to $(5,4)$. (b) The circle centered at the target point $x$, with radius equal to the distance between the target point and the "current closest point $o$", intersects the region corresponding to $(2,3)$, the other child node of $o = (5,4)$. So we move to the point $(2,3)$; the distance from $(2,3)$ to the target point is smaller than that of the current closest point $o = (5,4)$, so the closest point $o$ is updated to $(2,3)$. We then continue to back up to the point $(7,2)$; the point $(7,2)$ is farther from the target point than the point $(2,3)$.
Fourth step: we back up to the root node and the search ends. The closest point $o = (2,3)$ is the nearest neighbor of the target point $x = (2, 4.5)$.
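Running the search sketch on this example reproduces the hand computation (again reusing the hypothetical helpers defined in the sketches above):

```python
T = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(T)
print(nearest_neighbor(tree, (2, 4.5)))  # -> (2, 3)
```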

The overall idea is roughly:

  1. First perform a binary search down the tree to find the leaf node;
  2. Backtrack along the search path;
  3. At each node, check whether the region of the other child node may contain a closer point;
  4. If so, search that other subspace;
  5. Finally backtrack to the root node, at which point the search ends.

4. Reference link

Code link: https://github.com/zgaxddnhh/lihang-code
Reference book: Machine Learning Methods, Li Hang
