Machine Learning (1) KNN Algorithm


Foreword

The $k$-nearest neighbor method is a basic classification method. Its input is the feature vector of an instance, corresponding to a point in the feature space; its output is the category of the instance. The $k$-nearest neighbor method actually uses the training data set to partition the feature vector space, and this partition serves as the "model" for classification, so there is no explicit learning process. The choice of the $k$ value, the distance measure, and the classification decision rule are the three basic elements of the $k$-nearest neighbor method.


1. The $k$-nearest neighbor algorithm

Input: training data set
$T = \{(x_1, y_1), (x_2, y_2), \cdots, (x_N, y_N)\}$,
where $x_i \in \mathcal{X} \subseteq \mathbf{R}^n$ is the feature vector of an instance, $y_i \in \mathcal{Y} = \{c_1, c_2, \cdots, c_K\}$ is the category of the instance, and $i = 1, 2, \cdots, N$.
Output: the class $y$ to which instance $x$ belongs.

  1. According to the given distance measure, find in the training set $T$ the $k$ points nearest to $x$; the neighborhood of $x$ covering these $k$ points is denoted $N_k(x)$;
  2. In $N_k(x)$, determine the category $y$ of $x$ according to a classification decision rule (such as majority voting):
    $$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j), \quad i = 1, 2, \cdots, N, \; j = 1, 2, \cdots, K$$
    where $I$ is the indicator function, i.e., $I$ equals 1 when $y_i = c_j$ and 0 otherwise.

A special case of the $k$-nearest neighbor method is $k = 1$, which is called the nearest neighbor algorithm.
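A minimal brute-force sketch of the algorithm above in Python (pure Python, no external libraries; the function and variable names are illustrative, not taken from the book or its code repository):

```python
from collections import Counter


def knn_classify(x, training_set, k, p=2):
    """Classify point x by majority vote among its k nearest training points.

    training_set is a list of (feature_vector, label) pairs; p selects the
    L_p distance (p=2 gives the Euclidean distance).
    """
    # Compute the L_p distance from x to every training point.
    distances = [
        (sum(abs(a - b) ** p for a, b in zip(x, features)) ** (1 / p), label)
        for features, label in training_set
    ]
    # Take the k nearest neighbors, i.e. N_k(x).
    neighbors = sorted(distances, key=lambda d: d[0])[:k]
    # Majority vote over the labels of the neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Tiny two-class example.
train = [((1.0, 1.1), "A"), ((1.2, 0.9), "A"), ((3.0, 3.2), "B"), ((2.9, 3.1), "B")]
print(knn_classify((1.1, 1.0), train, k=3))  # -> "A"
```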

2. Three basic elements

1. Distance measure

The feature space of the $k$-nearest neighbor model is generally the $n$-dimensional real vector space $\mathbf{R}^n$. The Euclidean distance is usually used, but other distances are possible, such as the more general $L_p$ distance. For two points $x_1 = (x_1^1, x_1^2, \cdots, x_1^n)^T$ and $x_2 = (x_2^1, x_2^2, \cdots, x_2^n)^T$ in the space, the $L_p$ distance is defined as
$$L_p(x_1, x_2) = \left(\sum_{l=1}^{n} |x_1^l - x_2^l|^p\right)^{\frac{1}{p}}$$
When $p = 2$ it is called the Euclidean distance, and when $p = 1$ it is called the Manhattan distance.
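A small sketch of the $L_p$ distance formula (illustrative, pure Python):

```python
def lp_distance(x1, x2, p=2):
    """L_p distance between two points given as equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1 / p)


x1, x2 = (1, 1), (4, 5)
print(lp_distance(x1, x2, p=2))  # Euclidean distance: 5.0
print(lp_distance(x1, x2, p=1))  # Manhattan distance: 7.0
```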

2. Selection of the $k$ value

  The choice of the $k$ value has a significant impact on the results of the $k$-nearest neighbor method.
  If a smaller $k$ value is chosen, it is equivalent to making predictions with the training instances in a smaller neighborhood, and the prediction becomes very sensitive to the neighboring instance points: if a nearby instance point happens to be noise, the prediction will be wrong. In other words, a smaller $k$ value means the overall model becomes more complex and prone to overfitting.
  If a larger $k$ value is chosen, it is equivalent to making predictions with the training instances in a larger neighborhood. In this case training instances that are far from (dissimilar to) the input instance also affect the prediction and can make it wrong. A larger $k$ value means the overall model becomes simpler.
  In practice, the $k$ value is generally set to a relatively small value, and cross-validation is usually used to select the optimal $k$ value.
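For instance, one common way to pick $k$ by cross-validation is to score several candidate values and keep the best one. A sketch using scikit-learn (assuming scikit-learn is installed; the iris data set here is only a placeholder for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best one.
scores = {}
for k in range(1, 16, 2):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```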

3. Classification decision rules

The classification decision rule here is majority voting, that is, the class of the input instance is determined by the majority class among its $k$ nearest training instances.
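The majority-voting rule itself is a one-liner; a minimal illustrative sketch:

```python
from collections import Counter

# Labels of the k nearest training instances (here k = 5).
neighbor_labels = ["cat", "dog", "cat", "cat", "dog"]

# The class of the input instance is the most frequent label among its neighbors.
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print(predicted)  # -> "cat"
```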


3. kd tree

  Through the study above we already understand the specific steps of the $k$-nearest neighbor algorithm, and at this point it can be implemented directly in a programming language such as Python.
  However, when the instances in the training set have many dimensions and the training data are large, computing the distance between the prediction point and every instance point one by one in order to find the $k$ nearest neighbors is very time-consuming, and this approach becomes infeasible. To improve the efficiency of the $k$-nearest neighbor search, a special data structure can be used to store the training data and reduce the number of distance computations. There are many such methods; here we introduce one of them, the kd tree.

1. Constructing a balanced kd tree

  The algorithm for constructing a balanced kd tree is as follows.

  1. Start: construct the root node, which corresponds to the hyperrectangular region of the $k$-dimensional space containing all of $T$. Choose $x^1$ as the splitting axis and take the median of the $x^1$ coordinates of all instances in $T$ as the splitting point, dividing the hyperrectangular region corresponding to the root node into two subregions.
    Generate left and right child nodes of depth 1 from the root node: the left child node corresponds to the subregion whose $x^1$ coordinates are smaller than the splitting point, and the right child node corresponds to the subregion whose $x^1$ coordinates are larger than the splitting point.
    Store the instance point falling on the splitting hyperplane in the root node.
  2. Repeat: for a node of depth $j$, choose $x^l$ as the splitting axis, where $l = j\,(\text{mod } k) + 1$, take the median of the $x^l$ coordinates of all instances in the node's region as the splitting point, and divide the hyperrectangular region corresponding to the node into two subregions.
    Generate the left and right child nodes of depth $j+1$ from this node: the left child node corresponds to the subregion whose $x^l$ coordinates are smaller than the splitting point, and the right child node corresponds to the subregion whose $x^l$ coordinates are larger than the splitting point.
    Store the instance point falling on the splitting hyperplane in this node.
  3. Stop when there are no instances left in the two subregions; this yields the region partition of the kd tree. A code sketch of this procedure is given below.
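A Python sketch of this recursive median-splitting construction (the `Node` class and function name are illustrative, not the book's reference implementation):

```python
class Node:
    """A kd-tree node storing one instance point and its splitting axis."""

    def __init__(self, point, axis, left=None, right=None):
        self.point = point
        self.axis = axis
        self.left = left
        self.right = right


def build_kdtree(points, depth=0):
    """Recursively build a balanced kd-tree from a list of k-dimensional points."""
    if not points:
        return None
    k = len(points[0])
    # The book's l = j (mod k) + 1 with 1-based axes is depth % k with 0-based axes.
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2  # the median point becomes the splitting point
    return Node(
        point=points[median],
        axis=axis,
        left=build_kdtree(points[:median], depth + 1),
        right=build_kdtree(points[median + 1:], depth + 1),
    )
```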

We use an example to show how to construct a kd tree.
Given the two-dimensional training data set
$T = \{(2,3)^T, (5,4)^T, (9,6)^T, (4,7)^T, (8,1)^T, (7,2)^T\}$,
construct a balanced kd tree.

  1. First step: take $x^1$ as the splitting axis. The median of the $x^1$ coordinates is taken as 7 (so that the splitting point coincides with a data point), and we split the data at $x^1 = 7$. As shown in the figure below:
    (Figure: the plane split by the line $x^1 = 7$)
    At this point the point $(7,2)$ is stored at the root node (depth 0); there are three points in the left region and two points in the right region.
  2. Second step: the nodes now have depth 1, and $l = 1\,(\text{mod } 2) + 1 = 2$, so we choose $x^2$ as the splitting axis. The median of the $x^2$ coordinates in the left region is 4, and the median of the $x^2$ coordinates in the right region is 6; the left and right regions are split accordingly. As shown in the figure below:
    (Figure: the plane after the second split, at $x^2 = 4$ in the left region and $x^2 = 6$ in the right region)
    At this point the kd tree looks like this:
    (Figure: the kd tree so far, with root $(7,2)$ and children $(5,4)$ and $(9,6)$)
  3. Third step: the nodes now have depth 2, and $l = 2\,(\text{mod } 2) + 1 = 1$, so we choose $x^1$ as the splitting axis. After splitting, the result is shown in the figure below:
    (Figure: the final partition of the plane)
    The kd tree is shown in the figure below:
    (Figure: the completed kd tree)

At this point there are no instances left in any subregion (that is, there are no more instance points to divide), so the region partition of the kd tree stops.
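As a quick check of the hand construction, the hypothetical `build_kdtree` sketch from above can be run on this data set:

```python
T = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(T)


def print_tree(node, depth=0):
    """Print the kd-tree one node per line, indented by depth."""
    if node is None:
        return
    print("  " * depth + f"axis={node.axis}, point={node.point}")
    print_tree(node.left, depth + 1)
    print_tree(node.right, depth + 1)


print_tree(tree)
# Root (7, 2); depth-1 nodes (5, 4) and (9, 6); leaves (2, 3), (4, 7), (8, 1),
# which matches the tree constructed by hand above.
```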

2. Nearest neighbor search with a kd tree

In the previous subsection we introduced how to construct a kd tree; in this subsection we introduce how to use a kd tree for nearest neighbor search.
The algorithm is as follows:
Input: a constructed kd tree and a target point $x$.
Output: the nearest neighbor of $x$.

  1. Starting from the root node, recursively descend the kd tree: if the coordinate of the target point $x$ in the current splitting dimension is smaller than that of the splitting point, move to the left child node; otherwise move to the right child node, until a leaf node is reached.
  2. Take this leaf node as the "current closest point" (computing the distance from $x$ to this leaf node).
  3. Recursively back up, and at each node (denoted $o$) perform the following operations:
    • (a) If the instance point stored at node $o$ is closer to the target point $x$ than the current closest point, take node $o$ as the "current closest point".
    • (b) The current closest point must lie in the region corresponding to one child node of node $o$. Check whether the region corresponding to the other child node of node $o$ contains a closer point. Concretely, check whether that region intersects the hypersphere centered at the target point $x$ with radius equal to the distance between the target point and the "current closest point". If they intersect, there may be a point closer to the target point in the region corresponding to the other child node, so move to that child node and continue the nearest neighbor search recursively; if they do not intersect, keep backing up.
  4. When we back up to the root node, the search ends, and the final "current closest point" is the nearest neighbor of $x$.
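A sketch of this search in Python, reusing the hypothetical `Node`/`build_kdtree` definitions from the construction sketch above. It is written as a single recursive descent that checks nodes on the way back up, which is equivalent to the leaf-first-then-backtrack description:

```python
import math


def nearest_neighbor(node, target, best=None):
    """Return the point stored in the kd-tree that is closest to target."""
    if node is None:
        return best

    def dist(p):
        return math.dist(p, target)  # Euclidean distance (Python 3.8+)

    # (a) Update the current closest point if this node's point is closer.
    if best is None or dist(node.point) < dist(best):
        best = node.point

    axis = node.axis
    # Descend first into the side of the splitting plane that contains the target.
    if target[axis] < node.point[axis]:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    best = nearest_neighbor(near, target, best)

    # (b) Search the other side only if the hypersphere around the target with
    # radius "distance to the current closest point" crosses the splitting plane.
    if abs(target[axis] - node.point[axis]) < dist(best):
        best = nearest_neighbor(far, target, best)
    return best
```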

We now use an example to demonstrate the process of nearest neighbor search on the kd tree. The target point is $x = (2, 4.5)$, and we want to find the nearest neighbor of $x$.
Following the algorithm, after the first step we arrive at the point $(4,7)$; the path of the descent is $(7,2) \to (5,4) \to (4,7)$.
Second step: take the leaf node $(4,7)$ as the current closest point $o$. The distance from the target point to $o$ is 3.202.
Third step: recursively back up to the point $(5,4)$. (a) The distance from the target point to $(5,4)$ is 3.041, so $(5,4)$ is closer to the target point $x$ than the current closest point $o$, and we update the current closest point $o$ to $(5,4)$. (b) The circle centered at the target point $x$, with radius equal to the distance between the target point and the "current closest point $o$", intersects the region corresponding to $(2,3)$, the other child node of $o = (5,4)$. So we move to the point $(2,3)$; the distance from $(2,3)$ to the target point is smaller than that of the current closest point $o = (5,4)$, so the closest point $o$ is updated to $(2,3)$. We then continue to back up to the point $(7,2)$; the point $(7,2)$ is farther from the target point than the point $(2,3)$.
Fourth step: we back up to the root node and the search ends. The closest point $o = (2,3)$ is the nearest neighbor of the target point $x = (2, 4.5)$.
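Running the search sketch on this example reproduces the hand computation (again reusing the hypothetical helpers defined in the sketches above):

```python
T = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(T)
print(nearest_neighbor(tree, (2, 4.5)))  # -> (2, 3)
```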

The overall idea is roughly:

  1. First perform a binary search down the tree to find the leaf node;
  2. Backtrack along the search path;
  3. At each node, check whether the region of the other child node may contain a closer point;
  4. If so, search that other subspace;
  5. Finally backtrack to the root node, at which point the search ends.

4. Reference link

Code link: https://github.com/zgaxddnhh/lihang-code
Reference book: Machine Learning Methods, Li Hang
