Understanding the k-nearest neighbor algorithm and its code implementation

GitHub: code implementation
All algorithms in this article are implemented in Python 3.

1 KNN

KNN (the k-nearest neighbor method), as its name suggests, determines which category an unknown point belongs to based on its $ k $ nearest neighbors. It is defined in "Statistical Learning Methods" as:

Given a training dataset, for a new input instance, find the $ k $ instances in the training dataset that are closest to the instance; the class to which the majority of these $ k $ instances belong is the class assigned to the input instance.

Let us look at the definition intuitively. The known instance points are the colored points in the following figure, where different colors represent different categories, and the unknown point is the green point. To decide which category the unknown point belongs to, we need to find its $ k $ nearest neighbors.



When $ k=3 $, the result is shown by the first (inner) circle: there are 2 red instance points and 1 blue instance point, so the unknown point is assigned to the red category.
When $ k=5 $, the result is shown by the second (outer) circle: there are 2 red instance points and 3 blue instance points, so the unknown point is assigned to the blue category.

By now everyone should have a preliminary understanding of the idea behind the KNN algorithm. Next, we discuss the details and concrete implementation of the KNN algorithm, following the definition.

1.1 Key points of KNN algorithm:

Recall the definition: "Given a training data set, for a new input instance, find the $ k $ instances in the training data set that are closest to the instance; the class to which the majority of these $ k $ instances belong is the class assigned to the input instance." From the definition we can identify three key points of the KNN algorithm:

  • Distance measure
  • Choice of the value of $ k $
  • Classification decision rule

(1) Distance measure

In feature space, we usually use the distance between two instance points to measure their similarity. For the KNN model, the feature space is generally an $ n $-dimensional real vector space $ R^n $; that is, each instance point $ x_i=(x_i^{(1)}, x_i^{(2)},...,x_i^{(n)}) $ has $ n $ dimensions. The most commonly used distance is the Euclidean distance: \[ d(x_i,x_j)=\sqrt{\sum_{l=1}^n {(x_i^{(l)}-x_j^{(l)})^2}} \]
Of course, other distance formulas can also be used for measurement.
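As a quick illustration, the Euclidean distance above can be computed in a few lines of Python (a minimal sketch; the function name is my own):

```python
import math

def euclidean_distance(x_i, x_j):
    """Euclidean distance between two n-dimensional points given as sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```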

(2) Choice of $k$ value

The choice of $ k $ has a great influence on the results of KNN; as shown above, the class assigned to the unknown point differs between $ k=3 $ and $ k=5 $. A smaller value of $ k $ means a more complex overall model, which is prone to overfitting; a value of $ k $ that is too large makes the model too simple, which is also undesirable.
In practice, $ k $ usually takes a relatively small value, and cross-validation is used to select the optimal value of $ k $.
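As an illustration of choosing $ k $ by cross-validation, here is a minimal sketch using scikit-learn (used only for this illustration) with the iris dataset standing in for the training data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset

# Score each candidate k with 5-fold cross-validation and keep the best one.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 16)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```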

(3) Classification decision rule

After obtaining the $ k $ points closest to the unknown point, how do we decide which class the unknown point belongs to? The majority voting rule is usually adopted: count how many of the $ k $ nearest points fall into each category, and take the category with the largest count as the category of the unknown point. (This method has certain flaws, which we describe later.)
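A minimal sketch of the majority voting rule (the helper name is my own):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the label that occurs most often among the k nearest neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote(['red', 'blue', 'red']))  # red, as in the k=3 example above
```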

1.2 KNN algorithm steps:

Input: training data set: \[ T=\{ (x_1,y_1),(x_2,y_2),...,(x_N,y_N) \} \]
where $ x_i \in R^n $ is the feature vector of the instance, $ y_i \in \{c_1,c_2,...,c_K\} $ is the category of the instance, and $ i=1,2,...,N $;
Output: the category $ y $ to which the instance $ x $ belongs.
(1) According to the given distance metric, find the $ k $ points in the training set $ T $ that are nearest to $ x $; the neighborhood of $ x $ covering these $ k $ points is denoted $ N_k(x) $.

(2) In $ N_k(x) $, determine the category $ y $ of $ x $ according to the classification decision rule (such as majority voting): \[ y=\arg\max_{c_g}{\sum_{x_i \in N_k(x)}I(y_i=c_g)}, \quad i=1,2,...,N; \ g=1,2,...,K \]
In the above formula, $ I $ is the indicator function: $ I=1 $ when $ y_i=c_g $, and $ I=0 $ otherwise.
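Putting the three key points together, the following is a minimal brute-force sketch of the algorithm above (function and variable names are my own; the toy data is made up for illustration):

```python
import math
from collections import Counter

def knn_classify(x, dataset, labels, k=3):
    """Brute-force KNN: compute all distances, take the k nearest, majority vote."""
    distances = [math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_i)))
                 for x_i in dataset]
    # Indices of the k nearest points, i.e. the neighborhood N_k(x).
    nearest = sorted(range(len(dataset)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of the k nearest neighbors.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

dataset = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ['A', 'A', 'B', 'B']
print(knn_classify((0.1, 0.2), dataset, labels, k=3))  # B
```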

1.3 KNN algorithm analysis

Applicable data range: numeric and nominal values.
Advantages: high accuracy, insensitive to outliers, no assumptions about the input data.
Disadvantages:
(1) High computational complexity: for each unknown input instance, the distance to every known training instance must be computed, the top-$ k $ nearest must be found, and only then is the category decided. The distance computation amounts to a linear scan, which is very time-consuming when the training set is large.

(2) High space complexity: since the KNN algorithm must keep the entire data set, a large training data set requires a large amount of storage space.

(3) When the training samples are imbalanced (see the figure below), the majority voting rule in KNN finds that, among the $ k $ nearest points, there are clearly more blue points, so the unknown point is assigned to the blue category. But from the figure, the unknown point may actually be closer to the red instances. In other words, majority voting only considers the number of samples and ignores distance. We can therefore improve it with a mean or a weighting method. (Mean: for each category among the $ k $ nearest neighbors, compute the average distance from its instance points to the unknown point, and choose the category with the shortest average distance. Weight: neighbors closer to the sample get a larger weight, and neighbors farther from the sample get a smaller weight.)
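A minimal sketch of the distance-weighted voting improvement just described (the weights are simply the inverse of distance; the small `eps` that avoids division by zero is my own choice):

```python
from collections import defaultdict

def weighted_vote(neighbors, eps=1e-9):
    """neighbors: list of (distance, label) pairs for the k nearest points.
    Each neighbor votes with weight 1/distance, so closer points count more."""
    weights = defaultdict(float)
    for dist, label in neighbors:
        weights[label] += 1.0 / (dist + eps)
    return max(weights, key=weights.get)

# One very close red point outweighs two distant blue points.
print(weighted_vote([(0.5, 'red'), (2.0, 'blue'), (2.5, 'blue')]))  # red
```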



To improve the efficiency of the KNN algorithm, we can consider using a special data structure to store the training data and reduce the number of distance computations. One such method is introduced below: the kd tree.

2 Kd tree

We divide this part into two pieces: (1) constructing the kd tree and (2) searching the kd tree. The former uses a special structure to store the training data set, and the latter searches the kd tree structure for an instance's neighbors to find its category. (The method is similar to a binary search tree; in the following description, it is easier to understand by comparing it with a binary search tree.)

2.1 Construct kd tree

A kd tree is a tree data structure that stores instance points in a $ k $-dimensional space for fast retrieval. A kd tree is a binary tree that represents a partition of the $ k $-dimensional space. Constructing a kd tree amounts to repeatedly splitting the $ k $-dimensional space with hyperplanes perpendicular to the coordinate axes, forming a series of $ k $-dimensional hyperrectangular regions. Each node of the kd tree corresponds to a $ k $-dimensional hyperrectangular region.
The splitting method described in this article splits at the median (which yields a balanced kd tree); other splitting criteria can also be used, for example splitting on the dimension in which the data are most dispersed, with the variance as the splitting criterion.
The following explains the division of a two-dimensional space and the corresponding kd tree according to my own understanding:
Given a data set in two-dimensional space: \[ T=\{ (2,3),(5,4),(9,6),(4,7),(8,1),(7,2) \} \]
The root node corresponds to the rectangle containing the dataset $ T $. First select the $ x^{(1)} $ dimension and sort the dataset by it: $ \{(2,3),(4,7),(5,4),(7,2),(8,1),(9,6)\} $. For the $ x^{(1)} $ dimension the median value is 7 (here "median" means the element at the middle index of the sorted dataset, which can be obtained as $ median = dataset[len(dataset)//2] $), so $ (7,2) $ is selected as the root node. Then split on the $ x^{(1)} $ dimension: points with $ x^{(1)} $ less than 7 go into the left subtree of the root node, and points greater than 7 go into its right subtree (this process is similar to a binary search tree). A small sketch of this convention follows below.
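To make the median convention concrete, here is a small sketch with the example dataset (variable names are my own):

```python
points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]

dim = 0  # split on the x^(1) dimension
points.sort(key=lambda p: p[dim])
median_index = len(points) // 2      # index 3 for 6 points
split_point = points[median_index]   # (7, 2) becomes the root node
left = points[:median_index]         # points that go to the left subtree
right = points[median_index + 1:]    # points that go to the right subtree
print(split_point, left, right)
```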
Now the left and right subtrees must each be split on the next dimension (the $ x^{(2)} $ dimension here). Left subtree: sort the points $ \{(2,3),(4,7),(5,4)\} $ by the $ x^{(2)} $ dimension to get $ \{(2,3),(5,4),(4,7)\} $, and select the median $ (5,4) $ as the root of this subtree; similarly, points with $ x^{(2)} $ less than 4 go into the left subtree of the $ (5,4) $ node, and points greater than 4 go into its right subtree. For the right subtree of the root node $ (7,2) $: sort the points $ \{(8,1),(9,6)\} $ and choose the median value 6; since $ 1 < 6 $, $ (8,1) $ becomes the left child of $ (9,6) $. At this point, the kd tree construction is complete. (Within each layer of the kd tree, the same dimension is used for splitting, and each split is equivalent to one step of building a binary search tree.)
The following figure shows the resulting division of the two-dimensional space:



The corresponding kd tree is:


At first, I also puzzled over the two-dimensional space division diagram for quite a while before I understood it, so here is my understanding in the form of a diagram:
Combining the two-dimensional space division diagram with the kd tree diagram gives the figure below. The vertical green line is regarded as the root node; the left half of the whole figure is the left-subtree space of the root node, and the right half is its right-subtree space:



Next consider the left and right subtrees of the root node. The green horizontal line in the left half can be regarded as the root node of the left subtree; the part below it is smaller than this node in the $ x^{(2)} $ dimension, so it is the left-subtree space of that node, and the part above it is the right-subtree space. Similarly, in the right half of the figure, the green horizontal line in the middle is regarded as the root node of the right subtree; the part below it is the left-subtree space relative to that node, and the part above it is the right-subtree space.



Algorithm for constructing a balanced kd tree:
Input: $ k $ dimensional spatial dataset: \[ T=\{ x_1,x_2,...,x_N \} \] ,
where $ x_i=(x_i^{(1)}, x_i^{(2)},...,x_i^{(k)}), i=1,2,...,N $
Output: kd tree
(1) Start: construct the root node, which corresponds to the hyperrectangular region of the $ k $-dimensional space containing $ T $. Select $ x^{(1)} $ as the splitting axis and take the median of the $ x^{(1)} $ coordinates of all instances in $ T $ as the cut point, dividing the hyperrectangular region corresponding to the root node into two subregions. The cut is realized by a hyperplane passing through the cut point and perpendicular to the axis $ x^{(1)} $. The root node generates left and right child nodes of depth 1: the left child corresponds to the subregion whose $ x^{(1)} $ coordinate is less than the cut point, and the right child to the subregion whose $ x^{(1)} $ coordinate is greater than the cut point. The instance point falling on the cutting hyperplane is saved at the root node.

(2) Repeat: for a node at depth $ j $, select $ x^{(l)} $ as the splitting axis, where $ l = (j \bmod k) + 1 $, and take the median of the $ x^{(l)} $ coordinates of all instances in this node's region as the cut point, dividing the hyperrectangular region corresponding to the node into two subregions. The cut is realized by a hyperplane passing through the cut point and perpendicular to the axis $ x^{(l)} $. The node generates left and right child nodes of depth $ j+1 $: the left child corresponds to the subregion whose $ x^{(l)} $ coordinate is less than the cut point, and the right child to the subregion whose $ x^{(l)} $ coordinate is greater than the cut point. The instance point falling on the cutting hyperplane is saved at this node. Construction stops when no instance points remain in either subregion. A code sketch of this procedure is given below.
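Under the steps above, the construction can be sketched in a few lines of Python (the `KDNode` class and `build_kdtree` function are my own names; the code uses the 0-indexed axis `depth % k` rather than the 1-indexed $ l=(j \bmod k)+1 $ of the algorithm). It reproduces the two-dimensional example from section 2.1:

```python
class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point = point  # the cut point stored at this node
        self.axis = axis    # the dimension used to split at this depth
        self.left = left
        self.right = right

def build_kdtree(points, depth=0):
    """Recursively build a kd tree by splitting at the median along axis depth % k."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  left=build_kdtree(points[:mid], depth + 1),
                  right=build_kdtree(points[mid + 1:], depth + 1))

tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.point, tree.left.point, tree.right.point)  # (7, 2) (5, 4) (9, 6)
```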

2.2 Search kd tree

Using a kd tree allows most data points to be skipped during the search, which reduces the amount of computation. Take nearest-neighbor search as an example: given a target point, first find the leaf node containing the target point; then, starting from that leaf node, back up through the parent nodes in turn, at each step keeping the node that is currently nearest to the target point, and terminate when it is certain that no closer node exists. In this way, the search is confined to a local region of the space, and efficiency is greatly improved.
From the perspective of space partitioning, take the kd tree constructed earlier as an example and find the nearest neighbor of the target point $ (3,4.5) $. First perform a binary search: from $ (7,2) $ move to the $ (5,4) $ node; at that node the splitting hyperplane is $ y=4 $, and since the target point's $ y $ value is 4.5, we enter the right subspace and reach $ (4,7) $, giving the search path $ (7,2) \rightarrow (5,4) \rightarrow (4,7) $, and we take $ (4,7) $ as the current nearest point. Centered on the target point, draw a red circle whose radius 2.69 is the distance from the target point to the current nearest point. Then backtrack to $ (5,4) $: its distance to the target point is 2.06, so this node is closer than the current nearest point, and $ (5,4) $ becomes the current nearest point. Determine a green circle in the same way; it intersects the hyperplane $ y=4 $ (an intersection means the overlapping region may contain a point closer to the target than $ (5,4) $, so whenever there is an intersection we must search the other subspace). We therefore enter the other subspace of the $ (5,4) $ node. The $ (2,3) $ node is at distance 1.8 from the target point, which is closer than the current nearest point, so the nearest neighbor is updated to $ (2,3) $ and the nearest distance to 1.8, which again determines a blue circle. Backtracking then reaches the root node $ (7,2) $; the blue circle does not intersect the hyperplane $ x=7 $, so there is no need to search the right subspace of $ (7,2) $. The search path has now been fully backtracked, so the nearest neighbor $ (2,3) $ is returned with nearest distance 1.8.



The nearest-neighbor search algorithm with a kd tree:
Input: the constructed kd tree and a target point $ x $;
Output: the nearest neighbor of $ x $.

(1) Find the leaf node in the kd tree that contains the target point $ x $: starting from the root node, recursively move down the kd tree; if the coordinate of the target point in the current splitting dimension is less than that of the cut point, move to the left child node, otherwise move to the right child node, until a leaf node is reached;

(2) Take this leaf node as the "current closest point";

(3) Recursively go back up, and perform the following operations at each node:

  (a) If the instance point saved by the node is closer to the target point than the current closest point, the instance point is taken as the "current closest point";

  (b) The current closest point must lie in the region corresponding to one child of this node. Check whether the region corresponding to the other child contains a closer point. Concretely, check whether that region intersects the hypersphere centered at the target point whose radius is the distance between the target point and the "current closest point". If they intersect, there may be a point closer to the target in the region corresponding to the other child node; move to that child node and continue the nearest-neighbor search recursively. If they do not intersect, back up.

(4) The search ends when it falls back to the root node. The last "current closest point" is the nearest neighbor of x.
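Here is a minimal recursive sketch of this search, reusing the `KDNode`/`build_kdtree` sketch and the `tree` built in section 2.1 (function names are my own; `math.dist` requires Python 3.8+; it checks each node on the way down rather than while backing up, but the pruning test in step (b) is the same):

```python
import math

def nearest_neighbor(node, target, best=None):
    """Return (distance, point) of the nearest neighbor of target in the kd tree."""
    if node is None:
        return best
    # (a) If this node is closer than the current closest point, update it.
    dist = math.dist(target, node.point)
    if best is None or dist < best[0]:
        best = (dist, node.point)
    axis = node.axis
    # Descend first into the subtree on the target's side of the cutting hyperplane.
    near, far = ((node.left, node.right) if target[axis] < node.point[axis]
                 else (node.right, node.left))
    best = nearest_neighbor(near, target, best)
    # (b) Visit the other subtree only if the hypersphere crosses the cutting hyperplane.
    if abs(target[axis] - node.point[axis]) < best[0]:
        best = nearest_neighbor(far, target, best)
    return best

dist, point = nearest_neighbor(tree, (3, 4.5))
print(point, round(dist, 2))  # (2, 3) 1.8, as in the worked example above
```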

2.3 Analysis of kd tree algorithm

If the instance points are randomly distributed, the average computational complexity of a kd tree search is $ O(\log N) $, where $ N $ is the number of training instances. The kd tree is better suited to $ k $-nearest-neighbor search when the number of training instances is much larger than the space dimension; when the space dimension approaches the number of training instances, its efficiency drops rapidly, almost degenerating to a linear scan.

References:
[1] https://www.cnblogs.com/21207-iHome/p/6084670.html
[2] "Statistical Learning Methods" by Li Hang
[3] "Machine Learning in Practice" by Peter Harrington

Written at the end: this article integrates and summarizes the materials above and is otherwise original. There may be misunderstandings in it; if you have any comments or objections, please leave them below. Thank you!
