Machine Learning Series: The kd-Tree Algorithm for kNN, in Detail

The kNN algorithm partitions the feature space by computing the distance between a new input instance and the training instances: the distance between two instances in feature space expresses how similar they are, and the Euclidean distance is the usual choice. A naive implementation therefore has to compute the distance from each new input instance to every training instance and sort the results. When the training set is very large this is time-consuming and memory-hungry, and the algorithm's efficiency suffers. That is the basic intuition behind kNN.
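To make the cost concrete, here is a minimal sketch of the naive search just described (my own illustration, not code from the original post; the point set and query are arbitrary examples): every query has to measure the distance to all training points and sort them.

```python
import math

def knn_brute_force(train, query, k=3):
    # Distance from the query to every training point, then a sort:
    # O(n log n) work per query, which is what the kd-tree is meant to avoid.
    return sorted(train, key=lambda p: math.dist(p, query))[:k]

train = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print(knn_brute_force(train, (3, 4.5), k=2))   # [(2, 3), (5, 4)]
```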

kd-tree (k-dimensional tree)

A kd-tree is a binary tree structure for storing points in k-dimensional space so that they can be retrieved quickly. Searching with a kd-tree lets us skip most of the data points, which greatly reduces the amount of computation a search requires.

Every node of a kd-tree is a k-dimensional data point, and every node represents a hyperplane perpendicular to the coordinate axis of the node's current splitting dimension. This hyperplane divides the space into two parts along that dimension: one part goes into the node's left subtree, the other into its right subtree. That is, if the current node splits on dimension d, then every point in its left subtree has a d-coordinate smaller than the node's, and every point in its right subtree has a d-coordinate greater than or equal to the node's. This invariant holds at every node of the tree.
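As a concrete picture of this invariant, here is a minimal node definition (the class and field names are my own, not from the original post):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class KdNode:
    point: Tuple[float, ...]          # the k-dimensional point stored at this node
    axis: int                         # the splitting dimension d
    left: Optional["KdNode"] = None   # points with point[axis] <  self.point[axis]
    right: Optional["KdNode"] = None  # points with point[axis] >= self.point[axis]
```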

Example: build a kd-tree for the point set (2,3), (5,4), (9,6), (4,7), (8,1), (7,2).
To build the root node we split on dimension x. Sorted by x, the set is (2,3), (4,7), (5,4), (7,2), (8,1), (9,6), and the chosen splitting point is (7,2). (Note: the x values are 2, 4, 5, 7, 8, 9, and the mathematical median would be (5 + 7) / 2 = 6, but the algorithm must pick a point from the set, so in code the median index len(points) // 2 = 3 is used, giving points[3] = (7,2).)

(2,3), (4,7), (5,4) hang on the left subtree of the (7,2) node; (8,1), (9,6) hang on its right subtree.

To build the left subtree of (7,2), the point set is (2,3), (4,7), (5,4) and the splitting dimension is now y. The point (5,4) becomes the splitting plane: (2,3) hangs on its left subtree and (4,7) on its right subtree.

To build the right subtree of (7,2), the point set is (8,1), (9,6) and the splitting dimension is again y. The point (9,6) becomes the splitting plane, and (8,1) hangs on its left subtree. The kd-tree is now complete.
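The walkthrough above can be reproduced with a short sketch (my own code, assuming round-robin axis selection that cycles x then y by depth, and the upper-median rule len(points) // 2 from the note above):

```python
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])

def build(points, depth=0, k=2):
    """Build a kd-tree, cycling the split axis with depth: x, y, x, ..."""
    if not points:
        return None
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                      # upper median, as in the note above
    return Node(points[mid], axis,
                build(points[:mid], depth + 1, k),
                build(points[mid + 1:], depth + 1, k))

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(tree.point)        # (7, 2)  -- root, split on x
print(tree.left.point)   # (5, 4)  -- split on y
print(tree.right.point)  # (9, 6)  -- split on y; (8, 1) hangs on its left
```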

Following these steps, the kd-tree partitions the two-dimensional plane as shown below:

(Figure: the kd-tree's axis-aligned partition of the plane for the example point set.)

How is a kd-tree constructed?

For a binary tree such as the kd-tree, we first have to decide how to divide points between the left and right subtrees, that is, on what basis a k-dimensional data point is sent into the left subtree or the right subtree.

When constructing a one-dimensional BST, each datum is compared with the values stored at the root and the internal nodes to decide whether it belongs in the left or the right subtree. We can follow the same pattern for k-dimensional data in a kd-tree, except that the comparison does not use all k dimensions at once. Instead we select one dimension Di and compare two k-dimensional points only by their values on Di. In other words, each split selects a dimension Di and divides the data with a hyperplane perpendicular to Di: every k-dimensional point on one side of the plane has a smaller value on dimension Di than every point on the other side. Each selected dimension thus splits the k-dimensional space into two parts; if we continue to split the two resulting subspaces in the same way, we obtain new, smaller subspaces, and the process repeats until a subspace can no longer be divided. Constructing a kd-tree is exactly this process, and it raises two important questions:

1. At each split, how do we decide which dimension to split on?
2. When splitting on a dimension, how do we ensure that the two resulting subsets are as equal in size as possible, i.e. that the left and right subtrees contain nearly the same number of nodes?

Solving the first problem

The simplest approach is round-robin: if the data were split on the i-th dimension this time, the next split is made on the next dimension j (wrapping around, j = (i + 1) mod k).

Picture a k-dimensional data set as a piece of wood: the dimension along which the wood is longest is the dimension along which the data are spread most widely. Mathematically speaking, it is the dimension on which the data have the largest variance. Because the data are most dispersed along that dimension, it is easiest to separate them there. This suggests another rule for choosing the splitting dimension: maximum variance, i.e. at every split choose the dimension whose data have the largest variance.
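A possible implementation of the maximum-variance rule, sketched with NumPy (the function name is mine):

```python
import numpy as np

def max_variance_axis(points):
    """Return the dimension along which the data are most spread out."""
    pts = np.asarray(points, dtype=float)
    return int(np.argmax(pts.var(axis=0)))   # index of the largest per-dimension variance

# For the example point set, x (axis 0) has the larger variance:
print(max_variance_axis([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]))   # 0
```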

Solving the second problem

Suppose the maximum-variance rule has selected dimension i for splitting the k-dimensional data set S. We now need to split S on dimension i into two subsets A and B such that every point in A has a smaller value on dimension i than every point in B. Consider the simplest method first: pick one point as the comparison object (the splitting pivot) and compare every other point of S with the pivot on dimension i; points smaller than the pivot go into set A, points larger into set B. Treating A and B as the left and right subtrees yields a binary tree, and of course we would like that tree to be as balanced as possible, i.e. for the left and right subtrees to contain nearly the same number of nodes. The sizes of A and B clearly depend on the pivot, since both sets are produced by comparison with it. The problem is therefore how to choose the pivot. Given an array, how do we obtain two sub-arrays of nearly equal size such that every element of one is smaller than every element of the other? The answer is simple: find the median of the array and compare every element with it. Likewise, when splitting on dimension i, take as the pivot the median of all the data's values on dimension i; the two resulting subsets then have essentially the same number of points.
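A sketch of the median split (using the same upper-median index len(points) // 2 as in the example earlier; the function name is mine):

```python
def split_at_median(points, axis):
    """Split on the median of an axis: (pivot, smaller side, larger-or-equal side)."""
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                       # upper-median index as the pivot
    return pts[mid], pts[:mid], pts[mid + 1:]

pivot, A, B = split_at_median([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)], axis=0)
print(pivot, A, B)   # (7, 2) [(2, 3), (4, 7), (5, 4)] [(8, 1), (9, 6)]
```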

The kd-tree construction algorithm is therefore:

(1) Select the dimension with the largest variance in the k-dimensional data set, take the median m of the data's values on that dimension as the pivot to split the data set into two subsets, and create a tree node to store the pivot;

(2) Repeat step (1) on the two subsets until no subset can be subdivided further; the data of a subset that cannot be subdivided are stored in a leaf node.
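Putting steps (1) and (2) together, here is a minimal recursive build (my own sketch; unlike the leaf-bucket variant described in step (2), it stores one point at every node, which is the more common formulation):

```python
import numpy as np

class Node:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis, self.left, self.right = point, axis, left, right

def build_kdtree(points):
    if not points:
        return None
    pts = np.asarray(points, dtype=float)
    axis = int(np.argmax(pts.var(axis=0)))    # step (1): dimension of largest variance
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                    # median value m as the pivot
    return Node(points[mid], axis,
                build_kdtree(points[:mid]),   # step (2): recurse on both subsets
                build_kdtree(points[mid + 1:]))

root = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(root.point, root.axis)   # (7, 2) 0 -- x has the largest variance for this set
```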

Steps of the nearest-neighbor query on a kd-tree

(1) Starting from the root node, compare the query datum Q with each node visited and descend the kd-tree accordingly until a leaf node is reached.

Here "compare" means comparing Q's value on the node's splitting dimension k with the node's value m on that dimension: if Q(k) < m, descend into the left subtree, otherwise descend into the right subtree. On reaching a leaf node, compute the distance between Q and the data stored in the leaf, and record the data point with the smallest distance as the current "nearest neighbor" Pcur, with minimum distance Dcur.

(2) Backtracking: try to find a point even closer to Q than the current "nearest neighbor". That is, determine whether any branch that has not yet been visited could contain a point closer to Q, i.e. one whose distance to Q is smaller than Dcur.

If the distance between Q and the splitting boundary of an unvisited branch under some parent node is smaller than Dcur, the branch may contain data closer to Q. Enter that node and carry out the same search as in step (1); if a closer data point is found, it becomes the new current "nearest neighbor" Pcur, and Dcur is updated.

If the distance between Q and the unvisited branch under its parent node is greater than Dcur, then no point closer to Q can exist in that branch, and the branch is skipped.
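Combining the descent of step (1) with the backtracking test of step (2), here is a minimal nearest-neighbor search (my own sketch; the tree is built with round-robin axis selection for brevity, and the query point is an arbitrary example):

```python
import math
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])

def build(points, depth=0, k=2):
    if not points:
        return None
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return Node(points[mid], axis,
                build(points[:mid], depth + 1, k),
                build(points[mid + 1:], depth + 1, k))

def nearest(node, q, best=None):
    """Return (Pcur, Dcur): the nearest stored point to q and its distance."""
    if node is None:
        return best
    d = math.dist(q, node.point)
    if best is None or d < best[1]:
        best = (node.point, d)                # update the current "nearest neighbor"
    # Step (1): descend into the subtree on q's side of the splitting plane.
    if q[node.axis] < node.point[node.axis]:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    best = nearest(near, q, best)
    # Step (2): backtrack -- enter the unvisited branch only if the splitting
    # plane is closer to q than Dcur; otherwise no closer point can be there.
    if abs(q[node.axis] - node.point[node.axis]) < best[1]:
        best = nearest(far, q, best)
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (3, 4.5)))   # ((2, 3), 1.802...) -- matches brute force
```

Note how the final backtracking test at the root (|3 - 7| = 4 > Dcur) prunes the entire right subtree, which is exactly the saving the kd-tree is designed to provide.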

