"Statistical learning methods," notes --k neighbor algorithm

k-nearest neighbor (k-NN) is a basic method for classification and regression. Given a training data set, for a new input instance the algorithm finds the k instances in the training set nearest to it, and then decides the class of the new instance according to a classification decision rule (such as majority voting, i.e. the minority obeys the majority).

The k-nearest neighbor model has three basic elements: the distance metric, the selection of the value of k, and the classification decision rule.

1. Distance metric

The distance between two instance points in the feature space reflects their degree of similarity. The $L_p$ distance between instances $x_i = (x_i^{(1)}, \ldots, x_i^{(n)})^T$ and $x_j = (x_j^{(1)}, \ldots, x_j^{(n)})^T$ is defined as

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}$$

In general p is taken to be 2, which gives the familiar Euclidean distance

$$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{\frac{1}{2}}$$

Of course, different distance metrics determine different nearest neighbors.

For example, given three points in two-dimensional space, $x_1 = (1,1)^T$, $x_2 = (5,1)^T$ and $x_3 = (4,4)^T$: since $x_1$ and $x_2$ differ only in the value of the first dimension, their distance $L_p(x_1, x_2) = 4$ is the same regardless of the value of p. For $x_1$ and $x_3$, however, $L_1(x_1, x_3) = 6$, $L_2(x_1, x_3) = 4.24$, $L_3(x_1, x_3) = 3.78$ and $L_4(x_1, x_3) = 3.57$. Thus when p equals 1 or 2, $x_2$ is the nearest neighbor of $x_1$; when p is greater than or equal to 3, $x_3$ is the nearest neighbor of $x_1$.
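To check these numbers, here is a minimal sketch that computes the $L_p$ distance for several values of p, using the three points from the example above:

```python
import numpy as np

def lp_distance(x, y, p):
    """L_p (Minkowski) distance between two points."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x1, x2, x3 = np.array([1, 1]), np.array([5, 1]), np.array([4, 4])

for p in (1, 2, 3, 4):
    print(f"p={p}: L(x1,x2)={lp_distance(x1, x2, p):.2f}, "
          f"L(x1,x3)={lp_distance(x1, x3, p):.2f}")
# p=1: L(x1,x2)=4.00, L(x1,x3)=6.00
# p=2: L(x1,x2)=4.00, L(x1,x3)=4.24
# p=3: L(x1,x2)=4.00, L(x1,x3)=3.78  <- x3 becomes the nearest neighbor
# p=4: L(x1,x2)=4.00, L(x1,x3)=3.57
```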

2. Selection of the value of k

The choice of the value of k has a significant effect on the results of the k-nearest neighbor algorithm.

If k is chosen too small, the model becomes more complex and prone to over-fitting; if k is chosen too large, the model becomes overly simple and cannot classify new instances accurately. In practice, one usually chooses a relatively small value of k and then applies cross-validation to select the optimal value, as in the sketch below.
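As a minimal sketch of this procedure with scikit-learn (the iris data set here is just a stand-in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate a range of small k values with 5-fold cross-validation
scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, mean accuracy = {scores[best_k]:.3f}")
```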

3. Classification decision rule

Majority voting is the classification decision rule most often used in the k-nearest neighbor algorithm. Under this rule, if the neighborhood $N_k(x)$ of a new instance point x is assigned class $c_j$, the misclassification rate is

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

So to minimize the misclassification rate, that is, to minimize the empirical risk, we must maximize $\sum_{x_i \in N_k(x)} I(y_i = c_j)$.

Therefore, the majority voting rule is equivalent to empirical risk minimization.
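The following brute-force sketch makes the rule concrete: it classifies a point by taking a majority vote among its k nearest training points (the toy data is illustrative):

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances to x
    nearest = np.argsort(dists)[:k]              # indices of the k closest
    # Picking the class with the most votes maximizes sum I(y_i = c_j),
    # i.e. it minimizes the empirical misclassification rate above
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # -> 1
```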

 

Implementation of the k-nearest neighbor algorithm -- the kd tree

 

Kd tree construction

Input: a data set in k-dimensional space, $T = \{x_1, x_2, \ldots, x_N\}$, where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(k)})^T$;

  1. Start: construct the root node, which corresponds to the hyper-rectangular region containing all instances. Choose $x^{(1)}$ as the splitting axis and take the median of the $x^{(1)}$ coordinates of all instances as the split point, cutting the root's hyper-rectangular region into two sub-regions and generating two child nodes of depth 1. The left sub-region contains the instances whose $x^{(1)}$ coordinate is less than that of the split point, and the right sub-region those whose coordinate is greater.
  2. Repeat: for a node of depth j, choose $x^{(l)}$ as the splitting axis, where $l = j \,(\mathrm{mod}\ k) + 1$, and take the median of the $x^{(l)}$ coordinates of all instances in the node's region as the split point, cutting the corresponding hyper-rectangular region into two sub-regions.
  3. Stop when the two sub-regions contain no more instances.

For example, given the two-dimensional data set $T = \{(2,3)^T, (5,4)^T, (9,6)^T, (4,7)^T, (8,1)^T, (7,2)^T\}$, construct a balanced kd tree.

Figure 1-1: Example kd tree
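A minimal sketch of the construction procedure, following the steps above (median split, axis cycling with depth):

```python
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])

def build_kdtree(points, depth=0):
    """Recursively build a kd tree by median splits, cycling the axis."""
    if not points:
        return None
    k = len(points[0])       # dimensionality of the space
    axis = depth % k         # splitting axis at this depth
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2   # the median instance becomes the split point
    return Node(point=points[mid],
                axis=axis,
                left=build_kdtree(points[:mid], depth + 1),
                right=build_kdtree(points[mid + 1:], depth + 1))

# The example data set above
data = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(data)
print(tree.point)  # (7, 2), the root split on x^(1)
```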

Kd tree search

Given a target point x, find its nearest neighbor in the kd tree.

(1) Starting from the root of the kd tree, move down recursively: if the coordinate of the target point x in the current splitting dimension is less than that of the split point, descend into the left subtree, otherwise into the right subtree, until a leaf node is reached.

(2) Take the leaf node just reached as the current nearest neighbor (note: it may not be the true nearest neighbor) and record the current nearest-neighbor distance. Then backtrack along the path: at each parent node of the leaf, compare the current nearest-neighbor distance with the distance from the target point to that node's splitting hyperplane. If the current nearest-neighbor distance is smaller, the other side of the hyperplane cannot contain a closer point and need not be visited; otherwise, traverse the subtree on the other side and update the current nearest neighbor and distance whenever a closer node is found.

(3) Continue backtracking as described in (2) until the root node is reached, at which point the search is complete. A sketch of this procedure follows.
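A minimal sketch of this backtracking search, reusing the Node structure and the tree built in the construction sketch above:

```python
import math

def nearest_neighbor(node, target, best=None, best_dist=float("inf")):
    """Recursive kd-tree nearest-neighbor search with backtracking."""
    if node is None:
        return best, best_dist
    d = math.dist(node.point, target)
    if d < best_dist:                    # update the current nearest neighbor
        best, best_dist = node.point, d
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best, best_dist = nearest_neighbor(near, target, best, best_dist)
    # Cross the splitting hyperplane only if the ball around the target
    # with radius best_dist intersects the region on the other side
    if abs(diff) < best_dist:
        best, best_dist = nearest_neighbor(far, target, best, best_dist)
    return best, best_dist

print(nearest_neighbor(tree, (3, 4.5)))  # -> ((2, 3), 1.80...)
```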

     

Improving the kd tree search: BBF (best bin first)

The kd tree search algorithm described above is well suited to data sets where the number of instances is much larger than the number of dimensions; for data of higher dimensionality, the BBF algorithm can noticeably improve search performance.

(1) If the kd tree is empty, set the nearest-neighbor distance to infinity and return; if the kd tree is non-empty, add its root node to a priority queue;

(2) Dequeue the node with the highest priority from the priority queue, compute the distance from that point to the query point, and if it is smaller than the current nearest-neighbor distance, update the nearest neighbor and the nearest-neighbor distance. If the query point's coordinate in the splitting dimension is less than that of the current point, add its right child to the queue and continue the search in its left child; otherwise, add its left child to the queue and continue the search in its right child. Repeat this descent, adding nodes to the queue, until a leaf node is reached; then dequeue the highest-priority node from the priority queue again;

(3) Repeat the operations in (1) and (2) until the priority queue is empty or the specified number of iterations is exceeded, and return the current nearest neighbor and distance. A sketch of this idea follows.
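A minimal sketch of BBF using Python's heapq as the priority queue; the priority of a queued subtree is its distance to the splitting hyperplane (smaller = higher priority), and max_checks is an illustrative parameter capping the number of dequeues as in step (3). It reuses Node and tree from the construction sketch:

```python
import heapq
import math

def bbf_nearest(root, target, max_checks=200):
    """Best-bin-first kd-tree search: explore subtrees in order of their
    distance to the query, stopping after max_checks dequeued nodes."""
    best, best_dist = None, float("inf")
    heap = [(0.0, 0, root)]          # (priority, tie-breaker, node)
    counter = 1                      # tie-breaker so nodes are never compared
    checks = 0
    while heap and checks < max_checks:
        _, _, node = heapq.heappop(heap)   # dequeue the highest-priority node
        checks += 1
        while node is not None:            # descend toward a leaf
            d = math.dist(node.point, target)
            if d < best_dist:              # update the nearest neighbor
                best, best_dist = node.point, d
            diff = target[node.axis] - node.point[node.axis]
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            if far is not None:            # queue the far side for later
                heapq.heappush(heap, (abs(diff), counter, far))
                counter += 1
            node = near
    return best, best_dist

print(bbf_nearest(tree, (3, 4.5)))  # matches the exact search on small data
```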

