K-Nearest Neighbor (KNN): An Instance-Based Learning Method

1. Instance-Based Learning Algorithms

0x1: Some background on data mining

This article is an introduction to the k-nearest neighbor data mining algorithm. Data mining, broadly speaking, is the discipline concerned with finding patterns in data.

In fact, the history of human science and technology has always been accompanied by data mining; people have always tried to find patterns in data:

  • Hunters look for patterns in the migratory behavior of animals
  • Farmers look for patterns in the growth of crops
  • Politicians look for patterns in the opinions of voters
  • Lovers look for patterns in each other's reactions
  • Entrepreneurs look for patterns in behavior that can be turned into profitable business opportunities (the so-called successful models) and then exploit those opportunities
  • Scientists try to understand data, look for patterns in it, and use those patterns to explain how the real world works
Data mining is, in essence, solving problems by analyzing data that already exists in databases; it is defined as the process of discovering patterns in data, and that process must be automatic or semi-automatic.

The next question is: how do we represent the patterns found in data?

A valuable pattern allows us to make non-trivial predictions about new data. There are two extreme ways to represent a pattern:

  • as a black box whose internal structure is incomprehensible
  • as a transparent box whose construction reveals the structure of the pattern

Both can make good predictions; the difference is whether the mined pattern can be expressed as a clearly visible structure, one that withstands analysis and justification and can be used to inform future decisions.

If a pattern can be expressed in a way that makes the decision structure obvious, it is called a structural pattern. In other words, a structural pattern helps explain something about the data; it is interpretable.

Going a step further, the process by which machine learning mines structural patterns from data is called knowledge representation. Machine learning models can express what they find in many different ways; each is a technique for inferring an output structure from the data, including:

  • Table: the output uses the same form as the input and simply stores all of the input data. This is the simplest, most basic output structure in machine learning and is essentially just storage plus lookup.
  • Regression table: a linear regression model is, at its core, a regression table. The key question for a regression table is feature selection, i.e., producing a concise decision table smaller than the original data; the crucial issue is determining which attributes can be removed without affecting the final decision.
  • Decision table: a decision tree model is one kind of decision table.
  • Instance table: the k-nearest neighbor method introduced in this article is an instance-based learning algorithm. It optimizes the traditional table, effectively reducing storage overhead and improving search efficiency.

0x2: What are instance-based learning algorithms?

Having covered the surrounding background, let us narrow our view to the area of instance-based machine learning.

In instance-based learning, the training examples are stored verbatim, and a distance function is used to determine which instance of the training set is closest to an unknown test instance. Once the nearest training instance is found, its class is predicted to be the class of the test instance.

Instance-based algorithms are sometimes called memory-based learning. They do not explicitly induce a probability distribution or a decision boundary; instead, a new problem is compared against the training examples seen so far, which are held in memory.

Relevant Link:

Ian H. Witten, Eibe Frank, Mark A. Hall, "Data Mining: Practical Machine Learning Tools and Techniques"

2. Starting from the Problem of Efficiently Finding Nearest Neighbors

Let's look at the figure below.

Our goal is to find the instance nearest to the green point in the figure. The answer seems obvious: clearly it is the red triangle at the lower right.

But apply even a small transformation to the problem, say expanding the space from 2 dimensions to 100 dimensions and increasing the density of points tenfold. Is the answer still obvious? Clearly not!

The most intuitive idea is to build a table that stores the coordinates of all the data points in the figure, for example:

point    x1    ......    xn
A        1     ......    2
...
N        5     ......    3

Then, each time a new point arrives, we traverse the whole table and, according to some distance function (e.g., Euclidean distance), find the one or k closest instances.
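
As an illustration (in Python with NumPy; the function name and toy data are made up for the example), the linear-scan lookup just described looks like this:

```python
import numpy as np

def brute_force_knn(train_X, query_x, k=3):
    """Linear scan: compute the distance from the query point to every
    stored training point and return the indices of the k closest ones."""
    dists = np.linalg.norm(train_X - query_x, axis=1)   # O(N) distance computations
    return np.argsort(dists)[:k]                        # indices of the k smallest distances

train_X = np.array([[1.0, 2.0], [5.0, 3.0], [0.5, 1.5], [4.0, 4.0]])
print(brute_force_knn(train_X, np.array([1.0, 1.0]), k=2))   # -> [2 0]
```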

This kind of requirement is very common in the real world, for example:

  • Finding similar documents
  • Similarity clustering
  • Object classification
  • .....

Although the look-up table described above is simple and effective, its drawback is that it is very slow. The process is linear in the number of training examples; in other words, the time for a single prediction is proportional to the number of training instances, and processing an entire test set takes time proportional to the product of the number of training instances and the number of test instances, i.e., O(train × test).

How do we resolve this conflict? Researchers started from data-structure optimization and proposed a series of efficient data structures and search algorithms, collectively referred to as k-nearest neighbor methods. The core problem the k-nearest neighbor algorithm solves is how to find the set of nearest neighbor instances efficiently; the structures supporting this search are themselves a kind of knowledge representation. We discuss them next.

3. Introduction to the K-Nearest Neighbor (KNN) Algorithm

K-nearest neighbor (KNN) is a data-driven, instance-based algorithm for classification and regression. It is a discriminative model.

0x1: Applicable scenarios

1. Multi-class classification

In multi-class classification, the input to the k-nearest neighbor method is an instance's feature vector, which corresponds to a point in feature space; the output is the instance's class.

Suppose we are given a training data set in which the class of every instance is already determined. To classify a new instance, k-nearest neighbors predicts its class by majority vote among its k nearest training instances. The k-nearest neighbor method therefore has no explicit learning process (it is a lazy, or deferred, learner); in effect it uses the training data to partition the feature space, and that partition serves as its classification "model".

2. Regression

The KNN algorithm can be used not only for classification but also for regression: find a sample's k nearest neighbors and assign the average of those neighbors' target values to the sample to obtain its predicted value.

A more useful variant assigns the neighbors different weights according to their distance from the sample, with the weight inversely proportional to the distance, so that closer neighbors contribute more.
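
A minimal sketch of such distance-weighted k-NN regression (the function name and toy data are illustrative, not from the article):

```python
import numpy as np

def knn_regress(train_X, train_y, query_x, k=3, eps=1e-8):
    """Distance-weighted k-NN regression: closer neighbors get larger weights
    (weights inversely proportional to distance)."""
    dists = np.linalg.norm(train_X - query_x, axis=1)
    idx = np.argsort(dists)[:k]                 # indices of the k nearest neighbors
    weights = 1.0 / (dists[idx] + eps)          # eps avoids division by zero
    return np.sum(weights * train_y[idx]) / np.sum(weights)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_regress(X, y, np.array([1.2]), k=2))  # weighted blend of y[1] and y[2] -> 1.6
```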

0x2: Algorithm model

Given a training data set $T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^n$ is the feature vector of an instance and $y_i \in \mathcal{Y}$ is the class of the instance.

According to a given distance metric, find the K points in the training set nearest to x (K is a key parameter that must be set manually), and then determine the class y of x from those K neighbors according to a classification decision rule (such as majority vote).

A special case of k-nearest neighbors is k = 1, called the nearest neighbor algorithm: for an input instance point (feature vector) x, the class of the training instance nearest to x is taken as the class of x.

This can be viewed as predicting $y = \arg\max_y P(y \mid x)$, where the conditional probability $P$ is not modeled explicitly but is implicitly determined by minimizing a loss; as shown later, majority voting corresponds to minimizing the empirical risk under 0-1 loss.

As you can see, the k-nearest neighbor method has no explicit learning process; as a discriminative model, the specific form of its discriminant function is never written down explicitly.

The model used by k-nearest neighbors actually corresponds to a partition of the feature space; in a sense, the k-nearest neighbor model is the sample set itself.

In the k-nearest neighbor method, once the following basic elements are fixed:

  • Training set
  • Distance metric
  • Value of k
  • Classification decision rule

the class of any new input instance is uniquely determined. This is equivalent to partitioning the feature space into subspaces, with the class of every point in each subspace determined by the elements above.

In the feature space, for each training instance point $x_i$, the region consisting of all points closer to $x_i$ than to any other training instance point is called a cell. Each training instance point has a cell, and the cells of all training instances together partition the feature space.

K-nearest neighbors illustrates the discriminative way of thinking well: it does not produce a probability distribution function, but relies on its basic elements to "capture" the density characteristics of the training sample as well as possible. The reason input samples are assigned to different classes lies, in essence, in the non-uniformity of the probability density: in certain regions, the density of one class in the training sample is higher than that of the other classes.

Let us now discuss the basic elements of the k-nearest neighbor model in more detail.

1. Choice of K

The choice of K has a significant effect on the results of k-nearest neighbors.

  • Choosing a smaller value of K amounts to making predictions with a smaller neighborhood of training instances
    • The approximation error of "learning" decreases: only training instances close to the input instance (small Lp distance, i.e., similar) influence the prediction
    • The drawback is that the estimation error of "learning" increases: the prediction becomes very sensitive to the neighboring instance points. If a neighbor happens to be a noise point, the prediction will be wrong; the model's fault tolerance is worse
    • In other words, decreasing k makes the overall model more complex (the search range is small, so more storage cells are needed to cover the space) and prone to overfitting
  • Choosing a larger value of K amounts to making predictions with a larger neighborhood of training instances
    • The advantage is that the estimation error of learning decreases, i.e., the prediction is less susceptible to fluctuations of individual neighboring points
    • The drawback is that the approximation error of learning increases: training instances far from the input instance (dissimilar ones) also affect the prediction, introducing errors
    • Overall, increasing K makes the model simpler

The distinction between approximation error and estimation error rests on the assumption that noise or disturbance at points near the query affects the prediction more than noise at points far away.

In practice, K is usually chosen to be a relatively small value; in essence we assume that the correlation with nearby points is relatively strong, while the correlation with distant points is relatively weak.

In a real project, the appropriate K is related to the probability density distribution of the samples, and there is no closed-form formula that tells us which K to pick. The good news is that the search space for K is not large: we can usually loop over all candidate values of K with cross-validation, repeating the experiment several times with different randomly chosen train and validation sets, and pick the K that gives the best fit and generalization on both.
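
A minimal sketch of such a search (using scikit-learn's KNeighborsClassifier, cross_val_score, and the built-in iris data purely as an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# The search space for K is small, so an exhaustive loop with cross-validation
# is affordable: evaluate every candidate K and keep the one with the best
# mean validation accuracy.
scores = {}
for k in range(1, 21):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```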

Theoretically:

As both k and the number of instances n tend to infinity in such a way that k/n → 0, the error rate on the data set approaches the theoretical minimum (the Bayes error rate).

2. Distance metric

The distance between two instance points in feature space is a quantitative measure of how similar the two instances are.

The feature space of the k-nearest neighbor model is generally the n-dimensional real vector space $\mathbb{R}^n$. There are many ways to compute the distance between two vectors; this is not unique to the KNN algorithm, as vector distance is a basic mathematical concept.

Let the feature space $\mathcal{X}$ be the n-dimensional real vector space $\mathbb{R}^n$, and let $x_i, x_j \in \mathcal{X}$ with

$$x_i = (x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(n)})^T, \qquad x_j = (x_j^{(1)}, x_j^{(2)}, \dots, x_j^{(n)})^T$$

The $L_p$ distance between $x_i$ and $x_j$ is defined as:

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{\frac{1}{p}}, \quad p \ge 1$$

1) p = 1: Manhattan distance: $L_1(x_i, x_j) = \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|$

2) p = 2: Euclidean distance: $L_2(x_i, x_j) = \left( \sum_{l=1}^{n} |x_i^{(l)} - x_j^{(l)}|^2 \right)^{\frac{1}{2}}$

3) p = ∞: the maximum of the coordinate-wise distances: $L_\infty(x_i, x_j) = \max_{l} |x_i^{(l)} - x_j^{(l)}|$

The figure below shows, for different values of p, the set of points in two-dimensional space whose Lp distance from the origin equals 1 (Lp = 1).

This figure illustrates well that different distance metrics lead KNN to select different "close" points, which in turn affects the subsequent recognition and classification results.

Here is an example of how different distance metrics determine different nearest neighbors. Given three points in two-dimensional space, $x_1 = (1, 1)^T$, $x_2 = (5, 1)^T$, $x_3 = (4, 4)^T$, we compare, for different values of p, the Lp distances from $x_1$ to $x_2$ and to $x_3$:

p    Lp(x1, x2)    Lp(x1, x3)
1    4             6
2    4             4.24
3    4             3.78
4    4             3.57

As the table shows, when p = 1 or p = 2, x2 is the nearest neighbor of x1; when p >= 3, x3 is the nearest neighbor of x1.
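
The table can be reproduced numerically; a small sketch in Python using the three points above:

```python
import numpy as np

def lp_distance(a, b, p):
    """L_p (Minkowski) distance between two vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

x1, x2, x3 = np.array([1, 1]), np.array([5, 1]), np.array([4, 4])
for p in (1, 2, 3, 4):
    print(p, lp_distance(x1, x2, p), round(lp_distance(x1, x3, p), 2))
# p=1: 4, 6    p=2: 4, 4.24    p=3: 4, 3.78    p=4: 4, 3.57
```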

3. Classification decision rule

The classification decision rule in the k-nearest neighbor method is usually majority vote: the class of the input instance is decided by the majority of the labels of its k nearest training instances. When K = 1, the algorithm reduces to nearest neighbor search.

The majority voting rule is essentially an optimization technique, and it is consistent with the principle of maximum likelihood estimation.

We assume the classification loss function is the 0-1 loss:

$$L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$$

Then the probability of misclassification is:

$$P(Y \neq f(X)) = 1 - P(Y = f(X))$$

For a given instance $x \in \mathcal{X}$, its k nearest training instances form the neighborhood set $N_k(x)$. If the class assigned to the region covered by $N_k(x)$ is $c_j$, then the misclassification rate is:

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \neq c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

To minimize the misclassification rate, i.e., to minimize the empirical risk, we must maximize:

$$\sum_{x_i \in N_k(x)} I(y_i = c_j)$$

that is, maximize the model's accuracy (fit) on the training samples in the neighborhood. So the majority voting rule is equivalent to empirical risk minimization.
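
The majority vote itself is trivial to implement; here is a tiny sketch (the function name is illustrative), assuming the labels of the k nearest neighbors have already been collected:

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most common label among the k nearest neighbors.
    Ties are broken arbitrarily by Counter's ordering."""
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote(["cat", "dog", "cat"]))  # -> "cat"
```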

0x3: Learning strategy

KNN's learning strategy is simple: find the K nearest neighbor instance points in the training set and take a vote, choosing the class held by the majority, i.e., empirical risk minimization. This also reflects the core idea of maximum likelihood estimation.

0x4: Learning algorithm

The KNN strategy is very intuitive and easy to understand. The concrete engineering problem an algorithm must solve is how to select the k neighbor points in a high-dimensional space: when the dimensionality is high and the number of samples is large, brute-force search without an efficient search algorithm is very time-consuming.

The way to solve this problem is to use specially constructed storage structures together with special search methods, which are the focus of this section.

The first data structure to be presented is a tree. By applying the idea of partitioning, it reduces the original O(N) search complexity to O(log N). Let us discuss it now.

1. KD Tree (kd tree) Structure

A KD tree is a tree data structure that stores instance points in K-dimensional space so that they can be retrieved quickly. A kd tree is a binary tree representing a partition of the K-dimensional space: constructing a KD tree amounts to repeatedly splitting the K-dimensional space with hyperplanes perpendicular to the coordinate axes, forming a sequence of K-dimensional hyperrectangular regions. Each node of the KD tree corresponds to one K-dimensional hyperrectangular region.

Here K is the number of attributes of the data set.

The figure below shows a small example with k = 2:

The tree does not necessarily grow to the same depth for every instance; each training instance corresponds to a node of the tree, and at most half of the nodes are leaf nodes.

2. Constructing a KD Tree

Input: a k-dimensional data set $T = \{x_1, x_2, \dots, x_N\}$ (N samples), where each sample is k-dimensional, $x_i = (x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(k)})^T$

  • Construct the root node, corresponding to the hyperrectangular region of K-dimensional space that contains all instance points.
  • Choose $x^{(1)}$ as the splitting axis and take the median of the $x^{(1)}$ coordinates of all instances in T as the cut point. Split the root node's hyperrectangular region into two subregions by the hyperplane passing through the cut point and perpendicular to axis $x^{(1)}$. This produces left and right child nodes of depth 1:
    • the left child corresponds to the subregion whose $x^{(1)}$ coordinates are smaller than the cut point
    • the right child corresponds to the subregion whose $x^{(1)}$ coordinates are larger than the cut point
  • Repeat the splitting process. For a node of depth j, choose $x^{(l)}$ as the splitting axis, where $l = (j \bmod k) + 1$ (cycling through the dimensions for successive binary splits), and take the median of the $x^{(l)}$ coordinates of all instances in the node's region as the cut point. Split the node's hyperrectangular region into two subregions by the hyperplane passing through the cut point and perpendicular to axis $x^{(l)}$. This produces left and right child nodes of depth j + 1:
    • the left child corresponds to the subregion whose $x^{(l)}$ coordinates are smaller than the cut point;
    • the right child corresponds to the subregion whose $x^{(l)}$ coordinates are larger than the cut point;
    • the instance that falls exactly on the splitting hyperplane is stored at the node itself
  • Stop when a subregion contains no more instances (i.e., every instance point has been assigned to a subspace), which completes the region partition of the KD tree; the left and right sub-rectangles produced by the last cut points are the leaf nodes

Let us illustrate the kd-tree construction process with an example. Given a two-dimensional data set $T = \{(2,3)^T, (5,4)^T, (9,6)^T, (4,7)^T, (8,1)^T, (7,2)^T\}$:

  • The root node corresponds to the rectangle containing the whole data set T
  • Choose the $x^{(1)}$ axis. The median of the $x^{(1)}$ coordinates of the six data points is 7, so the plane $x^{(1)} = 7$ splits the space into left and right sub-rectangles (child nodes)
  • Next, the median of the second dimension in the left rectangle is 4, so $x^{(2)} = 4$ subdivides it into two sub-rectangles; the right rectangle is divided into two sub-rectangles by $x^{(2)} = 6$
  • Splitting continues until every instance point lies on some splitting hyperplane and no cut-out hyperrectangle contains any further instance points; the left and right sub-rectangles produced by the last cut points are the leaf nodes
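
Below is a minimal sketch (the Node structure and function name are illustrative, not from the original article) of the median-split construction just described, applied to the example data set; it reproduces the splits at x^(1) = 7, x^(2) = 4, and x^(2) = 6:

```python
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])

def build_kdtree(points, depth=0):
    """Recursively build a kd tree: at depth j the split axis is j mod k,
    the instance at the median of that axis is stored at the node, and the
    remaining points go to the left/right subtrees."""
    if not points:
        return None
    k = len(points[0])
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2          # median index; reproduces the split at x = 7 below
    return Node(point=points[mid],
                axis=axis,
                left=build_kdtree(points[:mid], depth + 1),
                right=build_kdtree(points[mid + 1:], depth + 1))

# the 2-D example data set from the walkthrough above
T = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
root = build_kdtree(T)
print(root.point)        # (7, 2): root splits on x^(1) = 7
print(root.left.point)   # (5, 4): left subtree splits on x^(2) = 4
print(root.right.point)  # (9, 6): right subtree splits on x^(2) = 6
```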

3. Updating a KD Tree

Compared with most other machine learning methods, one advantage of instance-based learning is that new instances can be added to the training set at any time.

When new data arrives, we determine which leaf node the new data point belongs to and find that leaf's hyperrectangle. If the hyperrectangle is empty, the new data point is simply placed there; otherwise the hyperrectangle is split, and the split is made along its longest edge to keep the regions as square as possible.

This simple heuristic does not guarantee that the tree will remain balanced after a series of points has been added, nor that the hyperrectangles will stay well shaped for nearest neighbor search. Rebuilding the tree from scratch from time to time can be a good policy, for example whenever the depth of the tree reaches twice its optimal value.

4. Searching a KD Tree

Having built the KD tree, let us discuss how to search it efficiently for a nearest neighbor:

Input: a kd tree built from the training set; a target point x
Output: the nearest neighbor of x

  • Find the leaf node of the KD tree that contains the target point x: starting from the root node, recursively descend the KD tree; if x's coordinate on the current splitting dimension is smaller than the cut point's coordinate, move to the left child node, otherwise move to the right child node, until the node reached is a leaf
  • The smallest hyperrectangular region containing the target point corresponds to this leaf node. Take the instance stored at the leaf as the "current nearest point". Note that this is only provisional: the instance at this leaf is not necessarily the true nearest neighbor. In theory, the true nearest neighbor must lie inside the hypersphere centered at the target point and passing through the current nearest point (the local optimum principle); the next goal is therefore to backtrack and check whether, inside that hypersphere, some other region contains an instance point closer than the "current nearest point"
  • Recursively backtrack upwards from this leaf node, and at each (parent) node do the following; repeat until the backtracking reaches the root, at which point the search ends and the last stored "current nearest point" is the nearest neighbor of x:
    • If the instance point stored at this node is closer to the target point than the current nearest point, it becomes the new "current nearest point".
    • If the hyperrectangular region of the node's other child intersects the hypersphere (there may be another local optimum there), search that region for an instance point closer to the target. If such a point exists, it becomes the new current nearest point, and the algorithm then moves up to the parent node
    • If the hyperrectangular region of the node's other child does not intersect the hypersphere, there can be no better candidate in that branch, so backtracking simply continues upward
  • During the backtracking we keep looking for points closer to the target, and we terminate once no closer node can exist. Such a search is restricted to a local region of the space, so efficiency is greatly improved (this is the core idea of binary trees: divide and conquer)
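
In practice you rarely implement this search by hand; as a usage example, SciPy's scipy.spatial.KDTree provides a ready-made kd tree whose query method performs this kind of nearest-neighbor search (the data here is random and purely illustrative):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
train = rng.random((10000, 3))        # training instances in a 3-D feature space

tree = KDTree(train)                  # build the kd tree once
dist, idx = tree.query([0.5, 0.5, 0.5], k=3)   # find the 3 nearest neighbors
print(idx, dist)
```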

The figure below illustrates the kd-tree search process. The root node is A, its descendant nodes are B, C, D, E, F, G; the target instance point is S, and we want to find S's nearest neighbor.

  • First search forward down the KD tree and locate the lower-right rectangular region, then begin the backtracking process
  • The instance point of that rectangular region is D, so D is taken provisionally as the approximate nearest neighbor. In theory, however, the true nearest neighbor must lie inside the circle centered at S and passing through D (local optimum), so backtracking must continue
  • Return to D's parent node B and try to search the region of B's other child F; but since F's region does not intersect the circle, it cannot contain the nearest neighbor
  • Continue backtracking to the parent node A and try to search the region of A's other child C. C's region intersects the circle, and the instance point E in that region lies inside the circle; since E is closer to S than D is, E becomes the new current nearest point
  • Finally, we obtain point E as the nearest neighbor of S

5. Challenges and Limitations of KD Trees

As we have seen, the KD tree is a data structure that can be used effectively to find nearest neighbors, but it is not perfect. When faced with unevenly distributed data, it runs into some fundamental conflicts and challenges:

  • It needs both a perfectly balanced tree structure and regions that are approximately square
  • More importantly, rectangles, and even squares, are not the best shapes to use here, because they have corners. Searching for the neighbors of instance points near a region boundary is awkward precisely because a rectangle's corners are not "smooth"

A balanced structure here means that the two sides of each split hold roughly the same number of points, which maximizes the O(log N) benefit. But if the data points are distributed very unevenly, splitting at the median can produce many successive splits in the same direction, yielding long, thin hyperrectangles. A better solution is to split at the mean instead of the median; KD trees generated this way tend toward squarer regions.

Even mean-value splitting, however, cannot completely avoid the KD tree's native problems, so researchers have proposed an improvement: use hyperspheres instead of hyperrectangles as the bounding regions.

6. Ball Trees

The idea behind the KD tree is very elegant, but it has some problems: skewed, clustered data puts the desire for a balanced tree in conflict with the desire for square regions, and a rectangle, or even a square, is not the ideal shape here because of its corners. To solve these problems, the ball tree was introduced: it partitions the space with hyperspheres rather than hyperrectangles.

A ball tree covers the training sample points with K-dimensional hyperspheres.

Figure (a) below shows a two-dimensional plane containing 16 observation instances; figure (b) is the corresponding ball tree, where the number at each node indicates the number of observation points it contains.

Each node in the tree corresponds to a ball, and the node's number indicates how many observation points fall in that node's region. This is not necessarily the same as the set of points enclosed by the ball in the figure, because balls may overlap while an observation point can belong to only one region. An actual ball-tree node stores the ball's center and radius; a leaf node stores the observation points it contains.

1) Ball tree construction

Building a ball tree follows largely the same procedure as building a KD tree; the difference lies in how the points of the current hypersphere are split at each step (a sketch of one splitting step follows the list below):

  • Select the observation point i1 farthest from the center of the current ball, then the observation point i2 farthest from i1
  • Assign every observation point in the ball to whichever of these two cluster centers it is closer to
  • Then compute, for each cluster, a center and the minimum radius that encloses all the observation points assigned to it; this splits the parent ball into two new child hyperspheres
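
Here is a rough sketch of one such splitting step. Two simplifying assumptions are made: the current ball's center is approximated by the centroid of its points, and each child ball is centered at the mean of its points rather than at the true minimum-radius center:

```python
import numpy as np

def split_ball(points):
    """One splitting step of ball-tree construction (sketch):
    pick i1 = the point farthest from the ball's center (approximated here by
    the centroid), then i2 = the point farthest from i1, and assign every
    point to whichever of the two it is closer to."""
    center = points.mean(axis=0)
    i1 = points[np.argmax(np.linalg.norm(points - center, axis=1))]
    i2 = points[np.argmax(np.linalg.norm(points - i1, axis=1))]
    closer_to_i1 = np.linalg.norm(points - i1, axis=1) <= np.linalg.norm(points - i2, axis=1)
    left, right = points[closer_to_i1], points[~closer_to_i1]

    def ball(pts):
        # child ball: centered at the mean of its points (an approximation of the
        # minimum-radius enclosing ball), radius = distance to the farthest member
        c = pts.mean(axis=0)
        return c, np.linalg.norm(pts - c, axis=1).max()

    return (left, ball(left)), (right, ball(right))

pts = np.array([[0.0, 0.0], [1.0, 0.5], [5.0, 5.0], [6.0, 5.5]])
(left_pts, left_ball), (right_pts, right_ball) = split_ball(pts)
print(left_pts, left_ball)
print(right_pts, right_ball)
```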

2) Ball tree search

When searching with a ball tree:

  • First descend from top to bottom to find the leaf node that contains the target, and find the observation point in that node closest to the target. This distance is an upper bound on the nearest neighbor distance.
  • Then check whether any sibling node could contain an observation point closer than this upper bound. As with KD trees, the test is whether the sibling's hypersphere intersects the hypersphere around the target defined by the current bound
    • If they do not intersect, the sibling cannot contain a closer neighbor point
    • If they do intersect, the search continues into the sibling
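
As a usage example, scikit-learn's BallTree can be built and queried in this way (the data here is random and purely illustrative):

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.random((1000, 100))          # 1000 points in a 100-dimensional space

tree = BallTree(X, leaf_size=40)     # build the ball tree once
dist, ind = tree.query(X[:1], k=5)   # query the 5 nearest neighbors of the first point
print(ind, dist)
```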

Relevant Link:

http://blog.csdn.net/pipisorry/article/details/53156836


Origin www.cnblogs.com/w4ctech/p/11826681.html