The difference between k-means and k-NN

k-means and k-NN are two algorithms that are often confused. Even practitioners who have done machine learning for many years sometimes mix up or forget the distinction between them. This article walks through the difference between k-means and k-NN.

The fundamental difference between the two algorithms is:

k-means is unsupervised learning, while k-NN is supervised learning; k-means solves clustering problems, while k-NN solves classification or regression problems.

The k-means algorithm divides a data set into clusters so that each cluster is homogeneous and the points within a cluster are close to each other.

The k-NN algorithm classifies an unlabeled observation based on its k nearest neighbors, where k can be any number.

Training a k-means model requires iterative computation (repeatedly finding new centroids), whereas k-NN requires no such training.

The k in k-means is the number of cluster centers, while the k in k-NN is the number of training samples closest to the test sample that are taken into account.

In the comparison below, each attribute is given first for k-means and then for k-NN.

Learning paradigm

k-means: unsupervised learning algorithm.

k-NN: supervised learning algorithm.

Year proposed

k-means: 1967

k-NN: 1968

Applicable problems

k-means: clustering problems.

k-NN: classification or regression problems.

Main idea

k-means: "Birds of a feather flock together" (similar samples gather into the same cluster).

k-NN: "He who stays near vermilion turns red; he who stays near ink turns black" (a sample takes on the label of its nearest neighbors).

Algorithm principle

k-means: a center-based clustering method. Through iteration, the samples are divided into k classes so that each sample is closest to the center (mean) of the class it belongs to; the resulting k classes form a partition of the space.

k-NN: simple and intuitive. Given a training data set and a new input instance, find the k training instances nearest to that instance; the input instance is assigned to the class to which the majority of those k instances belong.

Algorithm process

k-means: clustering is an iterative procedure, and each iteration has two steps. First, take k class centers and assign each sample to the class whose center is nearest, which yields a clustering result; then update each class center to the mean of the samples assigned to it. Repeat these two steps until convergence. A minimal sketch follows.
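As a concrete illustration, here is a minimal NumPy sketch of this assign-then-update loop; the random initialization, the convergence test and the function name are illustrative choices, not a reference implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: repeat assignment and centroid update until convergence."""
    rng = np.random.default_rng(seed)
    # Initialize the k centers with k distinct samples picked at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each sample joins the class with the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of the samples assigned to it.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                          # centers stopped moving: converged
        centers = new_centers
    return centers, labels

# Example use on toy 2-D data:
# centers, labels = kmeans(np.random.rand(200, 2), k=3)
```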

k-NN:

(1) When a new test sample arrives, compute its distance to every data point in the training set (distance measure);

(2) Select the k training samples with the smallest distance to the test sample (choice of k value);

(3) Decide the class of the new sample from the classes of those k training samples, usually taking the label that occurs most often among them (decision rule).
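A correspondingly minimal sketch of the three steps above, assuming Euclidean distance and majority voting; the function and variable names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one new sample by majority vote among its k nearest neighbors."""
    # (1) Distance measure: Euclidean distance to every training point.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # (2) Choice of k: indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # (3) Decision rule: the most frequent label among those k neighbors.
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Example use:
# label = knn_predict(X_train, np.asarray(y_train), np.array([0.1, 0.2]), k=3)
```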

Algorithm diagram (the figures from the original post are not reproduced here)

Meaning of k

k-means: k is the number of classes.

k-NN: k is the number of neighboring data points used in the computation.

Choice of k

k-means: k is the number of classes and is set manually. One can try clustering with different values of k, evaluate the quality of each resulting clustering, and infer the optimal k. Clustering quality can be measured by the average diameter of the clusters: as the number of classes decreases the average diameter generally increases, and once the number of classes grows beyond a certain value the average diameter stops changing; that value is the optimal k. In practice, binary search can be used to find this value quickly. An illustrative sketch is given below.
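As one way to put this into practice, the sketch below scans candidate values of k and records the average cluster diameter for each, taking the diameter of a cluster to be its largest pairwise distance; it uses scikit-learn's KMeans for the clustering itself, and the helper name average_diameter is an illustrative assumption:

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def average_diameter(X, labels):
    """Mean over clusters of the largest pairwise distance within a cluster."""
    diameters = []
    for c in np.unique(labels):
        pts = X[labels == c]
        if len(pts) < 2:
            diameters.append(0.0)
            continue
        diameters.append(max(np.linalg.norm(a - b) for a, b in combinations(pts, 2)))
    return float(np.mean(diameters))

X = np.random.rand(200, 2)                 # toy data, purely for illustration
for k in range(2, 8):                      # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(average_diameter(X, labels), 3))
```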

k-NN: the choice of the value of k can have a significant impact on the results of k-NN.

· If a smaller value of k is chosen, prediction is made with training instances in a smaller neighborhood. The approximation error of "learning" is reduced, since only the training instances closest (most similar) to the input instance affect the prediction. The drawback is that the estimation error of "learning" increases: the prediction becomes very sensitive to the neighboring instance points, and if a neighboring instance happens to be noise, the prediction will be wrong. In other words, a smaller k means a more complex overall model that is prone to overfitting.

· If a larger value of k is chosen, prediction is made with training instances in a larger neighborhood. The advantage is that the estimation error of learning is reduced; the drawback is that the approximation error of learning increases, because training instances far from (dissimilar to) the input instance now also influence the prediction and can make it wrong. A larger k means a simpler overall model.

· If k = n, then whatever the input instance is, it is simply predicted to belong to the class with the most training instances. Such a model is far too simple and completely ignores the large amount of useful information in the training instances, so it is not advisable.

· In applications, k is generally taken to be a relatively small value, and cross-validation is usually used to select the optimal k; a sketch is given after this list.
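One common way to carry out this cross-validated selection, sketched here with scikit-learn's KNeighborsClassifier and cross_val_score on the iris data mentioned later in the post; the candidate range of k is an arbitrary choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best one.
scores = {}
for k in range(1, 16):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "mean accuracy:", round(scores[best_k], 3))
```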

k and the result

k-means: even after k is fixed, the result may differ from run to run, because the k initial cluster centers are chosen arbitrarily from the n data objects, and this randomness has a great influence on the result, as the sketch below illustrates.
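To see this initialization sensitivity concretely, one can run k-means several times, keeping only a single random initialization per run, and compare the objective values reached; a sketch with scikit-learn's KMeans on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)                 # synthetic data for illustration

# n_init=1 keeps only one random initialization per run, so different
# seeds may converge to different local optima with different inertia.
for seed in range(5):
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    print("seed", seed, "inertia", round(km.inertia_, 4))
```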

k-NN: once the training set, the distance measure (such as Euclidean distance), the value of k and the decision rule (such as majority voting) are fixed, the class of any new input instance is uniquely determined.

Complexity

k-means: time complexity O(n·k·t), where n is the number of training instances, k is the number of clusters and t is the number of iterations.

k-NN: linear scan has time complexity O(n); the kd-tree method has time complexity O(log n).
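scikit-learn exposes both search strategies through the algorithm parameter of KNeighborsClassifier, so the trade-off can be tried directly; the data sizes and the rough timing below are illustrative only:

```python
import numpy as np
from time import perf_counter
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(20000, 3)                 # training points
y = (X[:, 0] > 0.5).astype(int)              # an arbitrary binary label
X_test = np.random.rand(2000, 3)

# Compare brute-force (linear scan) search with kd-tree search.
for method in ("brute", "kd_tree"):
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=method).fit(X, y)
    start = perf_counter()
    clf.predict(X_test)
    print(method, round(perf_counter() - start, 4), "seconds")
```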

Algorithm features

k-means: a partition-based clustering method; the number of classes k is specified in advance; the distance between samples is measured by the squared Euclidean distance, and each class is represented by its center, i.e. the mean of its samples; the objective function being optimized is the sum of the distances between the samples and the centers of the classes they belong to; the resulting classes are flat, not hierarchical; the algorithm is iterative and cannot guarantee a global optimum.

k-NN: has no explicit learning process; when implementing k-NN, the main concern is how to perform a fast k-nearest-neighbor search over the training data.

Algorithm advantages

k-means:

1. A classic algorithm for clustering problems; simple and fast;

2. Remains scalable and efficient when dealing with large data sets;

3. Works well when the clusters are approximately Gaussian;

4. Time complexity is close to linear, suitable for mining large-scale data sets.

k-NN:

1. Makes no assumptions about the input data, for example it does not assume that the data follow a normal distribution;

2. Handles classification problems, and naturally handles multi-class problems such as classifying iris flowers;

3. Simple and easy to understand, yet powerful: accuracy is very high on tasks such as handwritten-digit recognition and iris classification;

4. Can also handle regression problems, i.e. prediction;

5. Not sensitive to outliers;

6. Can be used with numerical or discrete data.

Algorithm shortcomings

k-means:

1. The number of classes k must be specified in advance;

2. Sensitive to the initial values: different initializations may lead to different results;

3. Not suitable for clusters with non-convex shapes or clusters of very different sizes;

4. Sensitive to noise and outliers;

5. A heuristic algorithm that cannot guarantee a global optimum.

k-NN:

1. High computational cost: the linear-scan method must compute the distance from the input instance to every training instance, which is very time-consuming when the training set is large; this can be improved with a kd tree or similar methods;

2. Heavily dependent on the training sample set and intolerant of errors in it: if even one or two mislabeled training points lie right next to the value to be classified, the prediction will be wrong;

3. The distance measure and the choice of k have a large impact; if k is chosen poorly, classification accuracy cannot be guaranteed.

Similarity

Both include the following step: given a point, find the point(s) nearest to it in a data set. That is, both rely on nearest-neighbor (NN) search, and the NN search is generally implemented with a kd tree.
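To make this shared step concrete, the sketch below builds kd trees with SciPy's cKDTree and uses them both to find the k nearest training points to a query (the k-NN case) and to assign samples to their nearest cluster center (the k-means assignment step); the data are random and purely for demonstration:

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 2)             # the data set to be searched
tree = cKDTree(points)                        # build the kd tree once

# k-NN-style query: the 5 training points nearest to a new sample.
query = np.array([0.5, 0.5])
dists, idx = tree.query(query, k=5)

# k-means-style query: assign every sample to its nearest cluster center.
centers = np.random.rand(3, 2)
_, assignments = cKDTree(centers).query(points, k=1)

print(idx, assignments[:10])
```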

