Interview prep, part 3: the KNN (k-nearest neighbors) algorithm

This week's algorithm is a simple one. The simplest algorithm I remember learning back then was the k-nearest neighbors algorithm. How simple is it? Arguably one of the simplest algorithms I have ever learned. And yet its role is anything but small.

Getting to know KNN for the first time

Nearest neighbor classification is known as a "lazy learning" algorithm. The principle is very simple: assign an unlabeled case to the class of the most similar labeled cases. Simple as the idea is, nearest neighbor classification is surprisingly powerful (I can't say exactly how powerful; simple and powerful, anyway). As for applications, the one closest to everyday life is predicting whether a person will like a movie recommended to them.

Features

It is simple and effective, the training phase is fast, and it makes no assumptions about the data distribution. But because it builds no model, it has limited ability to discover relationships between features (this may be the flip side of its simplicity); the classification phase is slow and memory-hungry; and features (especially nominal ones) and missing data require additional preprocessing.

Principle

The KNN algorithm treats features as coordinates in a multi-dimensional feature space. Given a case whose class (the dependent variable) is unknown, KNN uses the case's feature values to place it as a point in that space, then finds the K training points closest to it (you can set K yourself when running the algorithm) and looks at those neighbors: whichever class holds the most of them is the class assigned to the new point. In a word, the principle is that the minority obeys the majority. (As for what happens when the neighbors split evenly across classes, say 9 neighbors with 3 in A, 3 in B, and 3 in C, I honestly don't know what to do in that case.) Now, about the distance itself: there are many formulas and methods for computing distance, so which one should we use? No need to agonize over it: traditionally, the KNN algorithm uses the Euclidean distance.

Well, the above is all summed up in my own words, and I think it is reasonably accurate. But fine, let's post the official definition:

The so-called k-nearest neighbor algorithm: given a training data set, for a new input instance, find the K instances in the training set that are closest to it (the K neighbors mentioned above); whichever class holds the majority of those K instances is the class the input instance is assigned to. (Look at that: precise and concise.)

Note: the neighbors KNN selects are objects that have already been correctly classified. This is easy to understand: if the neighbors carried no labels, "the minority obeys the majority" would have nothing to vote on.

The KNN classifier needs no training on the training set, so its training time complexity is essentially zero, while its computational cost at classification time is proportional to the size of the training set: if the training set contains n documents in total, the time complexity of classifying one new document is O(n).
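To make that concrete, here is a minimal sketch of the whole procedure in Python (the function name and toy data are mine, not from any library): compute the distance to every stored case, which is exactly where the O(n) cost comes from, take the K closest, and let the majority vote.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training cases."""
    # Euclidean distance from x_new to every stored case -- this scan is the O(n) part
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest cases
    # "The minority obeys the majority"; ties are broken arbitrarily here
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two features per case, classes "A" and "B"
X_train = np.array([[1.0, 1.1], [1.2, 0.9], [6.0, 6.2], [5.8, 6.1]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))   # -> "A"
```

As for the tie question raised earlier: implementations typically just break ties arbitrarily (as the Counter above does), pick an odd K for two-class problems, or weight neighbors by distance so exact ties become unlikely.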

The distance metric is usually the Euclidean distance, and it is best to normalize each attribute value before use; this helps prevent attributes with large initial value ranges from outweighing attributes with small initial value ranges.
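A sketch of that normalization step, assuming simple min-max scaling (the helper name and numbers are mine): each feature is rescaled to [0, 1] so that an attribute measured in tens of thousands cannot drown out one measured on a 0-5 scale.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column to [0, 1]; assumes every column has some spread (max > min)."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

X = np.array([[50000.0, 1.2],    # e.g. income vs. a rating on a 0-5 scale
              [82000.0, 4.7],
              [61000.0, 3.3]])
print(min_max_normalize(X))      # both columns now live in [0, 1], so neither dominates the distance
```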

K value

The choice of K can have a significant impact on the results. In practice, K is generally set to a fairly small value, and cross-validation is usually used to select the optimal K. As the number of training instances tends to infinity and K = 1, the error rate does not exceed twice the Bayes error rate; and if K also tends to infinity, the error rate tends to the Bayes error rate. (I don't understand this sentence, but it sounds impressive. Let's leave it here; maybe when I reread this article later I will understand it.)
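For whenever that later reading happens: the sentence is essentially the classical Cover–Hart result. A rough statement from memory (so treat it as a pointer, not a proof), with R* denoting the Bayes error rate:

```latex
% As the number of training cases n -> infinity (1-nearest-neighbor case):
R^{*} \;\le\; R_{1\text{-}\mathrm{NN}} \;\le\; 2R^{*}
% and if K also grows with n while K/n -> 0, the K-NN error rate itself tends to R^{*}.
```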

The KNN algorithm can be used not only for classification but also for regression: find the k nearest neighbors of a sample and assign the average of those neighbors' attribute values to the sample to obtain its predicted value. A more useful variant gives neighbors at different distances different weights on the result, for example weights inversely proportional to distance.

The main disadvantage of the algorithm in classification shows up when the classes are unbalanced: if one class has a very large sample size while the others are small, then when a new sample comes in, the large class may simply outnumber the others among its K nearest neighbors. Since the algorithm only counts the "nearest" neighbors, a class with many samples is either not actually close to the target sample or genuinely close to it; either way, raw counts alone should not decide the outcome. This can be improved by weighting (neighbors at smaller distances from the sample get larger weights).
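A rough sketch of the inverse-distance-weighting idea, shown here for the regression case (function name and numbers are made up for illustration):

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3, eps=1e-9):
    """Predict a value as the inverse-distance-weighted average of the k nearest neighbors."""
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)        # closer neighbors get larger weights
    return np.average(y_train[nearest], weights=weights)

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.1, 1.9, 3.2, 9.8])
print(knn_regress(X_train, y_train, np.array([2.5]), k=3))   # dominated by the neighbors at 2.0 and 3.0
```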

Another disadvantage of this method is the amount of computation: for each item to be classified, the distance to every known sample must be computed in order to find its K nearest neighbors. The usual remedy is to prune the known sample points in advance, removing samples that contribute little to the classification. The algorithm is better suited to automatic classification of class domains with large sample sizes; class domains with small sample sizes are more prone to misclassification.
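As a side note, one common practical way to cut the per-query cost, different from the sample-pruning just mentioned, is to build a tree index over the stored samples. A sketch using scikit-learn's KD-tree option (the toy data here is made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))              # 10k stored samples, 3 features
y_train = (X_train[:, 0] > 0).astype(int)           # toy labels

clf = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")   # or "ball_tree" / "brute"
clf.fit(X_train, y_train)        # "training" here mostly just builds the index over the stored samples
print(clf.predict([[0.1, -0.2, 0.3]]))   # each query walks the tree instead of scanning all 10k points
```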

(The paragraphs above about regression, imbalance, and the computation cost are copied. To be honest, I didn't know KNN could do regression; that's eye-opening. That said, it isn't used much for regression, since there are dedicated tools for that, all kinds of regression algorithms and fancier methods that should outperform KNN regression, so the regression use is just something to be aware of.)

Some tricks for picking K

In practice, the choice of K depends on how difficult the concept to be learned is and on the number of cases in the training data. Usually K is between 3 and 10. A common practice is to set K equal to the square root of the number of cases in the training set. But that heuristic is not necessarily better than cross-validation, so yes: as long as the amount of data is not huge and your computer can handle it, you should obediently use cross-validation.
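A sketch of "obediently using cross-validation" with scikit-learn, next to the square-root rule of thumb (the dataset is just the built-in iris data, picked for convenience):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
print("sqrt(n) rule of thumb suggests K ~", int(np.sqrt(len(X))))   # ~12 for 150 cases

# Try the "usually 3~10" range and keep whichever K scores best under 5-fold cross-validation
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(3, 11)}
best_k = max(scores, key=scores.get)
print("best K by cross-validation:", best_k, "mean accuracy:", round(scores[best_k], 3))
```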
