Ten Classic Machine Learning Algorithms: KNN (Nearest Neighbor) Study Notes


I have been studying machine learning recently and have benefited a lot from reading other people's blogs, so I decided to try summarizing what I have learned myself. On the one hand it deepens my own understanding of the algorithms; on the other hand, I hope it helps beginners, so we can encourage each other...

1. K-nearest neighbor algorithm principle

The k-nearest neighbor (kNN, k-NearestNeighbor) classification algorithm works just as its name suggests: find the k nearest neighbors (samples), and take the category that appears most frequently among those k samples as the predicted category. What? Still sounds awkward? A description without a picture explains nothing, so here is an example; one diagram will make everything clear, as shown below:
[Figure: six training samples plotted by attendance rate (x-axis) and homework quality (y-axis); red points are category A (excellent), green points are category D (failed)]

Our goal is to predict a student's grade in math class...
First, a few basic concepts: each point in the figure represents a sample (here, a student), the horizontal and vertical coordinates represent the features (attendance rate and homework quality), and the different colors represent the categories (red represents A, excellent; green represents D, failed).

Look at the point (10, 20). It means: in math class, this student's attendance rate was 10% and his homework quality was 20 points, which ultimately earned him a D (failed) on the final exam. In the same way, all six points represent the day-to-day performance and final results of six previous students; these are called the training samples...

Now it's time to achieve our prediction goal. Imagine the semester is almost over and Zhang San is about to take the exam. He wants to know how well he is likely to do. From the math teacher he learns that his attendance rate is 85% and his homework quality is 90 points. So how do we make the prediction?

Zhang San can be regarded as the point (85, 90), also called the test sample. First, we calculate the distance between Zhang San and each of the 6 other students (the training samples). Point-to-point distance should be familiar from junior high school; the Euclidean distance is generally used.
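
For instance, the Euclidean distance between two feature vectors can be computed with NumPy. A minimal sketch (the helper name euclidean_distance is mine; the comparison sample is the (10, 20) student from the figure):

    import numpy as np

    def euclidean_distance(a, b):
        # straight-line distance between two feature vectors
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        return np.sqrt(np.sum((a - b) ** 2))

    # Zhang San at (85, 90) vs. the D student at (10, 20)
    print(euclidean_distance([85, 90], [10, 20]))  # about 102.59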

Then select the k nearest neighbors. For example, if we choose k = 3, we check which categories the three nearest samples belong to. In this example all three happen to be A, so we predict that Zhang San's final math grade will probably be A (excellent). If Li Si now wants a prediction and his 3 nearest neighbors are two Ds and one A, then Li Si's final math grade is predicted to be D. This is exactly what I said at the beginning: take the category with the highest frequency among the k nearest samples as the predicted category...
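
As a quick illustration, the majority vote among the k nearest labels can be done with collections.Counter (a minimal sketch; the labels are Li Si's hypothetical neighbors from the example above):

    from collections import Counter

    k_nearest_labels = ['D', 'D', 'A']  # Li Si's 3 nearest neighbors
    prediction = Counter(k_nearest_labels).most_common(1)[0][0]
    print(prediction)  # 'D'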

The calculation steps are summarized as follows:

1) Compute distances: given a test object, compute its distance to every object in the training set.
2) Find neighbors: take the k training objects with the smallest distances as the test object's nearest neighbors.
3) Classify: assign the test object to the majority category among those k neighbors.

Well, after walking through the process above, you should have a basic grasp of the idea behind the KNN algorithm.
That's it for the principle...

2. Advantages and disadvantages of K-nearest neighbors

Advantages of the KNN algorithm:

1) Simple and effective.

2) Low retraining cost (category systems and training sets change frequently in Web and e-commerce applications, and KNN needs no retraining when they do).

3) Computation time and space scale linearly with the size of the training set (which is acceptable when the set is not too large).

4) Since KNN relies mainly on a limited number of nearby samples rather than on discriminating class regions to determine the category, it is better suited than other methods to sample sets whose class regions cross or overlap.

5) The algorithm is best suited to automatic classification problems with relatively large sample sizes; classes with few samples are more easily misclassified.

Disadvantages of the KNN algorithm:

1) KNN is a lazy learning method (it does essentially no learning at training time), so eager learning algorithms can be much faster at prediction.

2) Category scores are not normalized (unlike probabilistic scores).

3) The output is not very interpretable; a decision tree, for example, is far more interpretable.

4) The main drawback in classification appears when the samples are imbalanced: if one class has a very large sample size while the others are small, the large-capacity class may dominate the k nearest neighbors of a new sample.

Because the algorithm considers only the "nearest" neighbor samples, a very numerous class either has no samples close to the target sample or has samples crowding very close to it; either way, sheer quantity alone should not decide the outcome. This can be improved by weighting the neighbors: a neighbor at a smaller distance from the sample gets a larger weight.
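
A minimal sketch of such distance-weighted voting (the function name and the small eps guarding against division by zero are my additions):

    import numpy as np

    def weighted_knn_predict(distances, labels, k=3, eps=1e-8):
        # vote among the k nearest neighbors, weighting each by 1/distance
        order = np.argsort(distances)[:k]
        votes = {}
        for i in order:
            w = 1.0 / (distances[i] + eps)  # closer neighbors count more
            votes[labels[i]] = votes.get(labels[i], 0.0) + w
        return max(votes, key=votes.get)

    # two far-away Ds vs. one very close A: the A now wins the vote
    print(weighted_knn_predict([40.0, 55.0, 2.0], ['D', 'D', 'A']))  # 'A'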

5) Large amount of computation: classifying each test sample requires computing its distance to all known samples before its k nearest neighbors can be found. A common remedy is to edit the known sample points in advance, removing samples that contribute little to classification.
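
One classic editing scheme of this kind is Hart's condensed nearest neighbor rule, shown here purely to illustrate the idea (it is not from the original post): keep only the subset of samples needed for 1-NN to classify the whole training set correctly.

    import numpy as np

    def condensed_nearest_neighbor(X, y):
        # Hart's rule: grow a store of samples until 1-NN over the store
        # classifies every training sample correctly
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        keep = [0]  # seed the store with the first sample
        changed = True
        while changed:
            changed = False
            for i in range(len(X)):
                if i in keep:
                    continue
                d = np.sqrt(((X[keep] - X[i]) ** 2).sum(axis=1))
                nearest = keep[int(np.argmin(d))]
                if y[nearest] != y[i]:
                    keep.append(i)  # misclassified: the store needs this sample
                    changed = True
        return X[keep], y[keep]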

3. Python implementation of the K-nearest neighbor algorithm

Friendly reminder: this code is based on Python 2.7, and the NumPy library (a powerful scientific computing package in common use) needs to be installed in advance...

3.1 First, let's outline the implementation steps:

1) Compute the distance between the current point and every point in the dataset of known categories.
2) Sort the distances in increasing order.
3) Take the k points with the smallest distances to the current point.
4) Count the frequency of the categories among those k points.
5) Return the category with the highest frequency among the k points as the predicted classification of the current point.

3.2 Implementation: we first create a file named knn.py; the overall implementation is as follows.

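A minimal sketch of knn.py following the five steps above (the names createDataSet and classify0, and all sample values except the (10, 20) student, are my assumptions in the common Machine Learning in Action style, not necessarily the original author's exact code; it runs under both Python 2.7 and 3):

    # knn.py
    import operator
    import numpy as np

    def createDataSet():
        # toy training set: 6 students, features = (attendance rate, homework quality)
        group = np.array([[10, 20], [15, 25], [20, 15],
                          [80, 85], [85, 95], [90, 90]])
        labels = ['D', 'D', 'D', 'A', 'A', 'A']
        return group, labels

    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        # 1) compute the Euclidean distance to every training point
        diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
        distances = np.sqrt((diffMat ** 2).sum(axis=1))
        # 2) sort by increasing distance
        sortedDistIndices = distances.argsort()
        # 3)-4) count the category frequencies among the k nearest points
        classCount = {}
        for i in range(k):
            voteLabel = labels[sortedDistIndices[i]]
            classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
        # 5) return the most frequent category
        sortedClassCount = sorted(classCount.items(),
                                  key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

Running it for Zhang San from the example above:

    group, labels = createDataSet()
    print(classify0([85, 90], group, labels, 3))  # 'A'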
