[Machine Learning Algorithm] 1. KNN

1. KNN

The k-Nearest Neighbors algorithm, usually abbreviated as KNN, is also called the K nearest neighbor algorithm.
It is the simplest and most basic introductory algorithm in machine learning.
Before studying any algorithm, you should first understand what kind of data suits it, so we start with the data that suits KNN.

1. Prepare data

2. Data exploration

3. The KNN principle and related mathematics
Calculate the distance between the sample to be predicted and every training sample, find the K training samples closest to it, and then apply majority voting: the predicted sample is assigned to the category that accounts for the largest share of those K nearest neighbors. The guiding idea is "he who stays near vermilion turns red, he who stays near ink turns black"; in other words, your neighbors determine your category.
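To make the voting process concrete, here is a minimal from-scratch sketch in pure NumPy (brute-force distance computation, no index tree); the function name knn_predict and the toy data are made up for illustration only:

import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=5):
    # Euclidean distance from the query point to every training sample
    distances = np.linalg.norm(train_x - query, axis=1)
    # indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # majority vote over the labels of those k neighbors
    votes = Counter(train_y[nearest])
    return votes.most_common(1)[0][0]

# toy data: two small clusters
train_x = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                    [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
train_y = np.array(['A', 'A', 'A', 'B', 'B', 'B'])
print(knn_predict(train_x, train_y, np.array([1.0, 1.0]), k=3))  # -> 'A'
print(knn_predict(train_x, train_y, np.array([5.0, 5.0]), k=3))  # -> 'B'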

4. Library implementation: sklearn.neighbors.KNeighborsClassifier()

This is the KNN classifier in the machine learning library sklearn, and I recommend simply calling it rather than writing your own. Because the principle of KNN is very simple, you certainly could implement it yourself, but if your data runs to millions or even tens of millions of samples, a hand-written version will be unsatisfactory. The sklearn implementation not only computes the nearest-neighbor distances but also optimizes indexing speed: once the training data is fed in and the model is fit, the algorithm automatically stores the training data in a tree structure (built with a kd-tree or similar algorithm) so that subsequent predictions can be indexed quickly. Building such index structures is not really the strength of an algorithm engineer; it requires a lot of low-level computer optimization and is mainly a concern for development engineers, so I do not recommend writing it by hand, just use the library directly. Moreover, KNN is an extremely simple entry-level algorithm that is often used only to test the potential of the data, so there is even less reason to re-implement it. Below is an introduction to the parameters of KNeighborsClassifier():
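For quick reference, here are the constructor parameters I find most relevant, shown with sklearn's default values (this is only a summary; see the official documentation for the full parameter list):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=5,       # K, the number of neighbors used in the vote
    weights='uniform',   # 'uniform' = each neighbor votes equally; 'distance' = closer neighbors weigh more
    algorithm='auto',    # index structure: 'auto', 'ball_tree', 'kd_tree', or 'brute'
    leaf_size=30,        # leaf size of the tree index, affects build and query speed
    p=2,                 # power of the Minkowski metric: p=2 is Euclidean, p=1 is Manhattan
    metric='minkowski',  # distance metric
    n_jobs=None,         # parallel jobs for the neighbor search; -1 uses all cores
)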

5. KNN modeling prediction

from sklearn.neighbors import KNeighborsClassifier    # the model
from sklearn.preprocessing import LabelBinarizer      # to encode the label column

# (1) Encode the labels
binarized = LabelBinarizer()
label = binarized.fit_transform(data['性别'])  # the gender column; label is a 2-D array

# (2) Build the model
k = 5   # the main parameter to set
knn = KNeighborsClassifier(n_neighbors=k)   # instantiate a KNN classifier
clf = knn.fit(data.iloc[:, :-1], label.ravel())   # flatten label to 1-D before fitting

# (3) Predict
predict = clf.predict(x)
predicted_label = binarized.inverse_transform(predict)
predicted_label
array(['female', 'male'], dtype='<U6')

6. Summary:
(1) The KNN algorithm was originally used for classification tasks, but with later refinements it can also be used for regression prediction.
(2) KNN is a lazy model: it does no learning in advance, and at prediction time it traverses the training data one by one to compute distances, so its biggest drawback is that prediction is slow.
(3) When the classes are severely imbalanced, the algorithm almost fails.
(4) The algorithm is also unsuitable when the class distributions overlap heavily.
(5) Pay attention to the label: it must be numerical and one-dimensional.
(6) Setting K: it is best to choose an odd K to avoid voting ties; when k=1 the model degenerates into the nearest-neighbor model. A common way to pick K is cross-validation, as sketched below.
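On point (6): a minimal sketch of picking K by cross-validation with sklearn's GridSearchCV. The dataset here is synthetic placeholder data, not the gender data used above:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# placeholder data: replace X, y with your own feature matrix and 1-D labels
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# try odd K values only, to avoid voting ties
param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # e.g. {'n_neighbors': 5}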

Below I use code to generate two kinds of data distributions and visualize them:
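The original figure is not reproduced here, but the following sketch shows the kind of code meant: it uses make_blobs with made-up cluster centers to produce a heavily overlapping distribution (left) and a well-separated one (right):

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# left: two heavily overlapping classes; right: two well-separated classes
overlap_x, overlap_y = make_blobs(n_samples=200, centers=[[0, 0], [1, 1]],
                                  cluster_std=1.5, random_state=0)
separate_x, separate_y = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                                    cluster_std=1.0, random_state=0)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(overlap_x[:, 0], overlap_x[:, 1], c=overlap_y, cmap='coolwarm', s=15)
axes[0].set_title('overlapping distributions: KNN works poorly')
axes[1].scatter(separate_x[:, 0], separate_x[:, 1], c=separate_y, cmap='coolwarm', s=15)
axes[1].set_title('separated distributions: KNN works well')
plt.show()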

As you can see, if your data looks like the picture on the left, KNN is not a good choice; if your data distribution looks like the picture on the right, KNN works reasonably well. Whether an algorithm is good or bad therefore depends mainly on whether it matches your data, so the prerequisite for using any algorithm is to know your data very well. Of course, the data in this example has only 2 features and can be visualized; in practice the data we encounter is usually high-dimensional and cannot be visualized at all, and in that case KNN can only serve as a small tool for probing the potential of the data.

7. Supplement:
KNN can also be adapted to multi-label scenarios. Multi-label machine learning (MLL) refers to prediction models with multiple y values. Of course, it is not only KNN that supports this; decision trees, linear regression, SVM, and RidgeCV can as well. Here is a multi-label regression case with KNN:

# Build the training and test sets: randomly pick 5 images as the test set and use the rest as the training set.
# The upper half of each training face is the feature; the lower half is the multi-label target.
# (The data here is assumed to be the 400 64x64 Olivetti face images, e.g. data = fetch_olivetti_faces().)
import random
random.seed(6)
choice = [random.randint(0, 40)*10 for i in range(5)]  # [360, 50, 310, 160, 20] -- the 5 test-set indices

test_img = data.data[choice]    # the 5 full-face test images
train_img = data.data[[i for i in list(range(400)) if i not in choice]]  # the 395 full-face training images

trainx = train_img.reshape(395, 64, 64)[:, :32, :]  # the 395 upper half-faces (training features)
trainy = train_img.reshape(395, 64, 64)[:, 32:, :]  # the 395 lower half-faces (training targets)
testx = test_img.reshape(5, 64, 64)[:, :32, :]    # the 5 upper half-faces of the test set
# (1) KNN multi-output regression
from sklearn.neighbors import KNeighborsRegressor
estimator1 = KNeighborsRegressor()
estimator1 = estimator1.fit(trainx.reshape(395, 2048), trainy.reshape(395, 2048))
predict1 = estimator1.predict(testx.reshape(5, 2048))

# (2) Extremely randomized trees
from sklearn.ensemble import ExtraTreesRegressor
estimator2 = ExtraTreesRegressor(n_estimators=10, max_features=32, random_state=123)
estimator2 = estimator2.fit(trainx.reshape(395, 2048), trainy.reshape(395, 2048))
predict2 = estimator2.predict(testx.reshape(5, 2048))

# (3) Linear regression
from sklearn.linear_model import LinearRegression
estimator3 = LinearRegression()
estimator3 = estimator3.fit(trainx.reshape(395, 2048), trainy.reshape(395, 2048))
predict3 = estimator3.predict(testx.reshape(5, 2048))

# (4) Ridge regression with built-in cross-validation
from sklearn.linear_model import RidgeCV
estimator4 = RidgeCV()
estimator4 = estimator4.fit(trainx.reshape(395, 2048), trainy.reshape(395, 2048))
predict4 = estimator4.predict(testx.reshape(5, 2048))

The prediction results are shown below (the plotting code displays the original test faces, then each test upper half stitched to the lower half predicted by KNN, predict1):

# Plotting code
import numpy as np
import matplotlib.pyplot as plt

# the 5 original full test faces
plt.figure(figsize=(12, 8))
for i in range(5):
    plt.subplot(1, 5, i + 1)
    plt.imshow(test_img[i].reshape(64, 64), cmap='gray')
    plt.xticks([]), plt.yticks([])

# each test upper half stitched to the lower half predicted by KNN (predict1)
plt.figure(figsize=(12, 8))
for i in range(5):
    plt.subplot(1, 5, i + 1)
    plt.imshow(np.concatenate((testx[i], predict1[i].reshape(32, 64)), axis=0), cmap='gray')
    plt.xticks([]), plt.yticks([])

Hahaha... did you notice that this multi-label regression is quite amazing? It can actually be used as a generative model!
