Python Machine Learning Guide from Scratch (3) - Supervised Learning: kNN Classification

Introduction

This blog post will use examples to introduce another major branch of 监督学习/supervised learning (SL): 分类/classification. To be precise, the classification algorithm we will use is 邻近算法/k-nearest neighbors (kNN).

Preparation before starting

Before starting, please make sure you have the following packages in your Python environment:
pandas, numpy, sklearn, seaborn (matplotlib is also used for plotting).

All the code in this article can be run in Jupyter Lab under Anaconda.

Main text

What is the nature of the 分类/classification problem?
Let's review 回归/regression, the topic discussed in the previous blog post. The essence of the regression problem is to find a regression line that predicts unknown data points as accurately as possible: given continuous variables as input, we get a continuous variable as output.

Classification problems are similar: we need to find a model that most accurately describes the relationship between the feature set and the label set. The difference is that the label set of a classification problem consists of non-continuous (discrete) values. For example, given a person's age, the person can only be classified as an adult or a minor; given a picture of an animal, we must accurately identify the type of animal. In other words, a classification problem can be understood as follows: for an unknown mapping

$$f:\mathbb{R}^n \mapsto \mathbb{L},$$

we know its 定义域/domain ($\mathbb{R}^n$, the feature space) and its 对应域/codomain ($\mathbb{L}$, the set of labels), and we want to use the existing data to find the model that best fits the data. Note that the label set $\mathbb{L}$ is a finite set. Therefore, unlike a regression function, a classification function can be a 非参数模型/nonparametric model, meaning it requires no parameters (such as the weight vector $\mathbf{w}$ in linear regression) to classify the data. The simplest example is the kNN algorithm introduced in this article.
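Before moving on to kNN, here is a toy illustration (not from the original post) of what a classification function with a finite label set looks like; the single feature and the threshold are made up for illustration:

def classify_age(age):
    # A hand-written classifier: maps one real-valued feature (age)
    # to an element of the finite label set {"minor", "adult"}.
    return "adult" if age >= 18 else "minor"

print(classify_age(15))  # minor
print(classify_age(42))  # adult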

What is kNN? How does it achieve classification?
We can find the answer in its name, k-nearest neighbors: for a new data point, the algorithm decides its label according to which labeled data it is closest to, much like the saying "he who stays near vermilion is stained red; he who stays near ink is stained black." When we humans see something we have never seen before, we compare it with the most similar things we have already seen.

For a new feature vector $x=[x_1\ x_2\ \dots\ x_n]$, the kNN algorithm considers the $k$ neighbors closest to it and uses their labels to determine the label of the new data point. In other words, we draw an $n$-dimensional ball centered at the unknown input point, sized so that exactly $k$ data points fall inside; those $k$ points are the nearest neighbors the algorithm considers. There are usually two decision rules (a small voting sketch follows the list):

  1. 多数决规则/Majority rule. The label that occurs most often among the neighbors is selected; ties are broken randomly. In the example below, the question-mark data point is classified as B when k=3 and as A when k=7.
    (Figure: majority rule)
  2. 基于距离的规则/Distance-based rule. Closer neighbors receive higher weights: the smaller the distance, the larger the weight. The label with the highest total weight is selected. In the figure below, the green data points are closer to the question-mark point, so they carry more weight in deciding its label.
    (Figure: distance-based rule)
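The two rules can be sketched in a few lines of Python (the function names and the inverse-distance weighting below are my own illustrative choices, not part of the original post):

from collections import Counter

def majority_vote(neighbor_labels):
    # Majority rule: the most common label among the k neighbors wins.
    return Counter(neighbor_labels).most_common(1)[0][0]

def distance_weighted_vote(neighbor_labels, neighbor_distances):
    # Distance-based rule: each neighbor votes with weight 1/distance,
    # so closer neighbors have more influence on the result.
    weights = {}
    for label, d in zip(neighbor_labels, neighbor_distances):
        weights[label] = weights.get(label, 0.0) + 1.0 / (d + 1e-12)
    return max(weights, key=weights.get)

print(majority_vote(['B', 'B', 'A']))                            # B
print(distance_weighted_vote(['B', 'B', 'A'], [2.0, 2.5, 0.1]))  # A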

To implement the second rule, and to find the nearest neighbors in the first place, we also need to define a measure of distance. For two feature vectors $x$ and $y$, the distance $d$ between them has the following common definitions:
  1. $L_1$ 曼哈顿距离/Manhattan distance: $d(x,y)=\sum_{i=1}^{n} |x_i-y_i|$. For two points in the plane, this is the sum of the coordinate differences between $x$ and $y$ (the sum of the lengths of the two legs of the right triangle in the figure below).
  2. $L_2$ 欧几里得距离/Euclidean distance: $d(x,y)=\sqrt{\sum_{i=1}^{n} |x_i-y_i|^2}$. For two points in the plane, this is the straight-line distance between them (the length of the hypotenuse of the right triangle in the figure below).
  3. $L_\infty$ 切比雪夫距离/Chebyshev distance: $d(x,y)=\max_{1\le i\le n}|x_i-y_i|$. For two points in the plane, this is the largest coordinate difference between $x$ and $y$ (the longer leg of the right triangle in the figure below).
(Figure: the three distance formulas)
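As a quick check of the three formulas, each can be computed in one line with numpy (the sample vectors are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

manhattan = np.sum(np.abs(x - y))          # L1: sum of coordinate differences -> 5.5
euclidean = np.sqrt(np.sum((x - y) ** 2))  # L2: straight-line distance -> about 3.64
chebyshev = np.max(np.abs(x - y))          # L-infinity: largest coordinate difference -> 3.0

print(manhattan, euclidean, chebyshev)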
From the above we can also see that kNN is a 非参数模型/nonparametric model: its decisions do not depend on a set of learned optimal weights, but on all of the input data. Consequently, since the model must keep the data intact and complete, kNN models are usually large in size.

The model's only 超参/hyperparameter is the number of neighbors $k$, but we also have to decide which label-determination rule and which distance metric to use, so in many cases we can try different combinations to find the most suitable model.
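For reference, scikit-learn's KNeighborsClassifier (used later in this post) exposes exactly these choices as constructor arguments; the values below are just examples:

from sklearn.neighbors import KNeighborsClassifier

# k neighbors, distance-weighted voting, Manhattan (L1) distance
clf = KNeighborsClassifier(n_neighbors=5, weights='distance', metric='manhattan')

# k neighbors, majority voting, Euclidean (L2) distance
# (uniform weights and Euclidean distance are the defaults)
clf = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='euclidean')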

One last point: not all data is suitable for kNN classification. When the data is not well separated (that is, data points with different labels overlap heavily), kNN will perform poorly. In that case we have to consider switching to a different model, or use 数据工程/data engineering to add extra information to the data so that the labels become separable.

Code

After understanding the principle, we can use Python to implement the kNN classification described above.

We first import the required libraries.

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

The dataset we use this time is included in the sklearn package and is used to classify iris flowers. The data has 150 rows and 4 columns. The feature set contains the length and width of the flower's petals and sepals; the label set is the species of iris (Setosa/mountain iris, Versicolour/variegated iris, Virginica/Virginia iris).

from sklearn.datasets import load_iris 
iris = load_iris() # load the dataset from sklearn

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names) # build a DataFrame

iris_df['Iris species'] = iris.target # add the label column to the DataFrame
print(iris_df) # take a look at it

The picture below shows the output of the above code. We can see that the feature set has four features (the first four columns of the DataFrame) and the label set is the iris species, encoded as numbers.
(Figure: the printed DataFrame)
We can do some simple visualization. seaborn is a very useful plotting library that saves us the time of plotting every pairwise combination of variables by hand. The code is as follows:

import seaborn as sb

sb.set(style="ticks", color_codes=True) # set the visual style

g = sb.pairplot(iris_df, hue="Iris species", diag_kind='hist')

The picture below is the result drawn by seaborn. From the figure we can observe the relationships between the different variables. As mentioned above, kNN needs the label classes to be separable. Overall, class 0 is well separated from the other classes, but there seems to be some overlap between classes 1 and 2. From this simple observation we can guess that the model will classify class 0 with higher accuracy and be less accurate at distinguishing classes 1 and 2.
(Figure: seaborn pairplot of the iris features, colored by species)
Next we train the model. Training the model is also very simple, as follows:

from sklearn.neighbors import KNeighborsClassifier
neighbors_num = 10 # number of neighbors to consider
weights = 'uniform' # majority (uniform-weight) voting rule

classifier = KNeighborsClassifier(n_neighbors=neighbors_num, weights=weights)

classifier.fit(iris.data, iris.target) # fit the classifier to the data
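Once fitted, the classifier can already predict labels for new measurements; for example (the measurements below are made up for illustration):

new_flower = [[5.1, 3.5, 1.4, 0.2]]  # sepal length, sepal width, petal length, petal width (cm)
print(classifier.predict(new_flower))        # predicted class, e.g. array([0]) for setosa
print(classifier.predict_proba(new_flower))  # fraction of the 10 neighbors in each class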

Our model is now trained; next, let's think about how to evaluate its performance. The measure used here is the 0-1损失函数/zero-one loss. Simply put, a correct classification scores 0 and an incorrect classification scores 1; we add up the scores of all classifications and divide by the total number of data points. The definition is:

$$\bar{E} = \frac{1}{m}\sum_{t \in \text{all data}} \mathbf{1}_{\mathrm{clf}(t) \ne t}$$

In other words, the accuracy is $1-\bar{E}$. We can compute it with sklearn.metrics:

from sklearn import metrics 

Y_pred = classifier.predict(iris.data) # labels predicted by the model

accuracy = metrics.accuracy_score(iris.target, Y_pred)

print('Training accuracy of kNN: {:.3f}'.format(accuracy))
# Training accuracy of kNN: 0.980
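As a quick sanity check, the same number can be computed directly from the definition of the zero-one loss above (a sketch reusing the variables already defined):

manual_accuracy = 1 - np.mean(Y_pred != iris.target)  # 1 minus the zero-one loss
print(manual_accuracy)  # 0.98, matching metrics.accuracy_score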

We can see that the accuracy is quite high. But how do we know which data points are misclassified? We can draw a 混淆矩阵/confusion matrix to judge this intuitively:

from sklearn.metrics import confusion_matrix

def show_confusion_matrix(true_labels, learned_labels, class_names):
    # build the confusion matrix with sklearn
    cmat = confusion_matrix(true_labels, learned_labels)

    # set the figure size
    plt.figure(figsize=(14, 5))
    plt.tick_params(labelsize=8)

    # draw the heatmap
    hm = sb.heatmap(cmat.T, square=True, annot=True, fmt='d', cbar=True,
                    xticklabels=class_names,
                    yticklabels=class_names,
                    cmap="seismic",
                    annot_kws={"size": 12},
                    cbar_kws={'label': 'Counts'})

    # style the colorbar label and ticks
    hm.figure.axes[-1].yaxis.label.set_size(10)
    hm.figure.axes[-1].tick_params(labelsize=8)

    # add axis titles
    plt.xlabel('True label', fontsize=9)
    plt.ylabel('Predicted label', fontsize=9)

    plt.show()

Y_test = iris.target # true labels
show_confusion_matrix(Y_test, Y_pred, iris.target_names)

We get the confusion matrix below. Just as we guessed, the model makes some errors when distinguishing class 1 (versicolor) from class 2 (virginica). Our model's accuracy is already quite high, but if it were low, we should consider adding more samples for the categories the model tends to confuse, or providing more informative features (such as leaf thickness).
(Figure: confusion matrix)
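If we want to see exactly which samples were misclassified, one simple way (a sketch reusing the variables above) is:

misclassified = np.where(Y_pred != iris.target)[0]
print(misclassified)               # indices of the misclassified samples
print(iris.target[misclassified])  # their true classes
print(Y_pred[misclassified])       # the classes the model predicted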
Of course, k=10 above is a number we chose arbitrarily. Is there a better number of neighbors? We can try different values of k and use 交叉验证/cross-validation to evaluate the performance of the different models, as follows:

from sklearn.model_selection import cross_validate

features = iris.data # feature set
labels = iris.target # label set

k_fold = 10

for k in [1,2,3,4,5,6,7,8,9,10,20,50]:
    classifier = KNeighborsClassifier(n_neighbors=k, weights='uniform')
    classifier.fit(features, labels)

    cv_results = cross_validate(classifier, features, labels, 
                                cv=k_fold, return_train_score=True)

    print('[{}-NN] Mean test score: {:.3f} (std: {:.3f})'
          '\nMean train score: {:.3f} (std: {:.3f})\n'.format(k,
                                                  np.mean(cv_results['test_score']),
                                                  np.std(cv_results['test_score']),
                                                  np.mean(cv_results['train_score']),
                                                  np.std(cv_results['train_score'])))
'''
[1-NN] Mean test score: 0.960 (std: 0.053)
Mean train score: 1.000 (std: 0.000)

[2-NN] Mean test score: 0.953 (std: 0.052)
Mean train score: 0.979 (std: 0.005)

[3-NN] Mean test score: 0.967 (std: 0.045)
Mean train score: 0.961 (std: 0.007)

[4-NN] Mean test score: 0.967 (std: 0.045)
Mean train score: 0.964 (std: 0.007)

[5-NN] Mean test score: 0.967 (std: 0.045)
Mean train score: 0.969 (std: 0.007)

[6-NN] Mean test score: 0.967 (std: 0.045)
Mean train score: 0.973 (std: 0.008)

[7-NN] Mean test score: 0.967 (std: 0.045)
Mean train score: 0.973 (std: 0.006)

[8-NN] Mean test score: 0.967 (std: 0.045)
Mean train score: 0.980 (std: 0.006)

[9-NN] Mean test score: 0.973 (std: 0.033)
Mean train score: 0.979 (std: 0.006)

[20-NN] Mean test score: 0.980 (std: 0.031)
Mean train score: 0.974 (std: 0.013)

[50-NN] Mean test score: 0.927 (std: 0.036)
Mean train score: 0.933 (std: 0.017)
'''

As can be seen above, k=1 seems to train best, but in that case the model is clearly overfitted, because its training accuracy is much higher than its validation accuracy. With k=50 the accuracy is much lower, because we do not have enough samples (only 150) and are considering too many neighbors, which makes the model's decisions misleading. We should aim for a k whose training and validation accuracies are both high and close to each other, so k=9 is a better choice.
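Instead of the manual loop above, scikit-learn can also search over the hyperparameter combinations (k, voting rule, distance metric) automatically with GridSearchCV; the parameter grid below is just an illustrative choice:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 11, 15, 20],
    'weights': ['uniform', 'distance'],
    'metric': ['manhattan', 'euclidean', 'chebyshev'],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)
search.fit(iris.data, iris.target)

print(search.best_params_)  # best combination found by 10-fold cross-validation
print(search.best_score_)   # its mean validation accuracy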

Conclusion

In the next blog post, the blogger will introduce how to group data into classes using K-均值聚类/K-means clustering, an algorithm from unsupervised learning. If you have any questions or suggestions, please feel free to comment or send a private message. Writing is not easy; if you like the blogger's content, please like and support!
