Machine learning - the k-nearest neighbor algorithm for classification and its use in Python


A note before we begin. To paraphrase my teacher: the frontier of every natural science today is, at bottom, mathematics, and machine learning demands even more of it. Open any machine learning book and you are met with pages of formulas, partial derivatives, integrals and so on. For someone like me with a weak mathematical foundation, it is honestly a headache, borderline torture. But if you give up just because it is hard, it becomes a regret, and that regret lasts a lifetime.
There is a saying I am fond of these days: we persevere not because we are chasing great achievements, but so that we have fewer regrets when we are old!

What is the k-nearest neighbor algorithm

k-Nearest Neighbor (kNN) learning is a commonly used supervised learning method. Its working mechanism is very simple: given a test sample, find the k training samples closest to it in the training set according to some distance metric, then make a prediction based on the information carried by these k "neighbors". In classification tasks the usual choice is the "voting method": the class label that appears most often among the k samples becomes the prediction. In regression tasks the usual choice is the "averaging method": the mean of the real-valued outputs of the k samples becomes the prediction. You can also use distance-weighted voting or averaging, where closer neighbors receive larger weights.
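All three options (majority voting, averaging for regression, and distance weighting) are available as switches in scikit-learn, which we will use later anyway. A minimal sketch on a made-up toy dataset (the numbers here are purely illustrative):

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy data: four 1-D samples (hypothetical, just for illustration)
X = [[0.0], [0.5], [2.0], [2.5]]
y_class = [0, 0, 1, 1]           # class labels for the voting example
y_value = [1.0, 1.2, 3.0, 3.3]   # real-valued targets for the averaging example

# Classification: plain majority vote vs. distance-weighted vote
clf_vote = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y_class)
clf_weighted = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y_class)

# Regression: the prediction is the (possibly weighted) average of the neighbors' targets
reg = KNeighborsRegressor(n_neighbors=3, weights='distance').fit(X, y_value)

print(clf_vote.predict([[1.0]]), clf_weighted.predict([[1.0]]), reg.predict([[1.0]]))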
Compared with other machine learning algorithms, kNN has almost no training process: in other words, there is no explicit training process. It is the best-known representative of "lazy learning". Methods of this kind simply store the training samples during the training phase, so the training cost is essentially zero, and defer all the work until a test sample arrives.
The mathematics required by kNN is almost nil, and the method is very easy to understand.

The k-nearest neighbor algorithm process

Next, let’s take a look at its algorithm process:
1. Calculate the distance between the data to be tested and each existing data point;

2. Sort according to increasing distance;

3. Select the k points with the smallest distance;

4. Take the most common category among these k points as the category of the data to be tested.

Let's use an example to understand this process.
[Figure: blue squares and red triangles are training samples; the green circle is the test sample, with two circles marking its 3 and 5 nearest neighbors]
The blue squares and red triangles in the picture are training samples, and the green circle is the test sample. How the green circle gets classified depends on the chosen value of k (the number of neighbors). When k is 3, that is, the three training samples inside the solid inner circle, red triangles outnumber blue squares, so under this k value the green circle is assigned to the red-triangle class. Similarly, when k is 5, blue squares outnumber red triangles among the five nearest neighbors, so the green circle is assigned to the blue-square class instead.
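The four steps listed above translate almost line for line into code. Here is a minimal NumPy sketch of that process (not the scikit-learn implementation used later in this post; the tiny training set mirrors the squares-and-triangles figure and is made up for illustration):

import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_x, k=3):
    """Classify a single test point with a plain kNN vote."""
    # 1. Compute the (Euclidean) distance from the test point to every training sample
    distances = np.sqrt(((train_X - test_x) ** 2).sum(axis=1))
    # 2. & 3. Sort by distance and keep the indices of the k closest samples
    nearest = np.argsort(distances)[:k]
    # 4. Take the most common label among those k neighbors
    return Counter(train_y[nearest]).most_common(1)[0][0]

# Tiny illustration mirroring the figure: squares are class 0, triangles are class 1
train_X = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
train_y = np.array([0, 0, 1, 1, 1])
print(knn_predict(train_X, train_y, np.array([2.9, 3.0]), k=3))  # -> 1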

Have you got the hang of the kNN algorithm? Then let's go a little further. The "nearest" above was measured with the familiar Euclidean distance, but the features of a sample do not have to be compared with Euclidean distance alone. So let's look at some other "distances":

Hamming distance: the number of positions at which two strings of equal length differ. It is named after Richard Wesley Hamming. In information theory, the Hamming distance between two equal-length strings is the number of positions where the corresponding characters differ; in other words, the number of substitutions needed to turn one string into the other;

Mahalanobis distance: a distance that takes the covariance of the data into account, used to measure how far apart two samples (or sample sets) are relative to the spread of the data itself;

Cosine distance: uses the angle between two vectors as the measure of distance;

Manhattan distance: the sum of the absolute differences between the two points' coordinates along each axis;

Chebyshev distance: the maximum absolute difference between the two points' coordinates along any axis;

Standardized Euclidean distance: the Euclidean distance in which each coordinate difference is first divided by the standard deviation of that feature.

In addition, there is another distance called the Minkowski distance, defined as follows:

d(x, y) = (|x1 − y1|^q + |x2 − y2|^q + ... + |xn − yn|^q)^(1/q)

When q is 1, it is the Manhattan distance. When q is 2, it is the Euclidean distance.
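A quick way to convince yourself of those two special cases is to compute the Minkowski distance directly; a small NumPy sketch with two arbitrary points:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def minkowski(a, b, q):
    """General Minkowski distance: (sum |a_i - b_i|^q)^(1/q)."""
    return (np.abs(a - b) ** q).sum() ** (1.0 / q)

print(minkowski(x, y, 1))             # 5.0  -> same as the Manhattan distance
print(np.abs(x - y).sum())            # 5.0  (Manhattan distance computed directly)

print(minkowski(x, y, 2))             # ~3.606 -> same as the Euclidean distance
print(np.sqrt(((x - y) ** 2).sum()))  # ~3.606 (Euclidean distance computed directly)

print(np.abs(x - y).max())            # 3.0 -> Chebyshev distance, the limit as q grows large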

That is a lot of distances to take in at once, and it is fine if they still feel fuzzy. The definition of distance is a core concept in machine learning, and you will meet it again and again in later studies. One practical reason for introducing it here: when you use the k-nearest neighbor algorithm and find that the results are not very good, trying a different distance definition is one way to improve the algorithm's performance.
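In scikit-learn, swapping the distance is a one-line change: KNeighborsClassifier accepts a metric parameter (and p for the Minkowski family). A small sketch on made-up toy data:

from sklearn.neighbors import KNeighborsClassifier

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]  # hypothetical toy features
y = [0, 0, 0, 1, 1, 1]

# Default metric is Minkowski with p=2, i.e. the Euclidean distance
clf_euclidean = KNeighborsClassifier(n_neighbors=3)
# Manhattan distance: metric='manhattan' (equivalently metric='minkowski' with p=1)
clf_manhattan = KNeighborsClassifier(n_neighbors=3, metric='manhattan')
# Chebyshev distance
clf_chebyshev = KNeighborsClassifier(n_neighbors=3, metric='chebyshev')

for clf in (clf_euclidean, clf_manhattan, clf_chebyshev):
    clf.fit(X, y)
    print(clf.predict([[3, 3]]))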

Code implementation with sklearn

Dataset introduction

sklearn ships with the dataset we will use below, the wine dataset, which can be loaded with the following two lines of code:

from sklearn.datasets import load_wine
wine_dataset = load_wine()

Printing the wine dataset gives the following output:

{
    
    'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 2]),
 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7'),
 'DESCR': '.. _wine_dataset:\n\nWine recognition dataset\n------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 178 (50 in each of three classes)\n    :Number of Attributes: 13 numeric, predictive attributes and the class\n    :Attribute Information:\n \t\t- Alcohol\n \t\t- Malic acid\n \t\t- Ash\n\t\t- Alcalinity of ash  \n \t\t- Magnesium\n\t\t- Total phenols\n \t\t- Flavanoids\n \t\t- Nonflavanoid phenols\n \t\t- Proanthocyanins\n\t\t- Color intensity\n \t\t- Hue\n \t\t- OD280/OD315 of diluted wines\n \t\t- Proline\n\n    - class:\n            - class_0\n            - class_1\n            - class_2\n\t\t\n    :Summary Statistics:\n    \n    ============================= ==== ===== ======= =====\n                                   Min   Max   Mean     SD\n    ============================= ==== ===== ======= =====\n    Alcohol:                      11.0  14.8    13.0   0.8\n    Malic Acid:                   0.74  5.80    2.34  1.12\n    Ash:                          1.36  3.23    2.36  0.27\n    Alcalinity of Ash:            10.6  30.0    19.5   3.3\n    Magnesium:                    70.0 162.0    99.7  14.3\n    Total Phenols:                0.98  3.88    2.29  0.63\n    Flavanoids:                   0.34  5.08    2.03  1.00\n    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12\n    Proanthocyanins:              0.41  3.58    1.59  0.57\n    Colour Intensity:              1.3  13.0     5.1   2.3\n    Hue:                          0.48  1.71    0.96  0.23\n    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71\n    Proline:                       278  1680     746   315\n    ============================= ==== ===== ======= =====\n\n    :Missing Attribute Values: None\n    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)\n    :Creator: R.A. Fisher\n    :Donor: Michael Marshall (MARSHALL%[email protected])\n    :Date: July, 1988\n\nThis is a copy of UCI ML Wine recognition datasets.\nhttps://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data\n\nThe data is the results of a chemical analysis of wines grown in the same\nregion in Italy by three different cultivators. There are thirteen different\nmeasurements taken for different constituents found in the three types of\nwine.\n\nOriginal Owners: \n\nForina, M. et al, PARVUS - \nAn Extendible Package for Data Exploration, Classification and Correlation. \nInstitute of Pharmaceutical and Food Analysis and Technologies,\nVia Brigata Salerno, 16147 Genoa, Italy.\n\nCitation:\n\nLichman, M. (2013). UCI Machine Learning Repository\n[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,\nSchool of Information and Computer Science. \n\n.. topic:: References\n\n  (1) S. Aeberhard, D. Coomans and O. de Vel, \n  Comparison of Classifiers in High Dimensional Settings, \n  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  \n  Mathematics and Statistics, James Cook University of North Queensland. \n  (Also submitted to Technometrics). \n\n  The data was used with many others for comparing various \n  classifiers. The classes are separable, though only RDA \n  has achieved 100% correct classification. \n  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) \n  (All results using the leave-one-out technique) \n\n  (2) S. Aeberhard, D. Coomans and O. de Vel, \n  "THE CLASSIFICATION PERFORMANCE OF RDA" \n  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. 
of \n  Mathematics and Statistics, James Cook University of North Queensland. \n  (Also submitted to Journal of Chemometrics).\n',
 'feature_names': ['alcohol',
  'malic_acid',
  'ash',
  'alcalinity_of_ash',
  'magnesium',
  'total_phenols',
  'flavanoids',
  'nonflavanoid_phenols',
  'proanthocyanins',
  'color_intensity',
  'hue',
  'od280/od315_of_diluted_wines',
  'proline']}

Each sample in the wine dataset has thirteen features and belongs to one of three classes. Next, we will use this dataset for training and prediction.
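Before training, it is worth confirming those numbers directly from the loaded object; a quick check:

import numpy as np
from sklearn.datasets import load_wine

wine_dataset = load_wine()
print(wine_dataset.data.shape)           # (178, 13): 178 samples, 13 features
print(wine_dataset.target_names)         # ['class_0' 'class_1' 'class_2']
print(np.bincount(wine_dataset.target))  # samples per class: [59 71 48]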

Standardization

Let's first take a look at the mean and standard deviation of each feature in the wine dataset:

from sklearn.datasets import load_wine
wine_dataset = load_wine()
print(wine_dataset.data.mean(0))
print(wine_dataset.data.std(0))

The printed results are as follows:

[1.30006180e+01 2.33634831e+00 2.36651685e+00 1.94949438e+01 9.97415730e+01 2.29511236e+00 2.02926966e+00 3.61853933e-01 1.59089888e+00 5.05808988e+00 9.57449438e-01 2.61168539e+00 7.46893258e+02]
[8.09542915e-01 1.11400363e+00 2.73572294e-01 3.33016976e+00 1.42423077e+01 6.24090564e-01 9.96048950e-01 1.24103260e-01 5.70748849e-01 2.31176466e+00 2.27928607e-01 7.07993265e-01 3.14021657e+02]

As the output shows, some features have much larger means and standard deviations than others; the last feature is a striking example. If we feed such data to the kNN algorithm as-is, it will effectively treat the last feature as more important: if the last-feature values of two samples are, say, 1 and 100, that single feature can dominate the distance between the two samples, which is likely to hurt the accuracy of kNN. To solve this problem, we can standardize the data.
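A quick way to see the problem on this very dataset: compare how much each feature contributes to the squared Euclidean distance between two samples, say the first and the last sample from the printout above:

import numpy as np
from sklearn.datasets import load_wine

X = load_wine().data
a, b = X[0], X[-1]   # the first and the last sample shown in the printed array

# Per-feature contribution to the squared Euclidean distance between the two samples
contrib = (a - b) ** 2
print(np.round(contrib / contrib.sum(), 4))  # the last feature (proline) accounts for nearly all of it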

There are many ways to standardize data, and the most commonly used is StandardScaler. StandardScaler standardizes each feature by removing its mean and scaling it to unit variance, so that after the transformation each feature has mean 0 and standard deviation 1.

Let z be the feature after standardization, x the feature before standardization, μ the mean of the feature, and s its standard deviation. Then StandardScaler can be expressed as
z = (x − μ) / s
The relevant interface code of sklearn is as follows:

from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
# Instantiate a StandardScaler object
scaler = StandardScaler()
# Fit on data (learning its mean and standard deviation), standardize it,
# and save the result to after_scaler
after_scaler = scaler.fit_transform(data)
# Use the already-fitted StandardScaler to standardize new data
after_scaler2 = scaler.transform([[2, 2]])
print(after_scaler)
print(after_scaler2)

Print the result:

[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
[[3. 3.]]

As the output shows, after the transformation the data has been scaled to a distribution with mean 0 and standard deviation 1. The new point [2, 2] maps to [3, 3] because each column of data has mean 0.5 and standard deviation 0.5, and (2 − 0.5) / 0.5 = 3.

Code

Without further ado, here is the code:

from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def classification(train_feature, train_label, test_feature):
    '''
    Classify the wine samples in test_feature.
    :param train_feature: training-set features, ndarray
    :param train_label: training-set labels, ndarray
    :param test_feature: test-set features, ndarray
    :return: predicted labels for the test set
    '''
    # Instantiate the model
    clf = KNeighborsClassifier()
    # Fit the model on the training data
    clf.fit(train_feature, train_label)
    # Return the predictions
    return clf.predict(test_feature)

def score(predict_labels, real_labels):
    '''
    Score the predictions; only test-set accuracy is considered!
    '''
    num = 0.
    length = len(predict_labels)
    for i in range(length):
        if predict_labels[i] == real_labels[i]:
            num = num + 1
    print("Prediction accuracy:", num / length)


# Load the wine dataset
wine_dataset = load_wine()

# Split the dataset: X_train, X_test, y_train, y_test are the training features,
# test features, training labels and test labels respectively
X_train, X_test, y_train, y_test = train_test_split(wine_dataset['data'],
                                                    wine_dataset['target'],
                                                    test_size=0.3)

# Train and predict directly, without standardizing the data
print("Model trained without data standardization")
predict1 = classification(X_train, y_train, X_test)
score(predict1, y_test)

print("\n")

# Now the same experiment with standardized data

# Instantiate the scaler
scaler = StandardScaler()

# Standardize the data: fit the scaler on the training set only,
# then apply the same transformation to the test set
train_data = scaler.fit_transform(X_train)
test_data = scaler.transform(X_test)
print("Model trained after standardization")
predict2 = classification(train_data, y_train, test_data)
score(predict2, y_test)

Training and prediction are wrapped in a single function above so that the models trained on unstandardized and standardized data are easy to compare. The final result:

[Output: prediction accuracy of roughly 0.7 without standardization versus about 0.98 after standardization]

As you can see, the model trained without standardization reaches an accuracy of only about 0.7, while the model trained on standardized data reaches about 0.98.
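Two refinements worth trying on top of this (a sketch, not part of the original experiment): wrap the scaler and the classifier in a Pipeline, so the scaler is always fitted on the training portion only, and search over n_neighbors with cross-validation:

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The pipeline guarantees the scaler is fitted on the training fold only
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Grid-search the number of neighbors with 5-fold cross-validation
param_grid = {'kneighborsclassifier__n_neighbors': range(1, 16)}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))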

This blog draws on Zhou Zhihua's "Machine Learning", the machine learning training projects on the EduCoder platform, and other sources.


Origin blog.csdn.net/qq_44725872/article/details/108939815