Python Implementation of the k-Nearest Neighbor Algorithm (Machine Learning)

My recent study of machine learning starts here, with the simplest algorithm: k-nearest neighbor.
This article resembles others on the topic, some quite closely. Why is that? Because it stands on the shoulders of its predecessors.

1. What is k-nearest neighbor?

The k-nearest neighbor (k-Nearest Neighbor, KNN) classification algorithm is a theoretically mature approach and one of the simplest machine learning algorithms.
The idea of the method is: in feature space, if most of the k samples nearest to a given sample (i.e., its nearest neighbors in feature space) belong to a certain category, then the sample likely belongs to that category as well.

2. Principle, advantages, and disadvantages

In formal terms, the K-nearest neighbor algorithm works as follows: given a training data set and a new input instance, find the K instances in the training data set nearest to the new instance (its K neighbors); whichever class holds the majority among those K instances is the class assigned to the input.
In other words: the labels of the training samples are known, while the label of the new sample is not. Compute the distance from the new sample to every training sample; the new sample takes the label of the samples nearest to it.
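
For readers who want to use KNN before implementing it by hand, the same idea ships in scikit-learn as KNeighborsClassifier. A minimal sketch (my own addition, assuming scikit-learn is installed), using the same toy data that appears in the code later in this post:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: class A in the lower left, class B in the upper right
X = np.array([[10, 10], [17, 20], [20, 30], [100, 80], [110, 70], [120, 90]])
y = ['A', 'A', 'A', 'B', 'B', 'B']

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)                    # "training" just stores the samples
print(clf.predict([[40, 50]]))   # ['A']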

The following advantages and disadvantages of k-nearest neighbor are quoted from "Machine learning principles, advantages and disadvantages - K-nearest neighbor (KNN) algorithm".

Algorithm advantages:

  1. Simple, easy to understand, easy to implement; there are no parameters to estimate.
  2. Training time is zero. KNN has no explicit training phase: unlike other supervised algorithms, which fit a model (a fitting function) on a training set and then evaluate it on a validation or test set, KNN simply stores the samples and does its work when test data arrives.
  3. KNN handles classification problems, copes naturally with multi-class classification, and is suitable for classifying rare events.
  4. It is especially suitable for multi-label problems (multi-modal, where an object carries several category labels), where KNN performs better than SVM.
  5. KNN can also handle regression problems, i.e., predicting continuous values (see the sketch after this list).
  6. Compared with algorithms such as naive Bayes, it makes no assumptions about the data, achieves high accuracy, and is insensitive to outliers.
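
To illustrate point 5, here is a minimal sketch of KNN regression (my own illustration, not from the quoted source): the prediction is simply the mean target value of the k nearest training samples. The array values is an assumption of mine, holding one numeric target per training sample:

import numpy as np

def knn_regress(new_array, dataset, values, k):
    # Euclidean distance from the new instance to every training sample
    distance = np.sqrt(((np.asarray(new_array) - dataset) ** 2).sum(axis=1))
    # the prediction is the mean target value of the k nearest samples
    nearest = distance.argsort()[:k]
    return np.asarray(values)[nearest].mean()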

Algorithm disadvantages:

  1. Computationally intensive, especially when the number of features is very large. Every item to be classified must compute its distance to all known samples in order to find its K nearest neighbors.
  2. Poor interpretability: it cannot produce explicit rules the way a decision tree can.
  3. KNN is a lazy learner; essentially no learning takes place, so prediction is slower than with eager algorithms such as logistic regression.
  4. When classes are unbalanced, prediction accuracy for rare categories is low. If one class has a very large sample size and the others are very small, the K neighbors of a new sample may be dominated by the large class (a common mitigation, distance-weighted voting, is sketched after this list).
  5. Strong dependence on the training data and poor fault tolerance. If the training set contains even one or two mislabeled points that happen to sit right next to a value to be classified, the prediction will be directly inaccurate.
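
Regarding point 4, one common mitigation (not covered in this post) is distance-weighted voting, where close neighbors count more than far ones. A minimal sketch of my own, reusing the same dataset/lables layout as the code below:

import numpy as np

def classify_knn_weighted(new_array, dataset, lables, k):
    # Euclidean distance from the new instance to every training sample
    distance = np.sqrt(((np.asarray(new_array) - dataset) ** 2).sum(axis=1))
    weights = {}
    for i in distance.argsort()[:k]:
        # weight each vote by inverse distance; the small constant
        # avoids division by zero when the new point equals a sample
        weights[lables[i]] = weights.get(lables[i], 0) + 1 / (distance[i] + 1e-8)
    return max(weights, key=weights.get)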

3. Code Example

Take a look at the following example; the classification is based primarily on the X-axis and Y-axis values. (It is not rigorous in places, please bear with me.)
Here we will use the distance formula between two points learned back in high school; in this case, of course, the coordinates are two-dimensional:
distance = √((x1 - x2)² + (y1 - y2)²)
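
As a quick sanity check (my own addition), numpy evaluates this formula directly; here [17, 20] is one of the sample points used below and [40, 50] is the point we will classify:

import numpy as np

a = np.array([17, 20])
b = np.array([40, 50])

# the formula above, written out
d1 = np.sqrt(((a - b) ** 2).sum())
# the same distance via numpy's built-in norm
d2 = np.linalg.norm(a - b)
print(d1, d2)  # both print about 37.8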
There are two classes of sample data, A (yellow) and B (green); now a new data point, the asterisk, is added. Which class does the asterisk belong to?
It certainly looks like class A, but if we could settle that just by looking, this article would be pointless.
Let's implement it with the following code.
First, the sample data set.

import numpy as np


# dataset() returns the training data set
def dataset():
    """
    :return: the sample data and their labels
    """
    # dataset_ is the training data; lables are the matching labels
    dataset_ = np.array([[10, 10], [17, 20], [20, 30], [100, 80], [110, 70], [120, 90]])
    lables = ['A', 'A', 'A', 'B', 'B', 'B']
    return dataset_, lables
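
Since the original post's figures did not survive, a quick matplotlib scatter plot can stand in for them. A minimal sketch (my own addition, assuming matplotlib is installed; the star's red color is my choice), reusing the dataset() function above:

import matplotlib.pyplot as plt

data_, lables_ = dataset()
# first three rows are class A (yellow), last three are class B (green)
plt.scatter(data_[:3, 0], data_[:3, 1], c='gold', label='A')
plt.scatter(data_[3:, 0], data_[3:, 1], c='green', label='B')
# the new point to classify, drawn as an asterisk
plt.scatter(40, 50, marker='*', s=200, c='red', label='new point')
plt.legend()
plt.show()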

Next, implement k-nearest neighbor in Python.

def classify_knn(new_array, dataset, lables, k):
    """
    :param new_array: the new instance
    :param dataset:   the training data set
    :param lables:    the labels of the training set
    :param k:         the number of nearest neighbors
    :return:          the label the algorithm assigns
    """
    # dataset.shape[0] is the number of training samples; np.tile repeats
    # the new instance that many times, once per training sample;
    # we then square the differences, sum each row with sum(axis=1),
    # and take the square root to get the distance between the two points
    datasetsize = dataset.shape[0]
    diffmat = np.tile(new_array, (datasetsize, 1)) - dataset
    sqrdiffmat = diffmat ** 2
    distance = sqrdiffmat.sum(axis=1) ** 0.5

    # np.argsort returns the indices that sort the distances in ascending order
    # dict.get returns the value for a key, or the default if the key is absent
    # using the sorted indices, tally the labels of the k nearest samples
    sortdistance = distance.argsort()
    count = {}
    for i in range(k):
        volelable = lables[sortdistance[i]]
        count[volelable] = count.get(volelable, 0) + 1
    count_list = sorted(count.items(), key=lambda x: x[1], reverse=True)
    return count_list[0][0]
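
As an aside, the vote-counting loop above can be written more compactly with the standard library's collections.Counter; a minimal sketch of my own (same result, except possibly for tie-breaking order):

from collections import Counter

def majority_vote(nearest_lables):
    # most_common(1) returns the (label, count) pair with the highest count
    return Counter(nearest_lables).most_common(1)[0][0]

# inside classify_knn this replaces the counting loop:
# return majority_vote([lables[i] for i in sortdistance[:k]])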
    

if __name__ == '__main__':
    data, lables = dataset()    # renamed so the dataset() function is not shadowed
    result = classify_knn([40, 50], data, lables, 5)
    print(result)  # A

With the asterisk at coordinates [10, 50], the result returned is A;
with the asterisk at coordinates [100, 10], the result returned is B.
Bingo!
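
To reproduce those two results, append the corresponding calls to the __main__ block above (my own addition; the expected outputs follow from the distances):

    print(classify_knn([10, 50], data, lables, 5))   # A
    print(classify_knn([100, 10], data, lables, 5))  # B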

4. Summary

Three steps:

  1. Prepare the training data set.
  2. Find the K instances in the training data set nearest to the new instance.
  3. Take the majority class among those K instances as the conclusion.

My personal skill being limited, this entry-level machine learning algorithm, k-nearest neighbor, took me several hours to figure out even a little bit. Saying that out loud is rather laughable.
For any errors and omissions in the article, I implore you, do not hesitate to correct me.
Finished.


Origin blog.csdn.net/weixin_45081575/article/details/103578815