Hand-writing Sklearn's encapsulated kNN algorithm in Python

Abstract: Writing an Sklearn-style kNN algorithm package step by step in Python.

Yesterday, through a story about guessing wines in a bar, I introduced one of the simplest machine learning algorithms: kNN (k-nearest neighbors), and implemented it step by step in Python. For comparison, I also called the kNN algorithm from the Sklearn package, which took only five lines. Both methods reached the same answer and correctly solved the binary classification problem: the newly poured wine is a Cabernet Sauvignon.

Calling the Sklearn algorithm library solves the problem in a few simple lines of code, which feels great, but it is a black box: we don't really understand what Sklearn does behind the scenes. As a beginner, if you call the package directly without figuring out how the algorithm works, what you learn stays on the surface and is of little use.

So today let's look at how the kNN algorithm encapsulated in Sklearn can be implemented in Python by ourselves. That way, the next time we call the Sklearn algorithm package, we'll have a clearer understanding of what it's doing.

Let's first review yesterday's five lines of Sklearn kNN code:

```python
from sklearn.neighbors import KNeighborsClassifier
kNN_classifier = KNeighborsClassifier(n_neighbors=3)
kNN_classifier.fit(X_train, y_train)
x_test = x_test.reshape(1, -1)
kNN_classifier.predict(x_test)[0]
```

The code was explained yesterday; the diagram below should deepen the understanding:

[Figure: the Sklearn workflow — fit the chosen algorithm on the training data to get a model, then call predict on new data to get the output]

It's fair to say that almost every machine learning algorithm called through Sklearn follows this routine: feed the training data to the chosen algorithm and fit it to obtain a model; then feed the data to be predicted to that model, call predict, and get the final output. This holds for classification and regression algorithms alike.
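As a small illustration (my addition, not from the original post), swapping in a completely different Sklearn estimator, say LogisticRegression, leaves the routine structurally unchanged:

```python
# a minimal sketch of the same fit/predict routine with another estimator;
# LogisticRegression is just an illustrative stand-in, using the
# X_train/y_train/X_predict arrays defined later in this post
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()       # choose an algorithm
model.fit(X_train, y_train)        # fit it on the training data -> a model
y_pred = model.predict(X_predict)  # feed new data to the model -> predictions
```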

One point worth noting: kNN is a special algorithm in that it doesn't need to train (fit) a model at all; it can predict an outcome by comparing the test data directly against the training set. This is also why kNN is one of the simplest machine learning algorithms.

Why, then, does the Sklearn code above still have a fit step? Strictly speaking it isn't needed, but Sklearn keeps its interface neat and uniform, so to stay consistent with the majority of algorithms, kNN treats the training set itself as the model.

As we learn more algorithms later, you'll find that each has characteristics worth summarizing and comparing.

Here is yesterday's handwritten code organized into a function; notice that there is no training step:

```python
import numpy as np
from math import sqrt
from collections import Counter

def kNNClassify(K, X_train, y_train, X_predict):
    distances = [sqrt(np.sum((x - X_predict)**2)) for x in X_train]
    sort = np.argsort(distances)
    topK = [y_train[i] for i in sort[:K]]
    votes = Counter(topK)
    y_predict = votes.most_common(1)[0][0]
    return y_predict
```
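As a quick sanity check (my addition, assuming the X_train and y_train arrays defined in the test-data section below), the function can be called directly:

```python
# quick check of the function version; X_train and y_train are the
# arrays defined in the test-data section later in this post
x_new = np.array([12.08, 3.3])
print(kNNClassify(3, X_train, y_train, x_new))  # -> 1, matching the result below
```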

Next, following the idea in the diagram, let's encapsulate the kNN algorithm the way Sklearn does, and build up step by step what those 5 lines of code are actually running:

```python
import numpy as np
from math import sqrt
from collections import Counter

class kNNClassifier:
    def __init__(self, k):
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        self._X_train = X_train
        self._y_train = y_train
        return self
```

First, we rewrite the earlier function as a class named kNNClassifier, because the algorithms in Sklearn are all object-oriented, and a class is more convenient to use.

In __init__ we define three initial variables; k is the number of nearest-neighbor points we choose.

self._X_train and self._y_train carry a leading underscore _, meaning they are internal private variables: they should only be changed by operations inside the class, not from outside.

Then we define the fit method, which fits the kNN model. Since a kNN model needs no actual fitting, we simply store the two data sets unchanged and finally return the instance itself.
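One small benefit of returning self (my observation, not spelled out in the post): construction and fitting can be chained, mirroring Sklearn's own style:

```python
# because fit returns self, construction and fitting can be chained
clf = kNNClassifier(3).fit(X_train, y_train)
```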

Here we also constrain the input variables: X_train and y_train must have the same number of rows, and the k we chose must not be invalid, for instance negative or greater than the number of sample points, or subsequent calculations will go wrong. To enforce these constraints we can use assert statements:

```python
def fit(self, X_train, y_train):
    assert X_train.shape[0] == y_train.shape[0], \
        "asserts ensure the data set and k are valid; without them, bad input makes errors hard to trace"
    assert self.k <= X_train.shape[0]
    self._X_train = X_train
    self._y_train = y_train
    return self
```
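To see the assertions pay off, here is a hypothetical failure case (my addition): choosing a k larger than the number of samples now fails immediately with a clear AssertionError instead of misbehaving later.

```python
# hypothetical failure demo: k = 100 exceeds the 10 training samples,
# so fit raises AssertionError right away instead of failing downstream
clf = kNNClassifier(100)
clf.fit(X_train, y_train)  # AssertionError
```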

Next we pass in the sample point to be predicted and compute its distance to every sample in the training set. This corresponds to Sklearn's predict and is the core of the algorithm. The code for this step is what we wrote before and can be reused directly, plus a few assert lines to make sure the input variables are reasonable.
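For reference, the distance the code computes is the ordinary Euclidean distance between the point to predict $x$ and each training sample $x^{(i)}$, where $m$ is the number of features:

$$d\left(x, x^{(i)}\right) = \sqrt{\sum_{j=1}^{m}\left(x_j - x^{(i)}_j\right)^2}$$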

```python
def predict(self, X_predict):
    assert self._X_train is not None, \
        "fit must be run before predict, so that self._X_train is not empty"
    assert self._y_train is not None
    assert X_predict.shape[1] == self._X_train.shape[1], \
        "the prediction set must have the same number of features as the training set"

    distances = [sqrt(np.sum((x_train - X_predict)**2)) for x_train in self._X_train]
    sort = np.argsort(distances)
    topK = [self._y_train[i] for i in sort[:self.k]]
    votes = Counter(topK)
    y_predict = votes.most_common(1)[0][0]
    return y_predict
```
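A side note on the distance loop (my addition, not how the post's code does it): NumPy can compute all the distances in one vectorized expression, which is usually faster than a Python list comprehension:

```python
import numpy as np

# a vectorized alternative to the list-comprehension distance loop (a sketch);
# X_train has shape (n_samples, n_features), x is a single sample;
# broadcasting subtracts x from every row, axis=1 sums the squares per sample
X_train = np.array([[13.23, 5.64], [12.07, 2.76]])
x = np.array([12.08, 3.3])
distances = np.sqrt(np.sum((X_train - x)**2, axis=1))  # shape (n_samples,)
```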

At this point we've completed a simple Sklearn-style kNN package. Save it as a file named kNN_sklearn.py, then test it in a Jupyter notebook:
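For convenience (my assembly; the original shows the pieces separately), here is everything so far collected into one runnable kNN_sklearn.py:

```python
# kNN_sklearn.py -- the pieces above assembled into one runnable file
import numpy as np
from math import sqrt
from collections import Counter

class kNNClassifier:
    def __init__(self, k):
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        assert X_train.shape[0] == y_train.shape[0]
        assert self.k <= X_train.shape[0]
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        assert self._X_train is not None and self._y_train is not None
        assert X_predict.shape[1] == self._X_train.shape[1]
        distances = [sqrt(np.sum((x_train - X_predict)**2))
                     for x_train in self._X_train]
        sort = np.argsort(distances)
        topK = [self._y_train[i] for i in sort[:self.k]]
        return Counter(topK).most_common(1)[0][0]
```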

First, prepare the basic data:

```python
# sample set
X_raw = [[13.23, 5.64],
         [13.2 , 4.38],
         [13.16, 4.68],
         [13.37, 4.8 ],
         [13.24, 4.32],
         [12.07, 2.76],
         [12.43, 3.94],
         [11.79, 3.  ],
         [12.37, 2.12],
         [12.04, 2.6 ]]
X_train = np.array(X_raw)

# class labels
y_raw = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_train = np.array(y_raw)

# point to predict
x_test = np.array([12.08, 3.3])
X_predict = x_test.reshape(1, -1)
```

Note: even when there is only one point to predict, we must reshape it with reshape(1, -1) into a two-dimensional array, or an error will occur.
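A quick illustration of why the reshape matters (my addition): a 1-D point has no second shape dimension, so the X_predict.shape[1] assertion in predict would raise an IndexError without it.

```python
x_test = np.array([12.08, 3.3])
x_test.shape                 # (2,)   -- 1-D, a bare point
x_test.reshape(1, -1).shape  # (1, 2) -- 2-D: one sample, two features
```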

In a Jupyter notebook, we can run the program with the magic command %run:

```python
%run kNN_sklearn.py
```

This runs the kNN_sklearn.py program directly, after which we can use the kNNClassifier class: pass in the parameter k = 3 and create an instance named kNN_classify.

```python
kNN_classify = kNNClassifier(3)
```

Then pass the sample set X_train, y_train to the instance's fit method:

```python
kNN_classify.fit(X_train, y_train)
```

Once fit is done, pass in the sample to predict, X_predict, and the classification result is obtained:

```python
y_predict = kNN_classify.predict(X_predict)
y_predict

[out]: 1
```

The answer is 1, the same result both methods gave yesterday.

Not difficult, is it?

Going a step further: what if we want to predict not just one point at a time, but several, for example the two points below?

[Figure: the two new points to classify, plotted against the training samples]

Can the model give classification results for several points at once? Of course it can; we only need a slight modification to the package above. Modify the predict function as follows:

```python
def predict(self, X_predict):
    # the list comprehension stores every classification result in a list
    y_predict = [self._predict(x) for x in X_predict]
    return np.array(y_predict)

def _predict(self, x):  # _predict is a private method
    assert self._X_train is not None
    assert self._y_train is not None

    distances = [sqrt(np.sum((x_train - x)**2)) for x_train in self._X_train]
    sort = np.argsort(distances)
    topK = [self._y_train[i] for i in sort[:self.k]]
    votes = Counter(topK)
    y_predict = votes.most_common(1)[0][0]
    return y_predict
```

Here we define two methods: predict uses a list comprehension to collect the multiple predicted values, each computed by the _predict method. The leading underscore on _predict marks it as private to the class: it is used internally only, and outside callers shouldn't touch it because there is no need to.

With the algorithm written, we only need to pass in multiple prediction samples; here we pass two:

```python
X_predict = np.array([[12.08, 3.3],
                      [12.8 , 4.1]])
```

Output the predictions:

```python
y_predict = kNN_classify.predict(X_predict)
y_predict

[out]: array([1, 0])
```

See, two values come back: the first sample is classified as 1, i.e. Cabernet Sauvignon, and the second as 0, i.e. Pinot Noir. Both match the actual results. Perfect.
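As a final cross-check (my addition), we can run Sklearn's own KNeighborsClassifier on the same data and confirm that it agrees:

```python
from sklearn.neighbors import KNeighborsClassifier

# cross-check our hand-written classifier against Sklearn's on the same data
sk_clf = KNeighborsClassifier(n_neighbors=3)
sk_clf.fit(X_train, y_train)
sk_clf.predict(X_predict)  # expected to match: array([1, 0])
```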

At this point we've written a kNN algorithm following the style of Sklearn's packages. But the kNN implementation inside Sklearn is far more complex than this, because there is much more to consider, for example handling a major drawback of kNN: it is computationally expensive. Put simply, kNN's running time depends heavily on the number of samples and the dimensionality of the features, and it grows quickly as the dimension gets high. We'll cover the specific reasons and the ways to improve this in a follow-up.
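To make that cost concrete, here is a rough sketch (my addition, with arbitrary made-up sizes): a single brute-force kNN query does O(n x d) work, so its time grows with both the sample count n and the feature dimension d.

```python
import time
import numpy as np

# timing one brute-force distance computation for growing n and d
# (sizes are arbitrary illustrations, not from the original post)
for n, d in [(1_000, 10), (10_000, 10), (10_000, 100)]:
    X = np.random.rand(n, d)
    x = np.random.rand(d)
    t0 = time.perf_counter()
    np.sqrt(np.sum((X - x)**2, axis=1))  # one query: O(n * d) work
    print(f"n={n}, d={d}: {time.perf_counter() - t0:.4f}s")
```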


Source: blog.csdn.net/zhoulei124/article/details/91042929