KNN algorithm - sklearn library implementation

KNN algorithm

1. Definition 
       In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. The "k neighbors" are the k nearest neighbors, meaning that each sample can be represented by the k samples closest to it.
2. The core idea of k-NN
        The core idea of the k-NN algorithm is that if the majority of the k samples nearest to a given sample in feature space belong to a certain category, then that sample also belongs to this category and shares the characteristics of those neighbors. (Given a data point to be classified, find the k training samples closest to it by distance; these k samples then vote on the point's category: the minority yields to the majority.)
3. Algorithm process

  • Calculate the distance between each training sample and the data to be classified.

  • Select the k samples that are closest to the data to be classified.

  • Count the categories of the k samples and take the one that the majority belongs to; record it as category.

  • The data to be classified is assigned to category (a minimal code sketch of these steps follows the list).
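A minimal from-scratch sketch of the steps above, assuming Euclidean distance; the function name knn_classify and the default k=3 are illustrative choices, not part of the original article:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Classify one sample by majority vote among its k nearest neighbors."""
    # Step 1: distance from x_new to every training sample (Euclidean).
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k closest training samples.
    nearest = np.argsort(distances)[:k]
    # Steps 3-4: the majority label among those k samples wins.
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]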

4. Algorithm advantages and disadvantages

(1) Advantages

  • Simple, easy to understand, easy to implement.

  • Good for classifying rare events.

  • Suitable for multi-class problems (including multi-modal cases, where objects carry multiple class labels); for these, k-NN tends to be more suitable than SVM.

(2) Disadvantages

  • When the classes are imbalanced, e.g. one class has many samples while the others have few, the k nearest neighbors of a new sample may be dominated by the large class simply because that class has more points, even when those points are not actually the most similar. Because the algorithm only counts votes among the "nearest" neighbors, sheer class size can sway the result; weighting each vote by closeness (giving nearer neighbors larger weights) is a common remedy, as sketched after this list.

  • The amount of calculation is large, because for each sample to be classified, the distance to all known samples must be computed in order to find its k nearest neighbors.
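To soften the class-imbalance problem described above, one common remedy is distance-weighted voting, which sklearn's KNeighborsClassifier supports through its weights parameter; a minimal sketch (the choice of n_neighbors=5 here is illustrative):

from sklearn.neighbors import KNeighborsClassifier

# Each neighbor's vote is weighted by the inverse of its distance, so a
# few genuinely close samples can outvote many far-away samples from an
# oversized class.
clf = KNeighborsClassifier(n_neighbors=5, weights='distance')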

Using the Iris data set as an example of the k-NN algorithm

from sklearn import datasets, neighbors

# Load the Iris data set: 150 samples, 4 features, 3 classes.
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Fit a k-NN classifier with default settings (k = 5, uniform votes).
kn_clf = neighbors.KNeighborsClassifier()
kn_clf.fit(X, y)

# Predict on the training data and compare with the true labels.
kn_y = kn_clf.predict(X)
print(kn_y)
print(y)

# Fraction of correctly classified training samples.
accuracy_knn = (kn_y == y).astype(int).mean()
print(accuracy_knn)
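Note that the accuracy above is measured on the same data the classifier was trained on, so it is optimistic. A more honest estimate holds out a test set; a minimal sketch using sklearn's train_test_split (the split ratio and random_state are illustrative choices):

from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Keep 30% of the samples aside so accuracy is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set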
[Figure: output of kn_y and y]

[Figure: accuracy accuracy_knn]

