Principles of machine learning -- the k-nearest neighbors (KNN) algorithm and its advantages and disadvantages

1. KNN algorithm theory

  K-nearest neighbors (KNN) is one of the most basic methods in machine learning.

  The basic idea is: given a training set whose samples and labels are known, take an input test sample, compare its features against the features of every training sample, find the K training samples most similar to it, and assign the test sample the category that appears most frequently among those K samples.

  Description of the KNN algorithm (a from-scratch sketch follows the steps):

    (1) compute the distance between the test sample and each training sample;

    (2) sort the distances in increasing order;

    (3) select the K points with the smallest distances;

    (4) count how frequently each category occurs among these K points;

    (5) return the most frequent category among the K points as the predicted category of the test sample.
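
  As a minimal sketch of these five steps in plain NumPy (the function name knn_predict, the choice of Euclidean distance, and the default k=5 are illustrative assumptions, not from the original post):

import numpy as np

def knn_predict(X_train, y_train, x_test, k=5):
    # (1) distance from the test sample to every training sample (Euclidean)
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # (2) and (3): sort by increasing distance and keep the k closest points
    nearest = np.argsort(distances)[:k]
    # (4) count how often each category appears among the k neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    # (5) return the most frequent category as the prediction
    return labels[np.argmax(counts)]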

  Advantage of the algorithm: training time is zero (the model simply stores the training set and defers all computation to prediction).

  Disadvantage of the algorithm: the computational cost is high. Every sample to be classified requires computing its distance to all known samples in order to find its K nearest neighbors (a common mitigation is sketched below).
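
  One common mitigation for this cost (a general practice, not something the original post applies) is to index the training set with a space-partitioning tree so that neighbor search avoids a brute-force scan; scikit-learn's KNeighborsClassifier exposes this through its algorithm parameter:

from sklearn.neighbors import KNeighborsClassifier

# 'kd_tree' builds a k-d tree over the training data instead of
# computing distances to every sample at prediction time
clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')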

 

2. Code implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs              # make_blobs: generator for clustered data
from sklearn.neighbors import KNeighborsClassifier   # KNeighborsClassifier: k-nearest-neighbor classifier
# sklearn is a Python machine learning toolkit covering four families of algorithms:
# classification, regression, dimensionality reduction, and clustering.
# It also includes modules for feature extraction, data preprocessing, and model evaluation.
# sklearn.datasets: (public) datasets; sklearn.neighbors: nearest neighbors

data = make_blobs(n_samples=5000, centers=5, random_state=8)
# n_samples: total number of samples to generate
# centers: number of cluster centers to generate
# random_state: seed for the random number generator
X, y = data
# X: feature matrix of the generated samples; y: their labels

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.spring, edgecolor='k')
# c: colors; cmap: a colormap instance or name, used only when c is an array of floats

clf = KNeighborsClassifier()
clf.fit(X, y)

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Pastel1)      # shade the decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.spring, edgecolor='k')
plt.title('KNN Classifier')
plt.scatter(6.88, 4.18, marker='*', s=200, c='r')   # mark a new sample with a red star
plt.xlim([x_min, x_max])
plt.show()

print('The results of the fitted model are as follows:')
print('=======================')
print('The category of the newly added sample is:', clf.predict([[6.72, 4.29]]))

print("The model's classification accuracy on this dataset is: {:.2f}".format(clf.score(X, y)))

Output:

The results of the fitted model are as follows:
=======================
The category of the newly added sample is: [1]
The model's classification accuracy on this dataset is: 0.96
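
One caveat: clf.score(X, y) above is measured on the same data the model was fit on, so it can flatter the model. A held-out split gives a fairer estimate; here is a minimal sketch using sklearn's train_test_split (the split ratio and random_state below are illustrative choices, not from the original post):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hold out 25% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=8)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
print('Held-out accuracy: {:.2f}'.format(clf.score(X_test, y_test)))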



 
