kNN (Part 1)

This article is a summary of the first week of kNN study in the machine learning group. The main references are:

Machine Learning Stepping Stone: kNN Algorithm (Part 1)

https://mp.weixin.qq.com/s?__biz=MzUyMjI4MzE0MQ==&mid=2247484679&idx=1&sn=aec5259ee503b9b127b79e2a9661205d&chksm=f9cf74edceb8fdfb43e5dcd3279347e99fa6fb7523de2aaf301418eda6b4b14e17c24d671cd8&scene=21#wechat_redirect

Machine Learning Stepping Stone: kNN Algorithm (Part 2)

https://mp.weixin.qq.com/s/vvCM0vWH5kmRfrRWxqXT8Q

 

Application scenarios of the kNN algorithm

kNN (k-Nearest Neighbors) works by finding the k training samples closest to a given sample. The purpose of finding these k neighbors is to assign the sample the label that appears most often among them. Because the k neighbors may themselves span multiple classes, kNN is not merely a binary classifier; it is inherently a multi-class classification algorithm.

Principle of the kNN algorithm

The principle of the kNN algorithm is as follows: for a new sample, compute its distance to every sample in the training set, find the k nearest neighbors, and take a majority vote over their labels.
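To make the principle concrete, here is a minimal from-scratch sketch (an illustration, not the article's original code):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the new sample x to every training sample
    distances = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    # indices of the k nearest training samples
    nearest = np.argsort(distances)[:k]
    # majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]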

 

Using the kNN algorithm with sklearn

from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
print(X.size, y.size)

# random_state is a seed; test_size=0.3 can also be passed to split 70% training / 30% test
# be careful not to mix up the order of the return values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4798)
kNN_clf = KNeighborsClassifier(n_neighbors=3)
kNN_clf.fit(X_train, y_train)

"""
This part of the code is used for prediction and for computing accuracy.
The accuracy logic is very simple: count how many predicted values equal the actual values.
An equal value counts as a correct prediction; otherwise the prediction is a failure.
"""
correct = np.count_nonzero((kNN_clf.predict(X_test) == y_test) == True)
print("Accuracy is: %.3f" % (correct / len(X_test)))
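sklearn also provides a shortcut for the manual accuracy computation above: the classifier's score method returns the mean accuracy on the given test data, so the following line should print the same number.

# equivalent accuracy computed by sklearn itself
print("Accuracy is: %.3f" % kNN_clf.score(X_test, y_test))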

 

Learnings from the kNN algorithm

  • Organizing training and test sample data

When we split the data into a training set and a test set by a fixed ratio, the original dataset may well be ordered. "Machine Learning Stepping Stone: kNN Algorithm (Part 2)" mentions two ways to shuffle the dataset (a usage sketch of both helpers follows after the list).

    • Use NumPy's concatenate function to combine the X and y matrices into one, shuffle the combined matrix, then split the shuffled data by ratio.
def train_test_split_by_concatenate(X, y, split_ratio):
    '''
    Combine X and y into one matrix, shuffle it, then split the shuffled data
    into a training set and a test set by the given ratio.
    :param X: feature data
    :param y: label data
    :param split_ratio: proportion of the data used as the test set
    :return: training features, test features, training labels, test labels
    '''
    tempConcat = np.concatenate((X, y.reshape(-1, 1)), axis=1)
    np.random.shuffle(tempConcat)
    # 4 is the number of feature columns in the iris dataset
    shuffle_X, shuffle_y = np.split(tempConcat, [4], axis=1)
    test_size = int(len(X) * split_ratio)
    X_train = shuffle_X[test_size:]
    y_train = shuffle_y[test_size:]
    X_test = shuffle_X[:test_size]
    y_test = shuffle_y[:test_size]
    return X_train, X_test, y_train, y_test
    • Use numpy.random.permutation to generate an array of random indices (each smaller than len(X)), slice that array proportionally to get the train and test index sets, and use those indices to look up the corresponding X and y data.
def train_test_split_by_shuffle_index(X, y, split_ratio):
    '''
    Generate a random permutation of the indices (each index is smaller than the
    data length), slice the permutation by the given ratio, and use the resulting
    index arrays to look up the corresponding X and y data.
    :param X: feature data
    :param y: label data
    :param split_ratio: proportion of the data used as the test set
    :return: training features, test features, training labels, test labels
    '''
    shuffle_index = np.random.permutation(len(X))
    test_size = int(len(X) * split_ratio)
    test_index = shuffle_index[:test_size]
    train_index = shuffle_index[test_size:]
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]
    return X_train, X_test, y_train, y_test

This kind of dataset shuffling can be applied to a wider range of scenarios.
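A minimal usage sketch of the two helpers above, reusing the iris X and y loaded earlier; the 0.3 test ratio is an arbitrary value chosen for illustration:

# split with the concatenate-and-shuffle helper
X_train, X_test, y_train, y_test = train_test_split_by_concatenate(X, y, 0.3)
print(X_train.shape, X_test.shape, y_train.shape)  # labels come back as an (n, 1) column here

# split with the index-shuffling helper
X_train, X_test, y_train, y_test = train_test_split_by_shuffle_index(X, y, 0.3)
print(X_train.shape, X_test.shape, y_train.shape)  # labels keep their original (n,) shape here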

  • Understanding hyperparameters versus model parameters, and using GridSearch

Hyperparameters: in the sklearn implementation of the kNN algorithm, k is a parameter that must be passed to the algorithm before training; such parameters are called hyperparameters. Similarly, in a recent study of Spark ALS, multiple models can be obtained by specifying different values for ALS's rank and maxIter hyperparameters, and the best model can then be selected according to each model's RMSE on the test set.
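As a rough sketch of that Spark ALS tuning idea (not from the article; the column names, the candidate values for rank and maxIter, and the ratings_df DataFrame are all illustrative assumptions):

from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# ratings_df is a hypothetical DataFrame with columns userId, movieId, rating
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")

# candidate hyperparameter values (illustrative)
param_grid = (ParamGridBuilder()
              .addGrid(als.rank, [5, 10, 20])
              .addGrid(als.maxIter, [10, 20])
              .build())

# pick the model with the lowest RMSE on the held-out validation split
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
tvs = TrainValidationSplit(estimator=als, estimatorParamMaps=param_grid,
                           evaluator=evaluator, trainRatio=0.8)
best_als_model = tvs.fit(ratings_df).bestModel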

sklearn implements grid search over multiple hyperparameters through GridSearchCV.

 
 
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# param_search is a list; each element is a dictionary of candidate parameter values,
# e.g. {"weights": ["uniform"], "n_neighbors": [i for i in range(1, 11)]} is one dictionary
param_search = [
    {"weights": ["uniform"], "n_neighbors": [i for i in range(1, 11)]},
    {"weights": ["distance"], "n_neighbors": [i for i in range(1, 11)], "p": [i for i in range(1, 6)]}
]
knn_clf = KNeighborsClassifier()

grid_search = GridSearchCV(knn_clf, param_search)
# .estimator is simply the unfitted base estimator passed in above (see the note below)
best_knn_clf = grid_search.estimator

'''
Output:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=5, p=2,
weights='uniform')
'''
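Note that grid_search.estimator above is just the unfitted base estimator that was passed in, which is why the output shows the default parameters. To actually run the search and obtain the tuned classifier, the usual sklearn pattern is roughly the following (reusing X_train and y_train from the earlier split; the winning parameters depend on the data and the random split, so no output is shown here):

grid_search.fit(X_train, y_train)            # runs the full grid search (can take a while)
best_knn_clf = grid_search.best_estimator_   # classifier refitted with the best parameter combination
print(grid_search.best_params_)              # the winning parameter combination
print(grid_search.best_score_)               # its mean cross-validated score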

 


Source: www.cnblogs.com/favor-dfn/p/11827661.html