Supervised learning algorithm 1: K-nearest neighbors (KNN)

First, let's explain a few concepts.
 
Machine learning can be divided into supervised learning and unsupervised learning.
 
* Supervised learning: learn a function from a dataset of known classes that can predict or classify new data. The dataset contains both features and target values, i.e., it comes with the "standard answers." Common algorithm types divide into classification and regression. Common classification algorithms: K-nearest neighbors (KNN), naive Bayes, decision trees, random forests, logistic regression, neural networks. Regression is used to predict continuous values such as house prices; common algorithms: linear regression, ridge regression.
 
* Unsupervised learning: the main difference from supervised learning is that the dataset has no human-labeled targets, i.e., there are no standard answers. Common algorithms: clustering, generative adversarial networks (GANs).
 
K-nearest neighbor algorithm
This is one of the simplest machine learning algorithms. Let's look at its definition.
 
Definition: if most of the K samples nearest to a given sample in feature space belong to category A, then that sample also belongs to category A.
In plain terms, your category is determined by your K nearest neighbors. For example, suppose we want to determine which district you are in: A is in Chaoyang District, B is in Haidian District, C is in Haidian District, and D is in Fangshan District, and your distances to A, B, C, and D are 20, 28, 23, and 35 respectively. If K is 3, your three nearest neighbors are A, B, and C; two of these three are in Haidian District, so the K-nearest neighbor algorithm places you in Haidian District. If K is 1, the answer is Chaoyang District instead. This shows that the result of the K-nearest neighbor algorithm is strongly affected by the value of K; in practice K usually does not exceed 20. A quick sketch of this vote is shown below.
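Here is a minimal sketch of that idea in plain Python (a made-up mini-example, not the scikit-learn API): sort the neighbors by distance, keep the nearest K, and take a majority vote.

from collections import Counter

# hypothetical neighbors from the example above: (name, district, distance to you)
neighbors = [('A', 'Chaoyang', 20), ('B', 'Haidian', 28),
             ('C', 'Haidian', 23), ('D', 'Fangshan', 35)]

def knn_vote(neighbors, k):
    # sort by distance and keep the k nearest neighbors
    nearest = sorted(neighbors, key=lambda n: n[2])[:k]
    # majority vote over the nearest neighbors' districts
    districts = [district for _, district, _ in nearest]
    return Counter(districts).most_common(1)[0][0]

print(knn_vote(neighbors, k=3))  # Haidian
print(knn_vote(neighbors, k=1))  # Chaoyang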
 
How does K-nearest neighbors calculate distance?
The Euclidean distance is generally used, given by the following formula.
For example, suppose there are two samples: A(a1, a2, a3) and B(b1, b2, b3).
 
 
The distance between these two samples is:
\sqrt{(a_1-b_1)^{2}+(a_2-b_2)^{2}+(a_3-b_3)^{2}}
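This distance can be computed directly with NumPy (a minimal sketch using made-up sample values):

import numpy as np

# two hypothetical samples with three features each
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

# Euclidean distance: square root of the sum of squared feature differences
distance = np.sqrt(np.sum((a - b) ** 2))
print(distance)  # 7.07..., the same as np.linalg.norm(a - b)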
 
As can be seen from the formula, if feature a1 is weight and feature a2 is height, the two features have different units, and the feature with the larger numeric values (a2) has an outsized influence on the result. The data therefore need to be standardized before the K-nearest neighbor algorithm is used (every algorithm involving distance calculation needs standardization). A brief illustration of standardization follows.
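Here is that sketch (made-up values; StandardScaler is the scikit-learn API also used later in this post). Standardization rescales each feature to zero mean and unit variance so that weight and height contribute comparably to the distance:

import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical samples: column 0 is weight (kg), column 1 is height (cm)
x = np.array([[60.0, 170.0], [75.0, 182.0], [50.0, 158.0]])

# z-score standardization per feature: (x - mean) / standard deviation
scaled = StandardScaler().fit_transform(x)
print(scaled.mean(axis=0))  # each column now has mean (numerically) 0
print(scaled.std(axis=0))   # and standard deviation 1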
 
Advantages and disadvantages of K-nearest neighbors, and application scenarios
  • Advantages: easy to implement; no parameters to estimate; no training required; not sensitive to outliers
  • Disadvantages: heavy computation; high memory consumption; strong dependence on the value of K
  • Suitable for scenarios with small amounts of numerical data
The K-nearest neighbors API in scikit-learn is sklearn.neighbors.KNeighborsClassifier.
 
Below, using scikit-learn's built-in iris dataset, here is a simple demonstration of K-nearest neighbors in scikit-learn:
 
# scikit-learn's built-in datasets live in the datasets module; import the iris dataset
from sklearn.datasets import load_iris
# import the K-nearest neighbors API
from sklearn.neighbors import KNeighborsClassifier
# import the standardization API
from sklearn.preprocessing import StandardScaler
# import the dataset-splitting API
from sklearn.model_selection import train_test_split

# load the data
iris = load_iris()
# split the dataset; train_test_split is explained in detail below
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=None)

train_test_split splits a dataset into a training set and a test set; the training set is used to train the model and the test set is used to validate it. The first parameter is the feature values and the second parameter is the target values. test_size sets the size of the test set: if it is a float, it indicates the proportion of the test set (0.25 is a typical value); if it is an int, it is the number of test samples. The return values are: training-set features, test-set features, training-set targets, and test-set targets (here x denotes features and y denotes targets). random_state is the random state and defaults to None, i.e., not fixed: the data are shuffled randomly, so the test and training sets differ on each run. If it is an int, it identifies a particular sequence of random numbers; for example, setting it to 1 on every run produces the same datasets each time.
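For example (a small standalone sketch), fixing random_state makes the split reproducible across runs:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# the same random_state yields identical splits every time
a_train, a_test, _, _ = train_test_split(iris.data, iris.target, test_size=0.25, random_state=1)
b_train, b_test, _, _ = train_test_split(iris.data, iris.target, test_size=0.25, random_state=1)
print((a_train == b_train).all())  # True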
 
# standardize the features of the training and test sets
std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

# instantiate the K-nearest neighbors estimator; parameters that must be specified
# manually are called hyperparameters; n_neighbors is the K value described above
knn = KNeighborsClassifier(n_neighbors=3)

# fit the model on the training set
knn.fit(x_train, y_train)

# predict with the test-set features; predict returns the predicted target values
y_predict = knn.predict(x_test)

# evaluate with the test-set features and target values
print('Accuracy:', knn.score(x_test, y_test))

knn.score is used to assess the accuracy of the model.
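For a classifier, score is simply the fraction of correct predictions, so the same number can be checked by hand (continuing from the snippet above, reusing y_predict and y_test):

import numpy as np

# knn.score(x_test, y_test) is equivalent to the mean of correct predictions
print('Accuracy:', np.mean(y_predict == y_test))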
 
Output:
 
Accuracy: 0.9473684210526315
 
That covers the basic use of KNN in scikit-learn, but the model built above is quite limited. As noted earlier, KNN's results are strongly affected by the value of K, and the example above tried only a single value, so we do not know whether this K is the most suitable one; moreover, we used only one split of the data. In general, KNN should be used together with K-fold cross-validation and grid search.
 
What is K-fold cross-validation?
 
The initial sample is divided into K sub-samples. A single sub-sample is held out as validation data for the model, and the other K-1 sub-samples are used for training. Cross-validation is repeated K times, so that each sub-sample is used exactly once for validation; the K results are then averaged (or combined in some other way) to obtain a single estimate. The advantage of this method is that randomly generated sub-samples are used repeatedly for both training and validation, each round producing a validation result. 10-fold cross-validation is the most common.
Take 5-fold cross-validation as an example: all available data are divided into five groups; in each iteration one group is selected as the validation set and the remaining four groups as the training set, iterating through all five groups. The benefit of cross-validation is that every data point gets a chance to be used for both training and validation, and the resulting performance estimate is more credible, helping to avoid overfitting. A sketch using scikit-learn follows.
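As a small sketch of this idea, scikit-learn exposes K-fold cross-validation directly through cross_val_score (here 5-fold on standardized iris data, with an assumed K of 3 for the classifier):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x = StandardScaler().fit_transform(iris.data)

# 5-fold cross-validation returns 5 scores, one per held-out fold
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), x, iris.target, cv=5)
print(scores, scores.mean())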
 
What is a grid search?
 
Grid search traverses the given combinations of parameter values and selects the combination that yields the best model.
 
The grid search API is sklearn.model_selection.GridSearchCV, which also performs cross-validation.
 
Optimizing the example above with grid search looks like this:
 
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# load the data
iris = load_iris()

# split the dataset
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=None)

# standardization
std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

knn = KNeighborsClassifier()

# parameter values to search over
param_grid = {"n_neighbors": [3, 5, 10]}

# grid search
gc = GridSearchCV(knn, param_grid=param_grid, cv=2)

 

The first parameter of GridSearchCV is the estimator object; param_grid is the grid of parameters to search; cv specifies how many folds of cross-validation to use, here 2-fold.

# train the model
gc.fit(x_train, y_train)

print('The best model selected is:', gc.best_estimator_)
print('Accuracy on the test set:', gc.score(x_test, y_test))

 

Output:

The best model selected is: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')
Accuracy on the test set: 0.9473684210526315

 

From this output we can see that the best model found has n_neighbors = 5.
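Besides best_estimator_, GridSearchCV also records the best parameter combination and its mean cross-validation score, which can be read off directly (continuing from the example above):

# best parameter combination found by the search
print('Best parameters:', gc.best_params_)
# mean cross-validated score achieved by that combination
print('Best cross-validation score:', gc.best_score_)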
 
To summarize, this post introduced:
1. The idea behind K-nearest neighbors and how to use it;
2. Model tuning: grid search and cross-validation.
