Using cross-validation (Cross Validation) for model assessment

  • Scikit-learn's default cross-validation is k-fold cross-validation (K-fold cross validation): it splits the dataset into k parts, then uses those k subsets in turn to train and score the model.

1. K-fold cross-validation (K-fold cross validation)

############### Use cross-validation to evaluate the model ###############
# import the wine dataset
from sklearn.datasets import load_wine
# import the cross-validation tool
from sklearn.model_selection import cross_val_score
# import the support vector machine classifier
from sklearn.svm import SVC
# load the wine dataset
wine = load_wine()
# set the SVC kernel function to linear
svc = SVC(kernel='linear')
# score the SVC using cross-validation
scores = cross_val_score(svc, wine.data, wine.target, cv=3)
# print the result
print('Cross-validation score: {}'.format(scores))
Cross-validation score: [0.83333333 0.95       1.        ]
# use .mean() to obtain the average score
print('Average cross-validation score: {:.3f}'.format(scores.mean()))
Average cross-validation score: 0.928
# set the cv parameter to 6
scores = cross_val_score(svc, wine.data, wine.target, cv=6)
# print the result
print('Cross-validation score:\n{}'.format(scores))
Cross-validation score: 
[0.9        0.93333333 0.96666667 0.86666667 1.         1.        ]
# calculate the average cross-validation score
print('Average cross-validation score: {:.3f}'.format(scores.mean()))
Average cross-validation score: 0.944
# print the wine dataset's classification labels
print('Wine classification labels:\n{}'.format(wine.target))
Wine classification labels: 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
  • If k-fold cross-validation is not stratified, then when the dataset is split it is possible for a subset to contain only a single label, and the score the model gets will be distorted. The advantage of stratified k-fold cross-validation is that it splits each category separately, making sure every subset contains the different categories in proportions consistent with the full dataset.
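To make the difference concrete, here is a small sketch (my own addition, not part of the original example) comparing plain KFold with StratifiedKFold on the wine dataset. Because the wine labels are stored sorted by class, an unshuffled plain k-fold can produce test folds that miss entire classes, while stratified folds always contain all three:

```python
# Compare plain k-fold with stratified k-fold on the wine dataset.
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, StratifiedKFold

wine = load_wine()

# Plain k-fold slices the samples in order; the wine labels are sorted,
# so a test fold can end up with only one or two of the three classes.
for train_idx, test_idx in KFold(n_splits=3).split(wine.data, wine.target):
    print('KFold test fold classes:', sorted(set(wine.target[test_idx])))

# Stratified k-fold keeps the class proportions of the full dataset
# inside every fold, so each test fold contains all three classes.
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(wine.data,
                                                             wine.target):
    print('StratifiedKFold test fold classes:',
          sorted(set(wine.target[test_idx])))
```

This is also why passing an integer cv to cross_val_score with a classifier, as in the example above, already uses stratified folds under the hood.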

2. Shuffle-split cross-validation (shuffle-split cross-validation)

# import the shuffle-split tool
from sklearn.model_selection import ShuffleSplit
# set the number of splits to 10
shuffle_split = ShuffleSplit(test_size=.2, train_size=.7, n_splits=10)
# run cross-validation with the shuffle-split strategy
scores = cross_val_score(svc, wine.data, wine.target, cv=shuffle_split)
# print the cross-validation scores
print('Shuffle-split cross-validation scores:\n{}'.format(scores))
# calculate the average cross-validation score
print('Shuffle-split cross-validation average: {:.3f}'.format(scores.mean()))
Shuffle-split cross-validation scores: 
[0.94444444 0.97222222 0.97222222 0.97222222 0.94444444 0.97222222
 0.97222222 0.97222222 0.94444444 1.        ]
Shuffle-split cross-validation average: 0.967

3. Leave-one-out cross-validation (leave-one-out)

  • Its principle is similar to k-fold cross-validation, except that every single data point takes a turn as the test set, so the number of iterations equals the number of samples. For a small dataset, its score is the most accurate.
# import LeaveOneOut
from sklearn.model_selection import LeaveOneOut
# set the cv parameter to LeaveOneOut
cv = LeaveOneOut()
# run cross-validation again
scores = cross_val_score(svc, wine.data, wine.target, cv=cv)
# print the number of iterations
print('Number of iterations: {}'.format(len(scores)))
# print the mean score
print('Model average score: {:.3f}'.format(scores.mean()))
Number of iterations: 178 
Model average score: 0.955
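As a side check (my own sketch, not from the original text): leave-one-out is equivalent to k-fold cross-validation with k equal to the number of samples, which is why the iteration count above equals the size of the wine dataset:

```python
# Leave-one-out produces one split per sample in the dataset.
from sklearn.datasets import load_wine
from sklearn.model_selection import KFold, LeaveOneOut

wine = load_wine()

# Every leave-one-out split has exactly one sample in its test set,
# so the number of splits equals the number of samples.
loo_splits = list(LeaveOneOut().split(wine.data))
print('leave-one-out splits:', len(loo_splits))
print('samples in dataset:  ', len(wine.data))

# KFold with n_splits equal to the sample count yields the same count.
kf_splits = list(KFold(n_splits=len(wine.data)).split(wine.data))
print('equivalent KFold splits:', len(kf_splits))
```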

To sum up: 

  Why should we use cross-validation?  

  When we split a dataset with train_test_split, the split is random. If at split time the test set happens to hold the samples that are relatively easy to classify (or regress) while the harder ones land in the training set, the model's score will come out high; in the opposite case it will come out low. And we can hardly try every possible random_state. Cross-validation makes up for this shortcoming: it splits the data multiple times, scores each split, and then averages the scores, so the problem described above does not arise.
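A rough illustration of this point (the loop over random_state values 0–9 is my own addition, not from the original text): the score of a single train_test_split moves with the random seed, while cross-validation reports one mean over several splits:

```python
# Show how a single random split's score varies, versus a CV mean.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

wine = load_wine()
svc = SVC(kernel='linear')

# Score the same model on ten different random train/test splits:
# the result depends on which samples happen to land in the test set.
split_scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        wine.data, wine.target, random_state=seed)
    split_scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print('train_test_split scores: min {:.3f}, max {:.3f}'.format(
    min(split_scores), max(split_scores)))

# Cross-validation scores several splits and reports their mean instead.
cv_scores = cross_val_score(svc, wine.data, wine.target, cv=6)
print('cross-validation mean: {:.3f}'.format(cv_scores.mean()))
```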

 

Quoted from the book: "Python Machine Learning in Layman's Language"

Origin www.cnblogs.com/weijiazheng/p/10963882.html