A comparison of several cross-validation methods

The purpose of model evaluation: evaluation tells us how well the current model is trained and how strong its generalization ability is, so that we know whether it can be applied to the problem at hand and, if not, where the problem lies.

train_test_split

In classification problems, we usually split the data into two parts with train_test_split: a training set used to train the model, and a test set used to evaluate it. The model is fitted on the training set with the fit method, then scored on the test set with the score method; the score tells us how well the model is currently trained.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()
X_train,X_test,y_train,y_test = train_test_split(cancer.data,cancer.target,random_state=0)

logreg = LogisticRegression().fit(X_train,y_train)
print("Test set score:{:.2f}".format(logreg.score(X_test,y_test)))

Output:

Test set score:0.96
However, this method has a drawback: only one split is performed, so the result is subject to chance. If, in a particular split, the training set happens to contain mostly easy-to-learn samples while the test set contains mostly hard ones (or vice versa), the final result will be misleading.
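This sensitivity to the split is easy to observe: re-running the same split with different random seeds gives different test scores. A minimal sketch (variable names are illustrative; max_iter is raised only to avoid convergence warnings):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()

# The test score depends on which rows happen to land in the test set:
# each random_state produces a different split and a different score.
scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, random_state=seed)
    logreg = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    scores.append(logreg.score(X_test, y_test))

print("Scores across 5 random splits:", [round(s, 3) for s in scores])
```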

Standard Cross Validation

Cross validation was proposed to address this drawback of evaluating a model through a single train_test_split.
Cross validation: in short, it performs multiple train_test_split-style divisions; for each division, the model is trained and evaluated on different subsets, yielding one evaluation result per division. 5-fold cross-validation, for example, divides the original data set 5 times, trains and evaluates once per division, and obtains 5 evaluation results; the final score is usually the average of these results. In k-fold cross-validation, k is typically 5 or 10.
(figure: standard k-fold cross-validation)
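The split-train-evaluate-average procedure can be sketched by hand with KFold; cross_val_score (used in the demo below) does essentially the same thing internally. Variable names here are illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(cancer.data):
    # Each iteration trains on 4 folds and evaluates on the held-out fold.
    model = LogisticRegression(max_iter=5000)
    model.fit(cancer.data[train_idx], cancer.target[train_idx])
    fold_scores.append(model.score(cancer.data[test_idx], cancer.target[test_idx]))

print("Per-fold scores:", np.round(fold_scores, 3))
print("Mean score:", round(np.mean(fold_scores), 3))
```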

demo:

from sklearn.model_selection import cross_val_score

logreg = LogisticRegression()
scores = cross_val_score(logreg,cancer.data, cancer.target) # cv defaults to 3-fold here; set cv=5 for 5-fold cross-validation.
print("Cross validation scores:{}".format(scores))
print("Mean cross validation score:{:2f}".format(scores.mean()))

output:

Cross validation scores:[0.93684211 0.96842105 0.94179894]
Mean cross validation score:0.949021

Advantages of cross-validation:

  • With train_test_split, the data split is a matter of chance; cross-validation splits the data multiple times, greatly reducing the randomness introduced by any single split. At the same time, through multiple splits and trainings, the model encounters a wider variety of samples, which improves its generalization ability;
  • Cross-validation uses the data more efficiently than train_test_split. train_test_split uses a default train/test ratio of 3:1, while 5-fold cross-validation trains on a 4:1 ratio and 10-fold cross-validation on 9:1. The more data available for training, the more accurate the model tends to be.

Disadvantage:

  • This simple cross-validation method splits the data into folds in order. Imagine a data set with 5 classes stored sorted by class: the first fold would contain only class 0, the second fold only class 1, and so on. The model would then be tested on a class it never saw during training, yielding a very low score, possibly even 0! Other cross-validation methods emerged to avoid this situation.
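This failure mode is easy to reproduce on the iris dataset, whose 150 labels are stored sorted by class: plain 3-fold splitting without shuffling puts one entire class in each test fold, so the model is always tested on a class it never saw during training. A small demonstration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

iris = load_iris()
logreg = LogisticRegression(max_iter=1000)

# Without shuffling, each test fold of the sorted iris labels is a single class.
kfold = KFold(n_splits=3)
scores = cross_val_score(logreg, iris.data, iris.target, cv=kfold)
print("Unshuffled 3-fold scores:", scores)  # every fold scores 0.0

# Shuffling first restores a sensible evaluation.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
print("Shuffled 3-fold scores:",
      cross_val_score(logreg, iris.data, iris.target, cv=kfold))
```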

Stratified k-fold cross validation

Stratified k-fold cross validation: it is a type of cross-validation in which each fold preserves the class proportions of the original data. For example, if the original data has 3 classes in a 1:2:1 ratio, then with 3-fold stratified cross-validation each fold also keeps that 1:2:1 ratio, making the validation result more credible.
Usually the cv parameter only controls the number of folds; to control how the division itself is made (number of folds, whether to shuffle the order, etc.), a KFold or StratifiedKFold object can be constructed and passed as cv.
(figure: standard vs. stratified cross-validation)
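A quick way to see stratification at work is to count the class labels in each test fold. With StratifiedKFold on iris (50 samples per class), every test fold contains roughly a third of each class, 16 or 17 samples per class (a sketch with illustrative variable names):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
skf = StratifiedKFold(n_splits=3)

fold_counts = []
for train_idx, test_idx in skf.split(iris.data, iris.target):
    # np.bincount counts how many samples of each class land in the test fold.
    counts = np.bincount(iris.target[test_idx])
    fold_counts.append(counts)
    print("Test-set class counts:", counts)
```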

demo:

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold,cross_val_score
from sklearn.linear_model import LogisticRegression

iris = load_iris()
print('Iris labels:\n{}'.format(iris.target))
logreg = LogisticRegression()
strKFold = StratifiedKFold(n_splits=3,shuffle=False) # random_state only has an effect when shuffle=True
scores = cross_val_score(logreg,iris.data,iris.target,cv=strKFold)
print("stratified cross validation scores:{}".format(scores))
print("Mean score of stratified cross validation:{:.2f}".format(scores.mean()))

output:

Iris labels:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
stratified cross validation scores:[0.96078431 0.92156863 0.95833333]
Mean score of stratified cross validation:0.95

Leave-one-out Cross-validation

Leave-one-out cross-validation: a special case of cross-validation. As the name suggests, for a sample size of n it sets k = n: n-fold cross-validation is performed, leaving exactly one sample out for validation each time. It is mainly used for small data sets.
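Since k equals the number of samples, LeaveOneOut produces one split per sample, each holding out exactly one observation. A quick check of this property (not part of the original demo):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut

iris = load_iris()
loo = LeaveOneOut()

# One split per sample: 150 splits for the 150 iris samples.
print("Number of splits:", loo.get_n_splits(iris.data))

# Every test set contains exactly one sample.
test_sizes = {len(test_idx) for _, test_idx in loo.split(iris.data)}
print("Test-set sizes seen:", test_sizes)  # {1}
```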

demo:

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut,cross_val_score
from sklearn.linear_model import LogisticRegression

iris = load_iris()
print('Iris labels:\n{}'.format(iris.target))
logreg = LogisticRegression()
loout = LeaveOneOut()
scores = cross_val_score(logreg,iris.data,iris.target,cv=loout)
print("leave-one-out cross validation scores:{}".format(scores))
print("Mean score of leave-one-out cross validation:{:.2f}".format(scores.mean()))

output:

Iris labels:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
leave-one-out cross validation scores:[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.]
Mean score of leave-one-out cross validation:0.95

Shuffle-split cross-validation

Shuffle-split cross-validation offers more flexible control: you can set the number of iterations and the proportions of the training and test sets in each split. The two proportions need not sum to 1, so some samples may appear in neither the training set nor the test set of a given iteration.
(figure: shuffle-split cross-validation)
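With the demo's parameters (train_size=.5, test_size=.4 on the 150 iris samples), each iteration trains on 75 samples, tests on 60, and leaves 15 out entirely. A sketch confirming the sizes (random_state is added only for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit

iris = load_iris()
shufspl = ShuffleSplit(train_size=.5, test_size=.4, n_splits=8, random_state=0)

for train_idx, test_idx in shufspl.split(iris.data):
    left_out = len(iris.data) - len(train_idx) - len(test_idx)
    print("train:", len(train_idx), "test:", len(test_idx), "left out:", left_out)
    break  # the remaining 7 iterations have the same sizes
```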
demo:

from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit,cross_val_score
from sklearn.linear_model import LogisticRegression

iris = load_iris()
shufspl = ShuffleSplit(train_size=.5,test_size=.4,n_splits=8) # 8 iterations
logreg = LogisticRegression()
scores = cross_val_score(logreg,iris.data,iris.target,cv=shufspl)

print("shuffle split cross validation scores:\n{}".format(scores))
print("Mean score of shuffle split cross validation:{:.2f}".format(scores.mean()))

output:

shuffle split cross validation scores:
[0.95      0.95      0.95      0.95      0.93333333 0.96666667
0.96666667 0.91666667]
Mean score of shuffle split cross validation:0.95

