Machine Learning: Cross Validation -- Evaluating Predictive Performance

Learning the parameters of a prediction function and testing it on the same data set is a methodological mistake: the model would score highly on the samples it has already seen but fail to predict anything useful on unseen samples. This situation is called overfitting. To avoid it, it is common practice when performing a supervised learning experiment to hold out part of the available samples as a test set X_test, y_test. Keep in mind that "experiment" is not intended to denote academic use only: even in commercial settings, machine learning usually starts out experimentally.

In scikit-learn, a data set can be quickly and randomly split into a training set and a test set with the train_test_split helper function. Let's load the iris dataset and fit a linear SVM on it:

>>> import numpy as np
>>> from sklearn import cross_validation
>>> from sklearn import datasets
>>> from sklearn import svm

>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))

We can now quickly take out 40% of the data as a test set to test our classifier:

>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)

>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))

>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)                           
0.96...

When evaluating different settings ("hyperparameters") for an estimator, such as the C parameter of an SVM that must be set manually, there is still a risk of overfitting on the test set, because the parameters can be tweaked until the estimator performs optimally. In this way, knowledge about the test set can "leak" into the model, and the computed metrics no longer report on generalization performance. To solve this problem, yet another part of the data set can be held out as a "validation set": training proceeds on the training set, evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation is done on the test set.

However, splitting the original data set into three parts drastically reduces the number of samples that can be used to learn the model, and the results can depend on a particular random choice for the (train_set, validation_set) pair.
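As an illustration (not part of the original example), such a validation set can be carved out with a second call to train_test_split; the 60%/20%/20% proportions below are an arbitrary choice:

>>> # Hold out 20% of the data as the final test set ...
>>> X_trainval, X_test, y_trainval, y_test = cross_validation.train_test_split(
...     iris.data, iris.target, test_size=0.2, random_state=0)
>>> # ... and split the remainder into 60% train / 20% validation.
>>> X_train, X_val, y_train, y_val = cross_validation.train_test_split(
...     X_trainval, y_trainval, test_size=0.25, random_state=0)
>>> X_train.shape, X_val.shape, X_test.shape
((90, 4), (30, 4), (30, 4))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> val_score = clf.score(X_val, y_val)   # tune C against this score; keep X_test untouched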

The procedure that solves this problem is called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, k-fold cross-validation, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k folds:
  1. A model is trained using k-1 of the folds as training data
  2. The resulting model is validated on the remaining fold (i.e. it is used as a test set to compute a performance measure such as accuracy)

Finally, the average of the metric values computed in the loop is the value reported by k-fold cross-validation. This approach can be computationally expensive, but it does not waste too much data (unlike holding out a fixed arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
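As a minimal sketch of this procedure (the fold count, random seed, and use of a linear SVC are arbitrary choices for illustration), the loop can be written by hand with KFold and the scores averaged at the end:

>>> from sklearn.cross_validation import KFold
>>> X, y = iris.data, iris.target
>>> scores = []
>>> for train, test in KFold(len(y), n_folds=5, shuffle=True, random_state=0):
...     # train on k-1 folds, then score on the held-out fold
...     model = svm.SVC(kernel='linear', C=1).fit(X[train], y[train])
...     scores.append(model.score(X[test], y[test]))
...
>>> len(scores)
5
>>> mean_score = np.mean(scores)   # the value reported by 5-fold cross-validation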

3.1.1. Computing cross-validated metrics

The simplest way to use cross-validation is to call the cross_val_score helper function on the estimator and the dataset.

The following example demonstrates how to estimate the accuracy of a linear-kernel SVM on the iris dataset by splitting the data, fitting a model and computing the score five consecutive times (with a different split each time):

>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(
...    clf, iris.data, iris.target, cv=5)
...
>>> scores                                              
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

The mean and 95% confidence interval of the scores are:

>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)

By default, the score computed at each CV iteration is the score method of the estimator. This can be changed by using the scoring parameter:

>>> from sklearn import metrics
>>> scores = cross_validation.cross_val_score(clf, iris.data, iris.target,
...     cv=5, scoring='f1_weighted')
>>> scores                                              
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])

In the case of the iris dataset, the target classes are balanced, so the F1-score and the accuracy are almost equal.

When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategy by default; the latter is used if the estimator derives from ClassifierMixin.
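For example (an illustrative sketch, not part of the original text), passing an explicit StratifiedKFold with 5 folds should reproduce the scores of the cv=5 call above, since the estimator is a classifier:

>>> from sklearn.cross_validation import StratifiedKFold
>>> skf = StratifiedKFold(iris.target, n_folds=5)
>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=skf)
...
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])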

It is also possible to use other cross-validation strategies by passing in a cross-validation iterator, for example:

>>> n_samples = iris.data.shape[0]
>>> cv = cross_validation.ShuffleSplit(n_samples, n_iter=3,
...     test_size=0.3, random_state=0)

>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=cv)
...                                                     
array([ 0.97...,  0.97...,  1.        ])
Data transformation with held-out data

Just as it is important to test a predictor on data held out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations should also be learnt from the training set and then applied to the held-out data for prediction.

>>> from sklearn import preprocessing
>>> X_train, X_test, y_train, y_test = cross_validation.train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> scaler = preprocessing.StandardScaler().fit(X_train)
>>> X_train_transformed = scaler.transform(X_train)
>>> clf = svm.SVC(C=1).fit(X_train_transformed, y_train)
>>> X_test_transformed = scaler.transform(X_test)
>>> clf.score(X_test_transformed, y_test)  
0.9333...

A Pipeline makes it easy to provide this behaviour under cross-validation:

>>> from sklearn.pipeline import make_pipeline
>>> clf = make_pipeline(preprocessing.StandardScaler(), svm.SVC(C=1))
>>> cross_validation.cross_val_score(clf, iris.data, iris.target, cv=cv)
...                                                 
array([ 0.97...,  0.93...,  0.95...])
3.1.1.1. Obtaining predictions through cross-validation

cross_val_predict has an interface similar to cross_val_score, but returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. Only cross-validation strategies that assign all elements to a test set exactly once can be used (otherwise, an exception is raised).

These predictions can be used to evaluate the classifier:

>>> predicted = cross_validation.cross_val_predict(clf, iris.data,
...                                                iris.target, cv=10)
>>> metrics.accuracy_score(iris.target, predicted) 
0.966...

Keep in mind that the result of this computation may differ slightly from the one obtained using cross_val_score, as the elements are grouped in different ways.

3.1.2. Cross-validation iterators

The following sections list tools for splitting datasets based on different cross-validation strategies.

3.1.2.1. K-fold

KFold divides all the samples into k groups of equal size (if k = n, this is equivalent to leave-one-out). The prediction function is then learned using k-1 folds, and the remaining fold is used for testing.

Below is a 2-fold cross-validation with 4 samples:

>>> import numpy as np
>>> from sklearn.cross_validation import KFold

>>> kf = KFold(4, n_folds=2)
>>> for train, test in kf:
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]

Each fold is made of two arrays: the first one is related to the training set, and the second one to the test set. Training/test sets can therefore be created using numpy indexing:

>>> X = np.array([[0., 0.], [1., 1.], [-1., -1.], [2., 2.]])
>>> y = np.array([0, 1, 0, 1])
>>> X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
3.1.2.2. Stratified k-fold

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.

3-fold stratified cross-validation on an unbalanced dataset with 10 samples:

>>> import numpy as np
>>> from sklearn.cross_validation import StratifiedKFold

>>> labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
>>> skf = StratifiedKFold(labels, 3)
>>> for train, test in skf:
...    print("%s %s"%(train,test))
...    print("%s %s" % (labels[train], labels[test]))
...    print("- - - - - - - - - - - - - - - - - - - - - - - - - -")
[2 3 6 7 8 9] [0 1 4 5]
[0 0 1 1 1 1] [0 0 1 1]
- - - - - - - - - - - - 
[0 1 3 4 5 8 9] [2 6 7]
[0 0 0 1 1 1 1] [0 1 1]
- - - - - - - - - - - - 
[0 1 2 4 5 6 7] [3 8 9]
[0 0 0 1 1 1 1] [0 1 1]
- - - - - - - - - - - - 
3.1.2.3. Label k-fold

LabelKFold is a variation of k-fold which ensures that the same label does not appear in both the test and training sets. This is necessary, for example, if your data were obtained from different subjects and you want to avoid overfitting to person-specific features by training and testing on different subjects.

Imagine you have three subjects, associated with the labels 1, 2 and 3 respectively:

>>> from sklearn.cross_validation import LabelKFold

>>> labels = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 3])

>>> lkf = LabelKFold(labels, n_folds=3)
>>> for train, test in lkf:
...    print("%s %s"%(train,test))
...    print("%s %s" % (labels[train], labels[test]))
...    print("- - - - - - - - - - - - ")
[0 1 2 3 4 5] [6 7 8 9]
[1 1 1 2 2 2] [3 3 3 3]
- - - - - - - - - - - - 
[0 1 2 6 7 8 9] [3 4 5]
[1 1 1 3 3 3 3] [2 2 2]
- - - - - - - - - - - - 
[3 4 5 6 7 8 9] [0 1 2]
[2 2 2 3 3 3 3] [1 1 1]
- - - - - - - - - - - - 

Each subject is in a different test fold, and the same subject is never in both the training and the test set. Note that, because of the data imbalance, the folds do not all have exactly the same size.

3.1.2.4. Leave-One-Out - LOO

LeaveOneOut (LOO) is a simple cross-validation: each learning set is created by taking all the samples except one, the test set being the sample left out. Thus, for n samples, we have n different training sets and n different test sets. This cross-validation procedure does not waste much data, since only one sample is removed from each training set:

>>> from sklearn.cross_validation import LeaveOneOut

>>> loo = LeaveOneOut(4)
>>> for train, test in loo:
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2 3] [1]
[0 1 3] [2]
[0 1 2] [3]

Potential users of LOO for model selection should weigh a few known caveats. Compared with k-fold cross-validation, it builds n models from n samples instead of k models, where n > k. Moreover, each model is trained on n - 1 samples rather than (k - 1) * n / k samples. In both respects, assuming k is not too large and k < n, LOO is more computationally expensive than k-fold cross-validation.

In terms of accuracy, LOO often results in a high-variance estimate of the test error. Intuitively, since n - 1 of the n samples are used to build each model, the models constructed from the folds are virtually identical to each other and to the model built from the entire training set.

However, if the learning curve is steep for the training size in question, then 5-fold or 10-fold cross-validation may overestimate the generalization error.

As a general rule, most authors and empirical evidence suggest that 5-fold or 10-fold cross-validation is better than LOO.
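As a rough illustration of the computational cost (reusing the iris data and a linear SVC from the earlier sections), a LeaveOneOut instance can be passed directly as the cv argument of cross_val_score; this fits 150 models, one per held-out sample:

>>> clf = svm.SVC(kernel='linear', C=1)
>>> loo_scores = cross_validation.cross_val_score(
...     clf, iris.data, iris.target, cv=LeaveOneOut(len(iris.target)))
>>> loo_scores.shape   # one 0/1 accuracy value per left-out sample
(150,)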

3.1.2.5. Leave-P-Out - LPO

LeavePOut is very similar to LeaveOneOut in that it creates all possible training/test sets by removing p samples from the complete data set. For n samples, this produces C(n, p) = n! / (p! (n - p)!) training-test pairs. Unlike LeaveOneOut and KFold, the test sets will overlap for p > 1.

Leave-2-Out for 4 samples:

>>> from sklearn.cross_validation import LeavePOut

>>> lpo = LeavePOut(4, p=2)
>>> for train, test in lpo:
...     print("%s %s" % (train, test))
[2 3] [0 1]
[1 3] [0 2]
[1 2] [0 3]
[0 3] [1 2]
[0 2] [1 3]
[0 1] [2 3]
3.1.2.6. Leave-One-Label-Out - LOLO

LeaveOneLabelOut (LOLO) is a cross-validation scheme which holds out the samples according to a third-party provided array of integer labels. This label information can be used to encode arbitrary domain-specific, pre-defined cross-validation folds.

Each training set is thus made of all the samples except those associated with a specific label.

For example, in the case of multiple experiments, LOLO can be used to create cross-validation based on different experiments: we create a training set using samples from all experiments except one:

>>> from sklearn.cross_validation import LeaveOneLabelOut

>>> labels = [1, 1, 2, 2]
>>> lolo = LeaveOneLabelOut(labels)
>>> for train, test in lolo:
...     print("%s %s" % (train, test))
[2 3] [0 1]
[0 1] [2 3]

Another common application is to use temporal information: for example the labels might be collected by year, thus allowing cross-validation for time-based splits.
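For instance (a hypothetical sketch; the years below are invented), the labels could simply be the collection year of each sample:

>>> years = [2012, 2012, 2013, 2013, 2014, 2014]   # hypothetical collection years
>>> for train, test in LeaveOneLabelOut(years):
...     print("%s %s" % (train, test))
[2 3 4 5] [0 1]
[0 1 4 5] [2 3]
[0 1 2 3] [4 5]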

Warning: contrary to StratifiedKFold, the labels of LeaveOneLabelOut should not encode the target classes to be predicted. StratifiedKFold aims to rebalance the data sets created by the split, ensuring that the training/test sets contain approximately the same proportion of each class, whereas LeaveOneLabelOut does the opposite: it ensures that the training and test sets do not share any label.
3.1.2.7. Leave-P-Label-Out

LeavePLabelOut is similar to Leave-One-Label-Out, but removes the samples related to P labels for each training/test set.

An example of Leave-2-Label Out:

>>> from sklearn.cross_validation import LeavePLabelOut

>>> labels = np.array([1, 1, 2, 2, 3, 3])
>>> lplo = LeavePLabelOut(labels, p=2)
>>> for train, test in lplo:
...    print("%s %s" % (train, test))
...    print("%s %s" % (labels[train], labels[test]))
...    print("- - - - - - - - - - - - ")    
[4 5] [0 1 2 3]
[3 3] [1 1 2 2]
- - - - - - - - - - - - 
[2 3] [0 1 4 5]
[2 2] [1 1 3 3]
- - - - - - - - - - - - 
[0 1] [2 3 4 5]
[1 1] [2 2 3 3]
- - - - - - - - - - - - 
3.1.2.8. Random permutations cross-validation a.k.a. Shuffle & Split

The ShuffleSplit iterator will generate a user-defined number of independent train/test data set splits. The samples are first shuffled and then split into a pair of training and test sets.

It is possible to control the randomness for reproducibility of the results by explicitly seeding the random_state pseudo-random number generator.

Here is a usage example:

>>> ss = cross_validation.ShuffleSplit(5, n_iter=3, test_size=0.25,
...     random_state=0)
>>> for train_index, test_index in ss:
...     print("%s %s" % (train_index, test_index))
...
[1 3 4] [2 0]
[1 4 3] [0 2]
[4 0 2] [1 3]

ShuffleSplit is a good alternative to KFold because it has finer control over the number of iterations and the ratio of training set to test set.

3.1.2.9. Label-Shuffle-Split

LabelShuffleSplit behaves as a combination of ShuffleSplit and LeavePLabelOut.

Here is a usage example:

>>> from sklearn.cross_validation import LabelShuffleSplit

>>> labels = np.array([1, 1, 2, 2, 3, 3, 4, 4])
>>> slo = LabelShuffleSplit(labels, n_iter=4, test_size=0.5,
...                         random_state=0)
>>> for train, test in slo:
...    print("%s %s" % (train, test))
...    print("%s %s" % (labels[train], labels[test]))
...    print("- - - - - - - - - - - - ")  
[0 1 2 3] [4 5 6 7]
[1 1 2 2] [3 3 4 4]
- - - - - - - - - - - - 
[2 3 6 7] [0 1 4 5]
[2 2 4 4] [1 1 3 3]
- - - - - - - - - - - - 
[2 3 4 5] [0 1 6 7]
[2 2 3 3] [1 1 4 4]
- - - - - - - - - - - - 
[4 5 6 7] [0 1 2 3]
[3 3 4 4] [1 1 2 2]
- - - - - - - - - - - -  

This is useful when the behaviour of LeavePLabelOut is desired, but the number of labels is so large that generating all possible partitions of P labels would be prohibitively expensive. In such a scenario, LabelShuffleSplit provides a random sample of the train/test splits generated by LeavePLabelOut.

3.1.2.10. Predefined Fold-Splits / Validation-Sets

For some datasets, a pre-defined split of the data into training and validation folds, or into several cross-validation folds, already exists. Using PredefinedSplit it is possible to use these folds, e.g. when searching for hyperparameters.
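A minimal sketch of how PredefinedSplit might be used (the test_fold values below are arbitrary; -1 marks samples that never appear in a test set):

>>> from sklearn.cross_validation import PredefinedSplit
>>> ps = PredefinedSplit(test_fold=[0, 1, -1, 1])
>>> for train, test in ps:
...     print("%s %s" % (train, test))
[1 2 3] [0]
[0 2] [1 3]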

3.1.2.11. StratifiedShuffleSplit

StratifiedShuffleSplit is a variation of ShuffleSplit which returns stratified splits, i.e. splits created by preserving the same percentage of each target class as in the complete set.
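A minimal sketch (the class labels and the split sizes below are arbitrary choices):

>>> from sklearn.cross_validation import StratifiedShuffleSplit
>>> y_classes = np.array([0, 0, 0, 1, 1, 1, 1, 1])
>>> sss = StratifiedShuffleSplit(y_classes, n_iter=3, test_size=0.5, random_state=0)
>>> splits = [(train, test) for train, test in sss]   # each split preserves the class proportions of y_classes
>>> len(splits)
3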

3.1.3. A note on shuffling

If the ordering of the data is not arbitrary (for example, samples with the same class label are contiguous), shuffling it first may be essential to get a meaningful cross-validation result. However, the opposite may be true if the samples are not independently and identically distributed. For example, if the samples correspond to news articles and are ordered by publication time, then shuffling the data will likely lead to an overfit model and an inflated validation score: the model would be tested on samples that are artificially similar (close in time) to the training samples.

Some cross-validation iterators, such as KFold, have a built-in option to shuffle the data indices before splitting them (see the sketch after this list). Note that:
  1. This consumes less memory than shuffling the data directly.
  2. By default no shuffling occurs, including for the (stratified) K-fold cross-validation performed by specifying cv=some_integer to cross_val_score, grid search, etc. Keep in mind that train_test_split still returns a random split.
  3. The random_state parameter defaults to None, meaning that the shuffling will be different every time KFold(..., shuffle=True) is iterated. However, GridSearchCV will use the same shuffling for each set of parameters validated by a single call to its fit method.
  4. To ensure that results are reproducible on the same platform, use a fixed value for random_state.
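A small sketch of the last two points, assuming the same KFold API used earlier in this article: with a fixed random_state, the shuffled folds come out identical every time the splitter is re-created:

>>> from sklearn.cross_validation import KFold
>>> folds_a = [test for train, test in KFold(4, n_folds=2, shuffle=True, random_state=0)]
>>> folds_b = [test for train, test in KFold(4, n_folds=2, shuffle=True, random_state=0)]
>>> all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))
True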

3.1.4. Cross validation and model selection

Cross-validation iterators can also be used for model selection using a grid search for optimal model parameters. See Grid Search: Searching for estimator parameters.
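As a rough sketch (staying with the older sklearn.cross_validation / sklearn.grid_search API used throughout this article; the C grid is an arbitrary choice), a cross-validation iterator can be passed to GridSearchCV through its cv parameter:

>>> from sklearn.grid_search import GridSearchCV
>>> cv = cross_validation.StratifiedKFold(iris.target, n_folds=5)
>>> search = GridSearchCV(svm.SVC(kernel='linear'),
...                       param_grid={'C': [0.1, 1, 10]}, cv=cv)
>>> best_C = search.fit(iris.data, iris.target).best_params_['C']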
