(Deep Learning) Cross Validation

Cross Validation

        Cross validation is a statistical method for evaluating a model's generalization performance. It is more stable and comprehensive than a single split into a training set and a test set.

        Cross-validation not only alleviates the problem of limited data, but also helps with model parameter tuning.

There are three main methods of cross-validation:

1. Simple Cross Validation
        Simple cross validation randomly divides the original data set into two parts: a training set (Train Set) and a test set (Test Set).
        For example, split the original data samples in a 7:3 ratio: 70% of the samples are used to train the model, and 30% are used to test the model and its parameters.
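A minimal sketch of such a 7:3 simple cross-validation split, assuming scikit-learn and using the iris data as a stand-in for a generic (X, y) data set:

```python
# Simple cross validation: one random 7:3 split into train and test.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 70% of the samples for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```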

Shortcomings:

        (1) Each data sample is used only once, so the data are not fully utilized.

        (2) The evaluation metric obtained on the test set can depend heavily on how the data happen to be split.

 

2. K-fold Cross Validation

        In order to address the shortcomings of simple cross-validation, K-fold cross-validation was proposed.

The K-fold cross-validation procedure is:

        (1) First divide all samples into K subsets of equal size.

        (2) Iterate over the K subsets: each time, the current subset is used as the validation set, and all the other samples are used as the training set for model training and evaluation.

        (3) Finally, the average of the K evaluation metrics is used as the final metric. In practice, K is usually set to 10.

For example, when K is 10, K-fold cross-validation proceeds as follows:

        (1) First divide the original data set into 10 parts, each containing D data samples.

        (2) Each time, one part is used as the test set and the remaining 9 parts (i.e., K-1 parts) are used as the training set, so the training set contains (K-1)*D samples.

        (3) Finally, average the K evaluation results to estimate the true performance of the model (the hypothesis function).
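A rough illustration of this K = 10 procedure, assuming scikit-learn and again using iris as a stand-in data set; each fold serves once as the test set and the K scores are averaged:

```python
# 10-fold cross validation: each fold serves once as the test set,
# the other 9 folds form the training set, and the 10 scores are averaged.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("mean accuracy over 10 folds:", np.mean(scores))
```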


3. Leave-one-out Cross Validation

        Leave-one-out cross-validation is a special case of K-fold cross-validation in which K equals the number of samples N: each time, N-1 of the N samples are used for training, and the single remaining sample is used to validate the model's prediction.

        Leave-one-out cross-validation is mainly used when the sample size is very small; for ordinary problems it is typically applied when N is less than 50.

        The method is simple and makes convincing use of the data, but keep in mind that when the total number of samples is large, the time cost of the leave-one-out approach is extremely high.
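A minimal leave-one-out sketch, assuming scikit-learn; the data set is deliberately cut down to 30 samples, since the method only makes sense for small N:

```python
# Leave-one-out cross validation: with N samples there are N folds,
# each holding out a single sample for validation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

X, y = load_iris(return_X_y=True)
X, y = X[::5], y[::5]          # keep 30 samples to mimic a tiny data set

loo = LeaveOneOut()
correct = []
for train_idx, test_idx in loo.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    correct.append(model.score(X[test_idx], y[test_idx]))

print("LOOCV accuracy over", len(correct), "folds:", np.mean(correct))
```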

The following content is reproduced from Zhihu:

One: Cross Validation

        Before K-fold cross-validation, the most commonly used validation method was simple cross-validation, which divides the data into a training set (Train Set), a validation set (Validation Set) and a test set (Test Set), typically in a 6:2:2 ratio. However, drawing the samples reasonably is the hard part of using cross-validation: different sampling schemes can lead to completely different training performance. Moreover, since the validation set and test set do not participate in training, a large amount of data cannot be used for learning, which clearly degrades the training result.

Two: K-fold cross-validation

        Divide the training set (Train Set) into K parts, use K-1 of them to train the model, use the remaining part for testing, and finally take the average test error as the generalization error. The advantage is that every sample in the training set gets to be test data as well as training data, so K-fold cross-validation makes better use of the training set.

        In K-fold cross-validation, the larger K is, the more reliable the averaged error is as an estimate of the generalization error, but the time spent on K-fold cross-validation also grows roughly linearly with K.
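For reference, this averaged-error idea can be written in a single call; the sketch below assumes scikit-learn's cross_val_score and simply shows that a larger K means more model fits:

```python
# The K-fold average score in one call; increasing K increases the
# number of model fits roughly linearly.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
for k in (5, 10):
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=k)
    print(f"K={k}: mean accuracy {scores.mean():.3f} ({len(scores)} fits)")
```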

Three: Existing problems

        The above is all from the book, but I found a problem at this step: is it necessary to split off a training set (Train Set) and a test set (Test Set) before performing the K-fold?

        If you do split a training set (Train Set) and a test set (Test Set) when training on a public data set (e.g., when running a paper's experiments), your results may be worse than others' on the same network and the same data set, because you use only 80% of the data for training, whereas without the split all of the data could be used.

        If you do split off a test set, what do you do with small data sets? The data set itself may contain only a small amount of data; if 20% of it is reserved for testing, the data left for training is even more insufficient.

        If you do not split off a test set (Test Set) and run the K-fold directly on all the data, parameters such as the number of network layers and the learning rate (Learning Rate) are easy to set, but how do you decide the number of training epochs (Epoch), that is, when to stop learning? You cannot choose the best epoch on the test set, because that would leak information to the model. And what if, in the end, you also want to select the best model?

Four: Feasible K-fold validation schemes for different situations

Case 1: Large-scale data

        Use simple cross validation directly, without K-folds. Because the data scale is large, even with a 6:2:2 training set (Train Set) / validation set (Validation Set) / test set (Test Set) split, 60% of the data is sufficient to represent the distribution of all the data.

        For example, suppose we want to estimate empirically the probability of each face when throwing a die. If you have run a million independent trials, then even using only 600,000 of them is enough to obtain a convincing estimate that each face occurs with probability 1/6.
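A sketch of such a 6:2:2 split, assuming scikit-learn and done as two successive random splits (the ratios are the ones quoted above, not a library default):

```python
# 6:2:2 train/validation/test split, built from two random splits.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then take 25% of the remaining 80% as the validation set (0.8 * 0.25 = 0.2).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%
```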

Case 2: Small and medium-sized data

1: The scheme used in industry: first split the training set (Train Set) and the test set (Test Set). Run K-fold on the training set; for each fold, take the model with the smallest error on that fold's validation set (because the training set and test set were split in advance, what the book calls the test set within the K-fold is what I call the validation set (Validation Set) here), evaluate it on the test set, and record the test error. The performance of the final model is the average, over the folds, of the selected models' errors on the test set.

(Why the model with the smallest error on each fold's validation set? Because before training we do not know how many epochs the algorithm needs to reach its best result, so my idea is to let it run as long as possible, select the model that performs best on the validation set, and then evaluate that selected model on the test set.)
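A toy sketch of this scheme, assuming scikit-learn; the small grid over the regularization parameter C is a hypothetical stand-in for "train as long as possible and keep the best checkpoint":

```python
# Hold out a test set first, run K-fold on the training portion, pick
# the candidate that does best on each fold's validation split, and
# average the chosen models' test scores.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

test_scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_train):
    best_model, best_val = None, -np.inf
    for C in (0.01, 0.1, 1.0, 10.0):            # candidate models per fold
        m = LogisticRegression(C=C, max_iter=1000)
        m.fit(X_train[train_idx], y_train[train_idx])
        val = m.score(X_train[val_idx], y_train[val_idx])
        if val > best_val:
            best_model, best_val = m, val
    test_scores.append(best_model.score(X_test, y_test))

print("mean test accuracy of the per-fold best models:", np.mean(test_scores))
```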

 

[Figure: schematic diagram of the workflow for small and medium-scale data sets in the commercial setting described above]

PS: After this split, the training set may not have enough data, but in a company project you must select a suitable model to deploy, and you cannot select a suitable model without first splitting the training set and the test set!

2: The paper-experiment scheme: if you split a training set (Train Set) and a test set (Test Set) in a paper experiment, there is a persuasiveness problem: how do you guarantee that the test set you chose is not a carefully selected collection of easy-to-judge examples? Therefore, when you do not need to select a best model but only need to evaluate a method, you can run K-fold directly on all the data. The advantage is that the more data you use, the better the model's performance, and the test error will be closer to the generalization error.

But doing this raises the question of the number of training iterations, namely: when do you stop the learning process?

If you only split a training set (Train Set) and a test set (Test Set), you have two possible approaches:

        (1) Select the model that performs best on the test set, which leaks the test set's data distribution into training.

        (2) Fix the number of iterations, which raises the question of how to choose that number.

        So here is my approach: run K-fold on the entire data set. From each fold's training set (Train Set), extract a small part, say 5%, as the validation set (Validation Set); put the model that performs best on the validation set onto that fold's test set (Test Set) for testing; then repeat this K times. The generalization error is approximately the average of the K test errors.

This approach has two advantages:

        (1) Every sample appears in a test set once, so there is no problem of insufficient persuasiveness.

                (I have tested on all of them; no one can claim I deliberately chose the easiest examples to evaluate!)

        (2) The amount of data in the training set is not significantly reduced.

                (Only a small part of each fold's training set is split off as a validation set.) The measured performance will therefore be closer to the model's true generalization error.
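A toy sketch of this scheme, assuming scikit-learn; the max_iter grid is a hypothetical stand-in for choosing when to stop training:

```python
# K-fold over the whole data set; inside each fold, 5% of the training
# part is split off as a validation set used only to pick the best
# candidate, and the chosen model is scored on that fold's test part.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = load_iris(return_X_y=True)
test_scores = []

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # 5% of this fold's training data becomes the validation set.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[train_idx], y[train_idx], test_size=0.05, random_state=0)

    best_model, best_val = None, -np.inf
    for max_iter in (20, 100, 500):   # stand-in for training epochs
        # (small max_iter may warn about non-convergence; fine for a sketch)
        m = LogisticRegression(max_iter=max_iter).fit(X_tr, y_tr)
        val = m.score(X_val, y_val)
        if val > best_val:
            best_model, best_val = m, val

    test_scores.append(best_model.score(X[test_idx], y[test_idx]))

print("generalization estimate (mean of K test scores):", np.mean(test_scores))
```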

References:

  1. K-fold cross-validation, *Snowgrass* Blog, CSDN
  2. K-fold cross-validation (remember a pit), bzdww
  3. http://t.csdn.cn/8hgXy


Source: blog.csdn.net/qq_40728667/article/details/130210328