A Summary of Cross-Validation (Cross Validation) Principles

I'm just reposting this; the original is by Liu Jianping Pinard: https://www.cnblogs.com/pinard/p/5992719.html


  Cross-validation is a method used in machine learning to tune model parameters and validate the model.
1. Definition : Cross-validation, as its name suggests, reuses the data: the sample data is split into different combinations of training and test sets, the training set is used to train the model, and the test set is used to evaluate how well the model predicts. On this basis we obtain multiple different training and test sets, and a sample that falls in the training set in one round may fall in the test set in the next round, hence the name "cross."

2. Application : When do we need cross-validation? Cross-validation is used when the data is not very plentiful.

  • (1) If the sample size is less than ten thousand, we use cross-validation to train and select the model.
  • (2) If the sample size is greater than ten thousand, we generally split the data randomly into three parts: a training set (Training Set), a validation set (Validation Set), and a test set (Test Set). The training set is used to train the model, the validation set is used to evaluate the prediction quality of the model and to select the model and its corresponding parameters, and the selected model is finally run on the test set to make the final decision on the model and its parameters. A minimal sketch of such a three-way split follows below.
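The original post does not prescribe exact ratios or a library, so the following sketch uses scikit-learn's `train_test_split` and an illustrative 60/20/20 split, both of which are assumptions:

```python
# A minimal sketch of the three-way split described above.
# The 60/20/20 ratios and the synthetic dataset are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, random_state=0)

# First carve out the test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 12000 4000 4000
```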

3. Categories : Depending on how the data is split, cross-validation falls into the following three types:

  • The first is simple cross-validation , "simple" being relative to the other kinds of cross-validation. First, we randomly split the sample data into two parts (for example, 70% as the training set and 30% as the test set), then train the model on the training set and validate the model and its parameters on the test set. Next, we shuffle the samples, pick a new training set and test set, and continue training and testing the model. Finally, we choose the model and parameters with the best loss-function evaluation.

  • The second is S-fold cross-validation (S-Fold Cross Validation). Unlike the first method, S-fold cross-validation randomly splits the sample data into S parts; in each round, S-1 parts are randomly chosen as the training set and the remaining part serves as the test set. When a round is finished, another S-1 parts are randomly chosen to train the model. After several rounds (fewer than S), the model and parameters with the best loss-function evaluation are chosen. A sketch of both S-fold and leave-one-out cross-validation follows after this list.

  • The third is leave-one-out cross-validation (Leave-one-out Cross Validation ), a special case of the second in which S equals the sample size N. For N samples, each round trains on N-1 samples and leaves one sample out to validate the quality of the model's predictions. This method is mainly used when the sample size is very small; for an ordinary problem of moderate size, I usually use leave-one-out cross-validation when N is less than 50.
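As a concrete illustration of the second and third types, here is a hedged sketch using scikit-learn's `KFold` (playing the role of S-fold, with S = `n_splits`) and `LeaveOneOut`; the logistic-regression model and the synthetic data are assumptions for illustration only:

```python
# S-fold and leave-one-out cross-validation with scikit-learn (illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut

X, y = make_classification(n_samples=100, random_state=0)
model = LogisticRegression(max_iter=1000)

# S-fold: each round trains on S-1 folds and tests on the remaining fold.
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print("5-fold mean accuracy:", np.mean(scores))

# Leave-one-out: S equals the number of samples N.
loo_scores = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model.fit(X[train_idx], y[train_idx])
    loo_scores.append(model.score(X[test_idx], y[test_idx]))
print("LOO mean accuracy:", np.mean(loo_scores))
```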

4. Selection : By repeating cross-validation and using the loss function to measure the quality of the resulting models, we end up with a better model. So among these three types, which one should we choose? In one sentence: if we are only building a preliminary model on the data and do not need an in-depth analysis, simple cross-validation is enough; otherwise, use S-fold cross-validation. When the sample size is small, use the special case of S-fold cross-validation, leave-one-out cross-validation. A sketch of model selection with S-fold cross-validation follows below.
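A minimal sketch of using S-fold cross-validation scores to compare candidate models, assuming scikit-learn's `cross_val_score`; the two candidate models and S = 5 are illustrative assumptions:

```python
# Compare candidate models by their mean 5-fold cross-validation score.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # one score per fold
    print(name, scores.mean())
```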

  There is also a special kind of cross-validation for when the sample size is small, called the bootstrap method (bootstrapping). Suppose we have m samples (m is small). Each time, we randomly draw one of the m samples into the training set and then put it back. Repeating this draw m times, we obtain a training set of m samples; of course, some of these m samples will likely be duplicates. Meanwhile, we use the samples that were never drawn as the test set, and then proceed with cross-validation as before. Because the training set contains duplicate data, which changes the data distribution, the training results carry estimation bias; this method is therefore not very common, and is used only when the amount of data really is very small, for example fewer than 20 samples. A minimal sketch of bootstrap sampling follows below.
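A minimal sketch of the bootstrap draw described above, using NumPy; m = 15 and the stand-in data are assumptions for illustration:

```python
# Bootstrap sampling: draw m samples with replacement for training,
# and use the never-drawn ("out-of-bag") samples for testing.
import numpy as np

rng = np.random.default_rng(0)
m = 15                # a small sample, as in the text
X = np.arange(m)      # stand-in for m data points

boot_idx = rng.integers(0, m, size=m)           # m draws with replacement
oob_idx = np.setdiff1d(np.arange(m), boot_idx)  # samples never drawn

print("training set (may contain repeats):", X[boot_idx])
print("out-of-bag test set:", X[oob_idx])
```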


(You are welcome to reprint this, but please credit the source, Liu Jianping Pinard: https://www.cnblogs.com/pinard/p/5992719.html)

