[Machine Learning] Cross-validation (Cross Validation)

What is cross-validation (Cross Validation)?

Cross-validation is a model validation technique used to assess how well the results of a statistical analysis (a model) generalize to an independent data set. It is mainly used in prediction problems, where it lets us estimate how accurately a predictive model will perform in practice.

What is generalization (Generalization) ability?

This was covered in "[Machine Learning] Regression (Regression)"; here is a brief recap:

In plain terms, generalization refers to a model's ability to make predictions on data it has not seen during training.

The purpose of cross-validation (Cross Validation)

In practice, a trained model usually fits its training data well but fits data outside the training set noticeably worse. Therefore, to obtain a reliable and stable model, cross-validation can be used when building a predictive model to get an accurate measure of how the model will perform on real data.

The basic idea of cross-validation (Cross Validation)

Split the original data in some way: one part becomes the training set (Training Set), one part becomes the validation set (Validation Set), and one part becomes the test set (Testing Set).

  • Training set

    The training set is used to train the model, that is, to determine the model's parameters.

  • Validation set

    The validation set is used for model selection; more specifically, it is used to choose the parameters that are not determined during training (i.e., not learned in the model training process). The validation set is only used to select hyper-parameters, such as the number of network layers, the number of nodes per layer, and the learning rate, which are set before the learning process starts.

  • Test set

    The test set is used only once, when evaluating the final model after training is complete. It is involved neither in training the model nor in selecting hyper-parameters; it is used only to evaluate the model.

  1. Use the training set to train the model
  2. Use the validation set to test the trained model in order to evaluate its performance (a sketch of the three-way split follows)
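
As a minimal sketch of this three-way split (not from the original post), assuming scikit-learn and a hypothetical feature matrix X with labels y already loaded as NumPy arrays:

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the data as the test set, kept untouched until the end.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Split the remaining 80% into a training set and a validation set.
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, random_state=42)

    # Result: roughly 60% training, 20% validation, 20% test.

The exact proportions are an assumption for illustration; the point is that the test set is separated first and never touched during training or model selection.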

The role of cross-validation (Cross Validation)

  • It helps assess the quality of a model
  • It helps us pick the model that will achieve the best performance on unseen data
  • It helps us avoid overfitting and underfitting

What are underfitting and overfitting? They were covered in "[Machine Learning] Underfitting & Overfitting (Underfitting & Overfitting)"; here is a brief recap:

Underfitting means the model cannot learn enough from the training data (the degree of fit is low, the fitted curve is far from the data points, and the model fails to capture the data's characteristics, so it cannot fit the data well). In this case, the model performs poorly on both the training set and the test set.

Overfitting means that, in order to obtain a consistent hypothesis, the hypothesis is made overly strict, so the model clings too closely to the training data. In this case, the model does well on the training set but performs poorly on the test set.
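
As a rough illustration (my own sketch, not from the original post), the two failure modes can be spotted by comparing training and validation scores: low scores on both suggest underfitting, while a high training score paired with a much lower validation score suggests overfitting. Assuming hypothetical arrays X_train, y_train, X_val, y_val:

    from sklearn.tree import DecisionTreeRegressor

    for depth in (1, 20):  # a very shallow tree vs. a very deep one
        model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
        train_score = model.score(X_train, y_train)  # R^2 on the training data
        val_score = model.score(X_val, y_val)        # R^2 on held-out data
        # depth=1 typically scores low on both sets (underfitting);
        # depth=20 typically scores high on train but lower on validation (overfitting).
        print(depth, round(train_score, 3), round(val_score, 3))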

An example

In practice, the data is generally divided into a training set (Training Set), a known test set (Public Testing Set), and an unknown test set (Private Testing Set).

Specific cross-validation steps:

  • Split the training set

    Split the training set (Training Set) into a true training set (Training Set) and a validation set (Validation Set).

  • Train the models

    Suppose you currently have three candidate models and your goal is to select the best one. Train all three models on the training set (Training Set).

  • Validate the models

    After training, evaluate the three trained models on the validation set (Validation Set) and pick the best one (the one with the smallest Error). If you feel that splitting the original training set in two left too little data for training, you can, once the model has been chosen, retrain it on the full unsplit training set (Training Set).

  • Test the model

    Use the chosen trained model to predict on the test set (Testing Set) and compute its Error. This Error will often be larger than the Error the model achieved on the validation set (Validation Set), but the Error on the test set (Testing Set) is what truly reflects the Error you can expect on the unknown test set (Private Testing Set). (A code sketch of this whole workflow follows the note below.)

Note: do not use the test set during training and then test the model on that same test set. Doing so is cheating: it inflates the model's accuracy on that particular test set, but the model's accuracy on other unknown test sets may not improve at all. In online competitions such as Tianchi and DC, there is a corresponding held-back test set that you never get to see (don't even be tempted; just don't do it).
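
A sketch of the workflow above (my own illustration, not code from the original post), assuming scikit-learn and hypothetical NumPy arrays X_train, y_train (true training set), X_val, y_val (validation set), and X_test, y_test (test set):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
    from sklearn.metrics import mean_squared_error

    # Three candidate models, standing in for the "three models" in the example.
    candidates = {
        "linear": LinearRegression(),
        "ridge": Ridge(alpha=1.0),
        "lasso": Lasso(alpha=0.1),
    }

    # Train each candidate on the training set and measure its Error on the validation set.
    val_errors = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        val_errors[name] = mean_squared_error(y_val, model.predict(X_val))

    # Pick the model with the smallest validation Error.
    best_name = min(val_errors, key=val_errors.get)
    best_model = candidates[best_name]

    # Optionally retrain the chosen model on the full training data (training + validation).
    best_model.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

    # Evaluate exactly once on the test set; this Error is usually larger than the validation Error.
    test_error = mean_squared_error(y_test, best_model.predict(X_test))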

k-fold cross-validation (k-fold Cross Validation)

When there is not enough data to train the model, splitting off a portion of the data for validation can lead to underfitting: with a smaller training set, the model may miss important features or trends in the dataset, which increases the bias (Bias) and hence the error.

Therefore, we need a method that provides ample samples for training the model while still reserving part of the data for validation: k-fold cross-validation.

Detailed steps (illustrated here with 3-fold cross-validation):

  • Shuffle the data set and split the shuffled data evenly into k parts
  • In turn, take k-1 of the parts as the training set (Training Set) and the remaining part as the validation set (Validation Set), then train the model and compute its Error
  • After iterating k times, use the average of the k Errors as the basis for selecting the best model (a sketch of this loop follows)
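
A minimal sketch of these steps (my own illustration), using scikit-learn's KFold and assuming hypothetical NumPy arrays X and y and a Ridge regressor as the model being evaluated:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    kf = KFold(n_splits=3, shuffle=True, random_state=42)  # shuffle, then split into 3 parts

    errors = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=1.0)
        model.fit(X[train_idx], y[train_idx])        # train on k-1 parts
        pred = model.predict(X[val_idx])             # validate on the remaining part
        errors.append(mean_squared_error(y[val_idx], pred))

    mean_error = np.mean(errors)  # average Error over the k folds, used to compare models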

Advantages: after performing k-fold cross-validation, the average of the k scores is used as the score of the model as a whole. Every sample appears in the validation set exactly once and in the training set k-1 times.

  • This significantly reduces underfitting, since most of the data is used for training
  • It also reduces the possibility of overfitting, because all of the data is eventually used to validate the model (in practice the whole procedure can be done with a single library call, as sketched below)
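
In practice the manual loop above is often replaced by a single library call; a sketch with scikit-learn's cross_val_score (same hypothetical X and y):

    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import Ridge

    # 5-fold cross-validation: one score per fold; the mean serves as the model's overall score.
    scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5)
    print(scores.mean(), scores.std())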

That's all for this article; corrections are welcome!

(If this article helped you, consider leaving a like or a comment. Thank you for your support~)

Origin blog.csdn.net/Oh_MyBug/article/details/104304085