Training Set, Validation Set, Test Set, and Overfitting & Underfitting

Reposted from: https://www.cnblogs.com/HuZihu/p/10538295.html

Many people are confused about the concepts of training set, validation set, and test set, and the articles online are a mixed bag, so let's sort this knowledge out here. First, let's look at several ways of validating (evaluating) a model.

 

In machine learning, once we have trained a model, how do we validate it? (How do we know the trained model is any good?) There are several ways:

 

The first way: use the entire data set as the training set, train the model on it, and then validate the model on that same training set (if there are multiple models to choose from, the one with the smallest training error is selected as the best model).

This approach is clearly not feasible: the training data has already been used to fit the model, so validating the model on the same data gives overly optimistic results. If we evaluate and select among multiple models this way, we find that the more complex the model, the smaller its training error; when a model's training error looks perfect, the model may in fact already be seriously overfitted. This was discussed in the article "Overfitting & Underfitting (Over Fitting & Under Fitting)". (We call the model selected by training error in this way ĝ_m.)
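To make the "training error keeps shrinking as model complexity grows" point concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available; the sine target, the noise level, and the polynomial degrees are illustrative choices, not from the original post.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1, 1, size=(30, 1)), axis=0)
y = np.sin(3 * X).ravel() + rng.normal(scale=0.3, size=30)   # noisy target

# Judged by training error alone, the more complex model always looks better
for degree in (1, 3, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    print(f"degree={degree:2d}  training MSE={train_mse:.4f}")
```

The training MSE typically keeps dropping as the degree rises, even though the high-degree fits are clearly chasing the noise.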

 

The second way: randomly split the data set into a training set and a test set, train the model on the training set, and validate it on the test set (if there are multiple models to choose from, the one with the smallest test error is selected as the best model).

What kind of model is a good one? Obviously, the model with the smallest generalization error is the best, but we have no way to measure a model's generalization error directly. Therefore, we split off part of the data as a test set and use the error on it to approximate the generalization error.

 

Splitting off part of the data means the training set becomes smaller than the original data set, and the learning curve tells us that a model trained on less data tends to have a larger test error. Therefore, when evaluating and selecting among multiple models, a reasonable approach is: train each model on the training set, use the test set to pick the best of them (we call this model g_m*⁻), record the settings of that best model (for example, which algorithm, how many iterations, what learning rate, which feature transform, what kind of regularization and how large the regularization coefficient, and so on), and then retrain a new model with those settings on the whole data set as the final model (we call it g_m*). The model obtained this way performs better, and its error is closer to the ideal generalization error.
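As a rough illustration of this second way, here is a sketch assuming scikit-learn and one of its built-in data sets; the candidate models and the 75/25 split are illustrative assumptions, not from the original post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "logreg_C=0.1": LogisticRegression(C=0.1, max_iter=5000),
    "logreg_C=1.0": LogisticRegression(C=1.0, max_iter=5000),
    "rf_100_trees": RandomForestClassifier(n_estimators=100, random_state=0),
}

# g_m*-: train each candidate on the training set, pick the smallest test error
best_name, best_err = None, float("inf")
for name, model in candidates.items():
    model.fit(X_train, y_train)
    err = 1.0 - model.score(X_test, y_test)   # test error of this candidate
    if err < best_err:
        best_name, best_err = name, err

# g_m*: refit the selected configuration on the whole data set
final_model = candidates[best_name].fit(X, y)
print("selected:", best_name, " test error of g_m*-:", round(best_err, 4))
```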

 

The following figure shows how, as the test set grows, the generalization error of each model - g_m*⁻ (red line), g_m* (blue line), ĝ_m (solid black line) - changes relative to the ideal (expected) generalization error (dashed black line):

 

As can be seen, g_m* (blue line) performs best and stays closest to the ideal generalization error (dashed black line). As the test set grows, g_m*⁻ (red line) at first performs more and more like g_m* (blue line), but then it becomes worse than g_m* (blue line), eventually even worse than ĝ_m (solid black line). This is because a larger test set leaves less data for training, and the model trained on less data is bound to be worse. So when choosing the size of the test set there is a dilemma: if we want g_m* (blue line) to be close to the ideal generalization error, the test set should be relatively large, so that there is plenty of data to simulate unseen examples; but then the gap between the generalization errors of g_m* (blue line) and g_m*⁻ (red line) becomes large. If instead we want g_m* (blue line) and g_m*⁻ (red line) to have similar generalization errors, the test set should be relatively small, so that there is enough data left to train the model; but then the gap between g_m*'s (blue line) generalization error and the ideal becomes large. In practice, the test set is usually set to 20% to 30% of all the data.

 

This is why much of the material out there simply divides the data into a training set (70%-80%) and a test set (20%-30%). Doing so presupposes that we can list all the possible model settings, train every one of them, select the best on the test set, and then retrain a final model on all the data with those best settings. There are two problems with this. First, a model usually has many hyperparameters, and we are unlikely to be able to enumerate every possible setting; hyperparameters usually need to be adjusted according to the actual situation. If the model's results on the test set are unsatisfactory, we have to go back and retrain the model. Although the test set is not used directly to train the model, if we keep adjusting the model according to the test error, information about the test set leaks into the model. Obviously this is not acceptable: the test set must be data the model has never seen, otherwise the results will be too optimistic and overfitting will occur. Second, how large is the final model's generalization error? We still cannot assess it, because the final model was retrained on all the data, so there is no data left that this final model has never seen.

 

The third way: randomly split the data set into a training set, a validation set, and a test set; train the model on the training set, validate it on the validation set, keep adjusting the model according to the validation results and select the best of them, then train a final model on the training and validation data together, and finally assess the final model on the test set.

This is, in fact, already the full process of model evaluation and model selection. In the second way we split the data into a training set and a test set; now we need to set aside another test set to assess the final model. Since there is already a set called the test set, the set we use for model selection is renamed the validation set to avoid confusion. (That is, first split the data into a training set and a test set, then split the training set further into a training set and a validation set.)

 

The first steps are similar to the second way: train models on the training set and validate them on the validation set (note: this is an intermediate step; the best model has not been selected yet), keep adjusting the models accordingly and choose the best among them (the validation error guides which model to pick), record the settings of that best model, then train a new model with those settings on the combined data (training set + validation set) as the final model, and finally assess the final model on the test set. The sketch below walks through this workflow end to end.
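A minimal sketch, assuming scikit-learn; the logistic-regression candidates, the C grid, and the split sizes are illustrative assumptions, not part of the original post.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Carve off the test set first, then split the rest into training and validation sets
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Model selection: the validation error guides which hyperparameter setting to pick
best_C, best_val_err = None, float("inf")
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    val_err = 1.0 - model.score(X_val, y_val)
    if val_err < best_val_err:
        best_C, best_val_err = C, val_err

# Retrain the recorded best setting on training + validation data (the final model)
final_model = LogisticRegression(C=best_C, max_iter=5000).fit(
    np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val])
)

# The test set is touched exactly once, for the final estimate of generalization error
test_err = 1.0 - final_model.score(X_test, y_test)
print(f"best C={best_C}, validation error={best_val_err:.4f}, test error={test_err:.4f}")
```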

 

Since information from the validation set has leaked into the model, the validation error is usually smaller than the test error. Also remember: the test error is the final result we get; even if we are not happy with the test score, we should not go back and adjust the model again, because that would leak information about the test set into the model.

 

The fourth way: cross-validation --- see "Validation & Cross-Validation" for details.
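The linked article has the details; as a rough sketch of the idea, k-fold cross-validation lets every fold take a turn as the validation set and averages the scores (assuming scikit-learn; the model and k = 5 are illustrative choices).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("per-fold accuracy:", scores)
print("mean CV accuracy:", scores.mean())
```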

 

The fifth way: Bootstrap --- see "Bootstrap (Bootstrapping)" for details.
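Again, the details are in the linked article; as a rough sketch of the idea, the bootstrap draws samples of the same size as the data, with replacement, and looks at the spread of a statistic across the resamples (NumPy only; the data and the statistic here are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)        # illustrative data set

# Draw 1000 bootstrap resamples (same size as the data, with replacement)
boot_means = [rng.choice(data, size=data.size, replace=True).mean() for _ in range(1000)]
print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error of the mean:", np.std(boot_means))
```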

 

In conclusion:

Training set (Training Set): used to train the model.

Validation set (Validation Set): used to adjust and select the model.

Test set (Test Set): used for the final assessment of the model.

 

When we get the data, in general we split it three ways: training set (60%), validation set (20%), and test set (20%). Train the models on the training set and validate them on the validation set, keep adjusting the models accordingly and select the best among them, record the settings of that best model, then train a new model with those settings on the combined data (training set + validation set) as the final model, and finally assess the final model on the test set.
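One hedged way to produce that 60% / 20% / 20% split with scikit-learn (the data set and random seed are illustrative) is to carve off the 20% test set first and then split a quarter of the remainder off as the validation set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# 20% test set first, then 25% of the remaining 80% (i.e. 20% of the whole) as validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
print(len(X_train), len(X_val), len(X_test))           # roughly 60% / 20% / 20%
```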

 
