Training Set, Validation Set, and Test Set (and why you need a validation set), plus Overfitting & Underfitting

Many people are confused about the concepts of training set, validation set, and test set, and online articles on the topic are a mixed bag, so let's go over this material again. First, let's look at several ways of validating (evaluating) a model.

 

In machine learning, once we have trained a model, how do we validate it? (How do we know whether the trained model is any good?) There are several ways of doing validation:

 

The first way: use all of the data as the training set, train the model on it, and validate the model on that same training set (if multiple candidate models need to be compared, pick the one with the smallest training error as the best model).

This approach clearly does not work: the more complex the model, the smaller its training error, so the training error can look perfect while the model may in fact be badly overfitted. This was already discussed in the article "Overfitting & Underfitting (Over Fitting & Under Fitting)".
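To see the problem concretely, here is a minimal sketch (using scikit-learn with synthetic data and polynomial-regression candidates, all of which are my own placeholder assumptions rather than anything from the original article): when candidate models of increasing complexity are compared by training error alone, the most complex one always looks best.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic 1-D regression data: y = sin(x) + noise (placeholder data).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

# Candidate models: polynomial regressions of increasing degree.
for degree in (1, 3, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)                        # train on ALL the data
    train_err = mean_squared_error(y, model.predict(X))
    print(f"degree={degree:2d}  training MSE={train_err:.4f}")

# The training MSE keeps shrinking as the degree grows, so the rule
# "pick the model with the smallest training error" always selects the
# most complex model -- which is exactly how overfitting slips in.
```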

 

The second way: split the data into a training set and a test set, train the model on the training set, and validate the model on the test set (if multiple candidate models need to be compared, pick the one with the smallest test error as the best model).

What kind of model is good? Clearly, the model with the smallest generalization error is best, but we cannot measure a model's generalization error directly. Therefore we set aside part of the data as a test set and use the test error to approximate the generalization error. Splitting off part of the data means the training set is smaller than the original data set, and as the learning curve shows, a model trained on less data will tend to have a larger test error. A reasonable approach is therefore: train each candidate model on the training set, use the test set to select the best model, record the configuration of that best model (for example, which algorithm to use, how many iterations, what learning rate, what feature transformations, which regularization method, what regularization coefficient, and so on), and then retrain a new model with that configuration on the entire data set and use it as the final model. The resulting model performs better, and its test error is closer to the generalization error; we call this test error "test error a".
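As an illustration, here is a minimal sketch of this second way (scikit-learn, synthetic data, and ridge-regression candidates are my own placeholder assumptions, not taken from the original article): select a configuration by its test error, record it, and retrain on all of the data to obtain the final model.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic regression data (placeholder for a real data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

# Split: 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Candidate configurations (here: the regularization factor of ridge regression).
test_errors = {}
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    test_errors[alpha] = mean_squared_error(y_test, model.predict(X_test))

# Record the best configuration, then retrain on ALL of the data.
best_alpha = min(test_errors, key=test_errors.get)
final_model = Ridge(alpha=best_alpha).fit(X, y)

# Note: the final model has now seen X_test during retraining, which is
# exactly the limitation discussed later in the article.
print("best alpha:", best_alpha)
print("test error of the final model ('test error a'):",
      mean_squared_error(y_test, final_model.predict(X_test)))
```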

 

The following figure shows how the test error (red line), test error a (blue line), training error (solid black line), and generalization error (dashed black line) change as the test set grows:

[Figure: error curves as the test set size increases]

You can see that test error a behaves most like the generalization error. As the test set grows, the test error (red line) at first gets closer and closer to test error a (blue line), but then becomes a worse and worse approximation than test error a (blue line), eventually even worse than the training error (solid black line). This is because the larger the test set, the less data is left for training, so the trained model is inevitably worse. Choosing the size of the test set therefore involves a dilemma: for test error a to be close to the generalization error, the test set has to be fairly large, because only then is there enough data to simulate unseen data, but then the gap between the test error and test error a becomes fairly large; for the test error to be close to test error a, the test set has to be fairly small, because only then is there enough data to train the model, but then there is a gap between test error a and the generalization error. In practice, the test set is usually set to about 20% of all the data.

 

Many tutorials therefore split the data into a training set (80%) and a test set (20%). Doing so rests on a premise: we list all the candidate configurations, train the corresponding models, and then pick the best model on the test set. This has two problems. First, a model usually has many hyperparameters, and they need to be tuned according to the actual situation. If the model's test results are unsatisfactory, we have to go back and retrain it. Although the test set is not used directly to train the model, if we keep adjusting the model based on the test error, information from the test set leaks into the model. That is clearly not acceptable, because the test set must be data the model has never seen; otherwise the results will look too good, and overfitting will occur. Second, how large is the generalization error of the selected best model? We still cannot assess it, because we retrain this best model on all of the data, leaving no unseen data with which to test it.

 

The third way: split the data into a training set, a validation set, and a test set; train the model on the training set, validate it on the validation set, select the final model, and evaluate the final model on the test set.

In the second way we split the data into a training set and a test set; now we additionally need a set for evaluating the final model. Since the name "test set" is already taken, we rename the earlier one the validation set to avoid confusion. The first steps are similar to the second way: train the models on the training set, then validate them on the validation set (note: this is an intermediate step, the final model has not been selected yet), keep adjusting the models according to the results and pick the best one (the validation error guides our choice of model), record the configuration of the selected best model, then train a new model with that configuration on the combined data (training set + validation set) and use it as the final model, and finally evaluate the final model on the test set.

 

Since information from the validation set has been fed into the model, the validation error is usually smaller than the test error. Also remember: the test error is the final result. Even if we are not satisfied with it, we should not go back and readjust the model, because doing so would leak information from the test set into the model.
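Here is a minimal sketch of this third way (again scikit-learn with synthetic data and ridge candidates as placeholder assumptions): the validation error drives model selection, the chosen configuration is retrained on training set + validation set, and the test set is touched exactly once, at the very end.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic regression data (placeholder for a real data set).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=300)

# First split off the test set (20%), then split the rest into a
# training set (60% of all data) and a validation set (20% of all data).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

# Model selection: only the validation error is used, the test set is not touched.
val_errors = {}
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_errors[alpha] = mean_squared_error(y_val, model.predict(X_val))
best_alpha = min(val_errors, key=val_errors.get)

# Retrain the chosen configuration on training set + validation set.
final_model = Ridge(alpha=best_alpha).fit(X_rest, y_rest)

# The test set is used exactly once, to evaluate the final model.
print("best alpha:", best_alpha)
print("test error:", mean_squared_error(y_test, final_model.predict(X_test)))
```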

 

The fourth way: cross-validation. For details, see the article "Validation and Cross Validation (Validation & Cross Validation)".
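Although the details are in that other article, here is a minimal sketch of k-fold cross-validation with scikit-learn's cross_val_score (the synthetic data and ridge model are the same placeholder assumptions as in the earlier sketches): the data is split into k folds, each fold serves as the validation set once, and the k validation scores are averaged.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

# Synthetic regression data (placeholder for a real data set).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation: each fold is used as the validation set once.
scores = cross_val_score(Ridge(alpha=1.0), X, y,
                         cv=5, scoring="neg_mean_squared_error")
print("mean validation MSE:", -scores.mean())
```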

 

In conclusion:

Training set (Training Set): used to train the model.

Validation set (Validation Set): used to select the model.

Test set (Test Set): used to evaluate the selected final model.

 

When we get the data, we generally split it three ways: training set (60%), validation set (20%), and test set (20%). Train the candidate models on the training set, validate them on the validation set, keep adjusting the models based on the results and pick the best one, record the configuration of the selected best model, then train a new model with that configuration on the combined data (training set + validation set) as the final model, and finally evaluate the final model on the test set.

 


Source: www.cnblogs.com/HuZihu/p/10538295.html