Machine learning: the role of the training, validation, and test sets

The meaning of the training set (train), validation set (validation), and test set (test)

In supervised machine learning, it is generally necessary to divide the samples into three independent sets: a training set, a validation set, and a test set. The training set is used to fit the model, the validation set is used to determine the network structure or the parameters that control the model's complexity, and the test set is used to assess the performance of the final selected model.

The distinction between these three sets can be confusing, especially the difference between the validation set and the test set. One commonly quoted set of definitions:


Training set: A set of examples used for learning, which is to fit the parameters [i.e., weights] of the classifier. 

Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network. 

Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier. 


Training set: used to learn from the sample data and fit the parameters of a classifier; in short, it is what the model is trained on.

Validation set: used to tune the learned model, for example to select the number of hidden units in a neural network; more generally, the validation set determines the network structure or the parameters that control model complexity.

Test set: used only to measure the discriminative ability of the fully trained model (its recognition rate, etc.).

I. Dividing the dataset

If we already have a large labeled dataset and want to evaluate a supervised model, we usually use uniform random sampling to divide it into three disjoint sets: a training set, a validation set, and a test set. A common ratio is 8:1:1, though the exact ratio is a matter of convention. Divided this way, all three sets follow the same distribution.
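As a minimal sketch of such a split (assuming NumPy arrays `X` and `y` holding the full labeled dataset; the function name and signature are illustrative, not from the original):

```python
import numpy as np

def split_dataset(X, y, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle, then split the labeled data into disjoint train/val/test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # uniform random sampling
    n_train = int(len(X) * ratios[0])
    n_val = int(len(X) * ratios[1])
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]
    test_idx = idx[n_train + n_val:]              # the remainder; no intersection
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))
```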


In a competition, the organizers typically provide only a labeled dataset (to serve as the training set) and an unlabeled test set. When building a model, we then usually carve a validation set out of the training set by hand, and we no longer set aside a separate test set. There are two possible reasons: (1) competitors are reluctant to give up any data, since the training set is small to begin with; and (2) we cannot guarantee that the test set to be submitted follows exactly the same distribution as the training set, so holding out a test set that is identically distributed with the training set would not mean much.

When the sample size is small, the division above is inappropriate. A common approach is to set aside a small portion as the test set and apply K-fold cross-validation to the remaining N samples: shuffle the samples, divide them evenly into K parts, train on K-1 parts in turn while validating on the remaining part, compute the sum of squared prediction errors each time, and finally average the K error sums; this average serves as the criterion for selecting the model structure. The special case K = N is known as leave-one-out cross-validation.
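A sketch of this procedure, where `train_and_evaluate` is a hypothetical user-supplied function that fits a model on the training folds and returns its prediction error on the held-out fold:

```python
import numpy as np

def k_fold_cv(X, y, k, train_and_evaluate, seed=0):
    """K-fold cross-validation: shuffle, split into k folds, then train on
    k-1 folds and validate on the held-out fold, k times in turn."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # scramble the samples
    folds = np.array_split(idx, k)                # k (nearly) even parts
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # train_and_evaluate returns e.g. the sum of squared prediction
        # errors on the held-out fold
        errors.append(train_and_evaluate(X[train_idx], y[train_idx],
                                         X[val_idx], y[val_idx]))
    return np.mean(errors)   # averaged error: the model-selection criterion

# k = len(X) gives the special case of leave-one-out cross-validation
```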

II. Parameters

Once you have a model, the training set is used to train its parameters, or more precisely, to drive gradient descent. The validation set is typically used to measure the accuracy of the current model after each epoch. Because the validation set does not intersect the training set, this accuracy is a reliable estimate. So why do we still need a test set?

Answering that requires distinguishing between the kinds of parameters a model has. A model's parameters can be divided into ordinary parameters and hyperparameters. Setting reinforcement learning aside, ordinary parameters are the ones updated by gradient descent, that is, updated using the training set. Hyperparameters, by contrast, include the number of network layers, the number of nodes per layer, the number of iterations, the learning rate, and so on; gradient descent does not update them. Although algorithms now exist for searching over a model's hyperparameters, in most cases we tune them manually based on validation results.
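A hedged sketch of this manual tuning loop, treating the learning rate as the hyperparameter being chosen; `train_fn` and `accuracy_fn` are hypothetical placeholders for a gradient-descent trainer and an accuracy metric:

```python
def tune_hyperparameter(train_fn, accuracy_fn, train_data, val_data,
                        learning_rates=(1e-1, 1e-2, 1e-3)):
    """Choose a hyperparameter (here the learning rate) by validation accuracy.
    Gradient descent updates the ordinary parameters inside train_fn;
    it never touches this outer loop."""
    best = (None, -1.0, None)                           # (lr, accuracy, model)
    for lr in learning_rates:
        model = train_fn(train_data, learning_rate=lr)  # fit on the train set
        acc = accuracy_fn(model, val_data)              # score on the validation set
        if acc > best[1]:
            best = (lr, acc, model)
    return best
```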


III. The takeaway

In other words, in the narrow sense, the validation set does not take part in gradient descent, so it is never trained on; but in a broader sense, the validation set does take part in a process of "manual parameter tuning": based on validation results we adjust the number of iterations, the learning rate, and so on, until the results are optimal on the validation set. In that sense, the validation set can also be said to participate in training.


It then becomes clear that we still need a set that has played no part in training whatsoever: the test set. We neither run gradient descent on it nor use it to choose hyperparameters; only after the model is fully trained do we use it, once, to measure the final accuracy.
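Putting the sketches above together, the overall workflow could look like the following (reusing the same hypothetical helpers):

```python
# Disjoint 8:1:1 split, as in section I.
train, val, test = split_dataset(X, y)
# Gradient descent on train; hyperparameters tuned against val, as in section II.
lr, val_acc, model = tune_hyperparameter(train_fn, accuracy_fn, train, val)
# The test set is touched exactly once, after all training and tuning:
test_acc = accuracy_fn(model, test)
```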
