Machine Learning - L3: Training Set and Test Set

1. Overfitting and underfitting

  • Underfitting: the model fails to fit even the training data; high bias.
  • Overfitting: the model fits the training data well but fails to generalize to the test data; high variance.

Overfitting usually occurs when the model has too many features: it can always fit the training data well, but its generalization ability (the ability of the model to perform on new samples) is poor, so it cannot be applied to new data.
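To make the contrast concrete, here is a minimal sketch (assuming NumPy and scikit-learn, neither of which this article actually uses) that fits polynomials of increasing degree to noisy synthetic data; the gap between training and test scores is what exposes overfitting:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Noisy samples of a cubic function (synthetic data, purely for illustration).
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 3 - 2 * X.ravel() + rng.normal(0, 3, 60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

for degree in (1, 3, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  "
          f"train R^2={model.score(X_train, y_train):.2f}  "
          f"test R^2={model.score(X_test, y_test):.2f}")
```

The degree-15 model typically scores near-perfectly on the training set while its test score drops, which is exactly the high-variance behavior described above.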

In order to prevent overfitting, the data set is generally divided into two parts:

  • Training set: used to train the model.
  • Test set: used to test the model.

Sometimes during model training an auxiliary model is built for tuning parameters (such as choosing the number of hidden units in a neural network); in that case the training data is further divided into a training set and a validation set (see the sketch after the definitions below).
The validation set may be reused during model training, while the test set is used only to assess the accuracy of the finished model and must never be used for training.
In practical applications, the data set is often divided only into a training set and a test set.

"Pattern Recognition and Neural Networks" (Ripley, BD, 1996) defines the three words as follows:

  • Training set: A set of examples used for learning, which is to fit the parameters [i.e., weights] of the classifier.
  • Validation set: A set of examples used to tune the parameters [i.e., architecture, not weights] of a classifier, for example to choose the number of hidden units in a neural network.
  • Test set: A set of examples used only to assess the performance [generalization] of a fully specified classifier.
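A minimal sketch of such a three-way split, assuming scikit-learn's train_test_split (the article itself names no library), with a 60/20/20 allocation:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # toy feature matrix
y = np.arange(100)                 # toy labels

# First hold out the test set, then split the remainder into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 * 0.80 = 0.20

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```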

2. Training set and test set

2.1 Holdout method

The holdout method is a common way to divide data into a training set and a test set: the given data is randomly partitioned into two independent sets, usually allocated to the training set and the test set at a ratio of 75/25 or 80/20.
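For illustration, a hand-rolled holdout split might look like the following (holdout_split is a hypothetical helper, not something the article defines; in practice scikit-learn's train_test_split does the same job):

```python
import numpy as np

def holdout_split(X, y, test_ratio=0.25, seed=0):
    """Randomly partition (X, y) into one training set and one test set."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))   # shuffle indices before splitting
    n_test = int(len(X) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(40).reshape(-1, 2)  # 20 toy samples, 2 features each
y = np.arange(20)
X_train, X_test, y_train, y_test = holdout_split(X, y)
print(len(X_train), len(X_test))  # 15 5
```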

2.2 Cross-validation

In k-fold cross-validation, the initial data is randomly divided into $k$ disjoint subsets (folds) $S_1, S_2, \ldots, S_k$ of approximately equal size.
Training and testing are then performed $k$ times: in the $i$-th iteration, the fold $S_i$ serves as the test set and the remaining folds serve as the training set.
That is, in the first iteration the subsets $S_2, \ldots, S_k$ form the training set to obtain the first model, which is tested on $S_1$; in the second iteration the subsets $S_1, S_3, \ldots, S_k$ form the training set to obtain the second model, which is tested on $S_2$; and so on. Each sample is used the same number of times for training and exactly once for testing.
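A minimal k-fold sketch with scikit-learn (an assumed library choice; the dataset and classifier are only placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# k = 5: each sample is tested exactly once and used for training k-1 = 4 times.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores.round(3), "mean:", scores.mean().round(3))
```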

Origin blog.csdn.net/apr15/article/details/105545502