Lecture 15: Validation
15-1 Model selection problems
What is a good model? One whose out-of-sample error Eout is as small as possible.
But this faces a problem: it is impossible for us to know the value of Eout.
So how do we choose? Not visually: in high dimensions the hypotheses cannot be inspected by eye.
Choose the one with the smallest Ein? No: the winner is likely an overfit model with bad generalization.
One answer: pick the model with the best result on held-out data. Leave a small part of the existing data aside as a test set for evaluating each finished model.
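The held-out-selection recipe above can be sketched as follows; the toy data, the candidate polynomial degrees, and the split size are illustrative assumptions, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy sine curve (hypothetical example).
x = rng.uniform(-1, 1, 60)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(60)

# Hold out part of the data for evaluation.
n_val = 15
x_tr, y_tr = x[:-n_val], y[:-n_val]
x_val, y_val = x[-n_val:], y[-n_val:]

best_deg, best_err = None, np.inf
for deg in (1, 3, 5, 9):                      # candidate models H_m
    coef = np.polyfit(x_tr, y_tr, deg)        # train only on the training part
    e_val = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
    if e_val < best_err:
        best_deg, best_err = deg, e_val

print(best_deg, best_err)
```

The key point is that the held-out points never touch training, so each model's held-out error is an honest estimate of its Eout.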
15-2 Test Set
Illustration of the answer to the model selection question above:
train each candidate model Hm on part of the data, use its error on the held-out part as a proxy for Eout, then compare and pick the best.
Comparing gm (trained on all the data) with gm- (trained on all the data minus the validation set):
when the validation set is small, gm and gm- are about the same;
when the validation set is large, gm- sees much less training data, so gm performs clearly better than gm-.
15-3 leave-one-out cross-validation
A schematic of this approach (for a linear model and a constant model, respectively):
here each gm- is trained on N-1 of the N points, so when the data size is large, gm- and gm are almost the same.
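A minimal leave-one-out sketch on hypothetical 1-D data, comparing the linear and constant models from the schematic (the data and the degrees are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 20)
y = 2.0 * x + 0.2 * rng.standard_normal(20)   # hypothetical linear data
N = len(x)

def loocv_error(degree):
    """Average squared error over the N leave-one-out splits."""
    errs = []
    for n in range(N):
        mask = np.arange(N) != n              # leave point n out
        coef = np.polyfit(x[mask], y[mask], degree)
        pred = np.polyval(coef, x[n])
        errs.append((pred - y[n]) ** 2)
    return np.mean(errs)

e_linear = loocv_error(1)      # linear model
e_constant = loocv_error(0)    # constant model
print(e_linear, e_constant)
```

On linear data the linear model's leave-one-out error comes out far smaller, which is exactly the comparison the schematic makes.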
15-4 V-Fold Cross Validation
Disadvantage 1 of leave-one-out: computation. With 1000 points, the model must be trained 1000 times.
One exception: for linear regression there is a closed-form formula for the leave-one-out error, so no retraining is needed.
Disadvantage 2 of leave-one-out: stability. On binary (1/0) problems each single-point error jumps between 0 and 1, so the estimate fluctuates badly.
So it is not often used in practice.
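The closed-form shortcut for linear regression is the identity E_loocv = (1/N) Σ_n ((y_n − ŷ_n) / (1 − H_nn))², where H = X(XᵀX)⁻¹Xᵀ is the hat matrix of the full-data fit. A sketch checking it against the brute-force N refits (the toy data is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])  # design matrix with bias
y = X @ np.array([0.5, 2.0]) + 0.1 * rng.standard_normal(N)

# Hat matrix H = X (X^T X)^{-1} X^T and residuals of the single full-data fit.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y

# Closed form: no retraining, just rescale each residual by 1 - H_nn.
e_formula = np.mean((resid / (1.0 - np.diag(H))) ** 2)

# Brute force for comparison: actually refit N times, leaving one point out.
errs = []
for n in range(N):
    mask = np.arange(N) != n
    w = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((X[n] @ w - y[n]) ** 2)
e_brute = np.mean(errs)

print(e_formula, e_brute)
```

The two numbers agree up to floating-point error, which is why leave-one-out is cheap for linear regression in particular.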
How V-Fold improves on leave-one-out:
split the data into V folds and take turns validating on one fold while training on the other V-1; for example, ten-fold cross-validation trains on nine parts and validates on the tenth, in turn.
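The take-turns scheme above can be sketched as follows; the toy data, the value V = 10, and the candidate degrees are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)

V = 10
folds = np.array_split(rng.permutation(len(x)), V)  # one fixed split into V folds

def vfold_error(degree):
    """V-fold cross-validation error for a polynomial model of the given degree."""
    errs = []
    for v in range(V):
        val = folds[v]                                # one fold held out for validation
        tr = np.concatenate([folds[u] for u in range(V) if u != v])  # the other V-1
        coef = np.polyfit(x[tr], y[tr], degree)
        errs.append(np.mean((np.polyval(coef, x[val]) - y[val]) ** 2))
    return np.mean(errs)

# Pick the candidate model with the smallest cross-validation error.
scores = {deg: vfold_error(deg) for deg in (1, 3, 5)}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Compared with leave-one-out, this costs only V trainings instead of N, and averaging over whole folds makes the estimate much more stable.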