Machine Learning: Model Evaluation and Selection (Summary)

Overfitting and underfitting

	Overfitting: the learner fits the training samples too well, and may treat peculiarities of the training samples themselves as general properties that all potential samples possess. For example, if every leaf in the training set has serrated edges, the learner may decide that a leaf without serrations does not belong to the leaf category. Overfitting reduces generalization ability.
	Underfitting: the opposite of overfitting; the learner has not captured the training samples accurately enough and has learned only part of their features. For example, if the training samples are leaves that share common properties (color, shape, etc.), underfitting may lead the learner to classify every green object as a leaf.
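As a rough illustration of both failure modes, here is a minimal sketch (synthetic data and scikit-learn, purely for demonstration, not from the original text): a degree-1 polynomial underfits a sine-shaped signal, while a very high-degree polynomial overfits it, driving training error down while test error grows.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # underfit / reasonable fit / overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

A widening gap between training and test error is the typical signature of overfitting; high error on both is the signature of underfitting.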

Common evaluation methods

1. Hold-out method
The hold-out method directly divides the data set D into two mutually exclusive sets: one is used as the training set S and the other as the test set T, that is, D = S ∪ T, S ∩ T = Ø. After training the model on S, its error on T is used as an estimate of the generalization error.
Assume the data set D contains 500 samples, S contains 300, and T contains 200. If, after training on S, the model misclassifies 70 samples on T, the error rate is 70/200 = 35% and the corresponding accuracy is 65% (accuracy + error rate = 1).
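A minimal sketch of the hold-out procedure (a toy synthetic classification set and a decision tree, both illustrative assumptions; the 300/200 split mirrors the example above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data set D with 500 samples, split into S (300) and T (200).
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, stratify=y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
n_wrong = (model.predict(X_test) != y_test).sum()
error_rate = n_wrong / len(y_test)  # e.g. 70 errors would give 70/200 = 35%
print(f"error rate = {error_rate:.2%}, accuracy = {1 - error_rate:.2%}")
```

Passing stratify=y keeps the class proportions of D in both S and T, so the split itself does not distort the data distribution.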
2. Cross-validation method
2.1 Cross-validation (cross validation) first divides the data set D into k mutually exclusive subsets, that is, D = D1 ∪ D2 ∪ ... ∪ Dk, with Di ∩ Dj = Ø (i ≠ j). Each subset Di keeps the data distribution as consistent with D as possible, i.e., it is obtained from D by stratified sampling. Then, each time, the union of k-1 subsets is used as the training set and the remaining subset as the test set. In this way k training/test set pairs are obtained, so k rounds of training and testing can be performed, and the mean of the k test results is returned. Because the stability and fidelity of the evaluation result depend largely on the value of k, the cross-validation method is often called "k-fold cross-validation" (k-fold cross validation).
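A minimal sketch of 10-fold cross-validation (StratifiedKFold performs the stratified sampling described above; the data set and classifier are again illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X, y):
    # The union of k-1 folds trains the model; the held-out fold tests it.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"mean accuracy over {len(scores)} folds = {np.mean(scores):.3f}")
```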
2.2 Leave-one-out, abbreviated LOO, is the special case in which k equals the number of samples m. Leave-one-out is not affected by the way samples are randomly partitioned, because m samples can be divided into m subsets in only one way, each subset containing a single sample. Since the training set used by leave-one-out is only one sample smaller than the initial data set, in most cases its evaluation result is close to that of training on the entire data set D, so leave-one-out is considered relatively accurate. However, when the data set D is large, the cost of training the models becomes very high: for example, a data set containing 10 million samples requires training 10 million models.
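A minimal sketch of leave-one-out evaluation (scikit-learn's LeaveOneOut on the small Iris set, since m models must be trained; the k-NN classifier is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
loo = LeaveOneOut()  # m splits: each test set contains exactly one sample

hits = 0
for train_idx, test_idx in loo.split(X):
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])
    hits += int(model.predict(X[test_idx])[0] == y[test_idx][0])

print(f"LOO accuracy = {hits / len(X):.3f}")  # mean of the m single-sample tests
```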
3. Bootstrap method
In the hold-out and cross-validation methods, because a portion of the samples is set aside for testing, the training set used by the evaluated model is smaller than D (only part of D), which may bias the final estimate. Leave-one-out is less affected by the change in training-set size, but its computational cost is high. To reduce the impact of differing training-set sizes and still carry out the estimation efficiently, the bootstrap method can be used.
The bootstrap method is based on bootstrap sampling. Given a data set D containing m samples, we sample it to generate a data set D': each time, a sample is randomly picked from D, copied into D', and then put back into D, so that it may be picked again in the next draw. After repeating this process m times, we obtain a data set D' containing m samples; this is the result of bootstrap sampling. Obviously, some samples in D will appear multiple times in D' while others never appear. A simple estimate shows that the probability a given sample is never picked in m draws is (1 - 1/m)^m, whose limit as m → ∞ is 1/e ≈ 0.368; that is, after bootstrap sampling, about 36.8% of the samples in the initial data set do not appear in the sampled set. We can therefore use D' as the training set and D \ D' as the test set. The evaluated model then still uses a training set of m samples, while the samples that never appear in the training set are used for testing. Such a test result is also called the "out-of-bag estimate".
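A minimal sketch of bootstrap sampling with an out-of-bag test set (plain NumPy indexing; the data set and classifier are illustrative assumptions, and for m = 500 the printed out-of-bag fraction should land near the theoretical 0.368):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, random_state=0)
m = len(X)

# Draw m indices with replacement: this is the bootstrap sample D'.
boot_idx = rng.randint(0, m, size=m)
oob_mask = np.ones(m, dtype=bool)
oob_mask[boot_idx] = False  # D \ D': the samples never drawn

print(f"out-of-bag fraction = {oob_mask.mean():.3f} (theory: ~0.368)")

model = DecisionTreeClassifier(random_state=0).fit(X[boot_idx], y[boot_idx])
oob_acc = model.score(X[oob_mask], y[oob_mask])  # the out-of-bag estimate
print(f"out-of-bag accuracy = {oob_acc:.3f}")
```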
The bootstrap method is useful when the data set is small and it is hard to split it effectively into training and test sets; moreover, it can generate multiple different training sets from the initial data set, which greatly benefits methods such as ensemble learning. However, the data set generated by bootstrap sampling changes the distribution of the initial data set, which introduces estimation bias. Therefore, when the amount of initial data is sufficient, the hold-out and cross-validation methods are more commonly used.
In practice, the evaluation strategy can be chosen according to the usage scenario and the needs at hand.
