Machine Learning Basics: Cross Validation

Cross Validation Definition

When a model is trained on a particular data set, the outcome of the training is directly shaped by that data set and is hard to predict in advance; both underfitting and overfitting can easily occur. Underfitting generally means that the model has not learned enough from the training data and therefore performs poorly on both the training data set and the test data set. Overfitting means that the model performs well only on the training data but performs poorly when applied to new or test data. Whether a model is underfitting can be seen directly from its performance on the training data set, but detecting overfitting requires validating the model on a test data set. Therefore, after a model has been trained on one data set, it is usually applied to another data set with the same data structure; if the model also performs well on this new data set, the training is considered successful.
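As a rough illustration of how this diagnosis works in practice, the sketch below compares a model's accuracy on its training data with its accuracy on held-out test data; the synthetic data set, the decision tree model, and the 70/30 split are illustrative assumptions, not prescribed by the text. A large gap between the two scores suggests overfitting, while low scores on both suggest underfitting.

```python
# Hypothetical sketch: diagnose fit quality by comparing training and test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)  # a near-perfect training score combined with
test_acc = model.score(X_test, y_test)     # a much lower test score indicates overfitting
print(f"train accuracy: {train_acc:.3f}  test accuracy: {test_acc:.3f}")
```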
In practice, preparing two separate data sets with the same structure but different contents just to run an experiment adds unnecessary complexity. Instead, the available data set is usually partitioned: one part is selected as the training data set to train the model, and the remaining part is used as the test data set to validate the trained model. This validation approach is called cross validation (Cross Validation, CV for short). It is mainly used to estimate how accurately the trained model will perform in real prediction or classification tasks, and thus to judge whether its generalization ability is good. Cross-validation makes it possible to compare the performance of trained models and determine which one performs best on the data set. It can also reduce overfitting and underfitting to some extent and extract more useful information from the data.
Four cross-validation methods are in common use today: random sub-sampling validation (the Hold-out Method), K-fold cross-validation (K-fold Cross Validation), leave-one-out cross-validation (Leave-one-out Cross Validation), and bootstrap sampling validation (Bootstrapping).

1. Random sub-sampling validation

Random sub-sampling validation was proposed earlier than the other cross-validation methods. Strictly speaking, it does not use the data in a crossed fashion at all: it simply divides the original data set into two groups at random, one serving as the training data set used to train the model and the other serving as the test data set used to validate it. The method is simple to apply, since it only involves randomly grouping the data, but the resulting training data set may differ considerably in distribution from the original data set or from the test data set. As a result, the trained model can easily perform well on the training data yet poorly on the test data; in other words, the final validation result depends heavily on how the data happened to be split. Because of this, the results obtained with this method are sometimes not convincing.
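Below is a minimal sketch of random sub-sampling (hold-out) validation, assuming scikit-learn, a synthetic data set, and a 70/30 split; repeating the split with different random seeds illustrates how strongly the result can depend on the particular grouping.

```python
# Hypothetical sketch: hold-out validation, repeated with different random splits.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

for seed in (0, 1, 2):
    # one random split: 70% of the data for training, 30% for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # the score can vary noticeably from one random grouping to the next
    print(f"split {seed}: test accuracy = {model.score(X_test, y_test):.3f}")
```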

2. K-fold cross-validation

K-fold cross-validation is an improvement on random sub-sampling validation. Unlike that method, it randomly and evenly divides the original data set into K groups without repeated sampling, producing K sub-data sets, and then trains and tests the model K times. In each round, one of the K sub-data sets is selected as the test data set and the remaining K-1 sub-data sets are used as the training data set; the model is trained and then evaluated on the held-out fold. Finally, the evaluation results on the K test data sets are averaged to give the overall validation result.
With K-fold cross-validation, every sample in the original data set is used both as training data and as test data, so the evaluation is less prone to overfitting a particular split and is more convincing than random sub-sampling validation. Its remaining drawback is that there is no unified rule for choosing K: in general, K is set to 2 only when the amount of data is very small, and otherwise a positive integer of 3 or more is chosen.
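The following is a minimal sketch of K-fold cross-validation, assuming scikit-learn, a synthetic data set, and K = 5; the data, model, and K value are illustrative choices, not requirements of the method.

```python
# Hypothetical sketch: K-fold cross-validation with K = 5.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # K equal folds, no repeated sampling

scores = []
for train_idx, test_idx in kf.split(X):
    # in each round, one fold is the test set and the other K-1 folds train the model
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:", np.mean(scores))   # final result = average over the K folds
```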

3. Leave-one-out cross-validation

The central idea of leave-one-out cross-validation is the same as that of K-fold cross-validation: every sample in the data set is used both as training data and as test data. The difference is that K-fold cross-validation uses a whole fold of samples as the test set in each round, whereas leave-one-out cross-validation uses a single sample.
Assuming the original data set contains N samples, the model is trained and tested N times. In each round, one sample is selected in turn as the test data and the remaining N-1 samples form the training data set; finally the N evaluation results are averaged to obtain the overall validation result.
The biggest advantage of this method is that almost all of the samples (N-1 of them) are used to train the model in every round, so the distribution of the training data stays closest to that of the original data set, and the evaluation obtained on this basis is correspondingly reliable. However, because every sample takes its turn as test data, a large total sample size N leads to a large amount of computation and a long training time, so the computational cost is relatively high.
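A minimal sketch of leave-one-out cross-validation is given below; the small synthetic data set (N = 50) and the logistic regression model are assumptions chosen to keep the N training runs cheap.

```python
# Hypothetical sketch: leave-one-out cross-validation on a small data set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

X, y = make_classification(n_samples=50, n_features=5, random_state=0)
loo = LeaveOneOut()   # N splits: each sample is the test set exactly once

scores = []
for train_idx, test_idx in loo.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))   # 0 or 1 per left-out sample

print("number of training runs:", len(scores))   # equals N
print("mean accuracy:", np.mean(scores))
```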

4. Bootstrap sampling validation

Bootstrap sampling validation is a special and less commonly used cross-validation method, generally applied to data sets with only a small amount of data. Its defining feature is that the training data set is drawn from the original data set by random sampling with replacement, and the samples that were never drawn form the test data set. When the data set is small, this method provides an effective way to divide it into a training data set and a test data set.
Assuming the original data set contains N samples, N random draws with replacement are made from it; the N drawn samples form the training data set, and the samples that were never drawn form the test data set. The probability that a given sample is never drawn in N draws is (1 - 1/N)^N, which approaches 1/e ≈ 36.8% as N grows, so roughly 36.8% of the samples end up in the test data set.
With this method the final training data set has exactly the same size as the original data set, and the test data set amounts to roughly one third of it, so the ratio between training and test data is quite reasonable. However, because sampling with replacement duplicates some samples and leaves out others, the distribution of the training data set can deviate noticeably from that of the original data set, and this deviation can have a considerable impact on the model's validation results.
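Below is a minimal sketch of bootstrap sampling validation using NumPy and scikit-learn; the data set and model are illustrative assumptions. The out-of-bag fraction it prints should come out close to the 1/e ≈ 36.8% figure mentioned above.

```python
# Hypothetical sketch: bootstrap (sampling with replacement) validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10, random_state=0)
n = len(X)

rng = np.random.default_rng(0)
boot_idx = rng.integers(0, n, size=n)              # draw N samples with replacement
oob_idx = np.setdiff1d(np.arange(n), boot_idx)     # never-drawn samples form the test set

model = LogisticRegression(max_iter=1000).fit(X[boot_idx], y[boot_idx])
print("out-of-bag fraction:", len(oob_idx) / n)    # roughly 1/e ~ 0.368
print("out-of-bag accuracy:", model.score(X[oob_idx], y[oob_idx]))
```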

Origin blog.csdn.net/weixin_42051846/article/details/129441378