R machine learning: Understanding the role of resampling in building machine learning models

At the start of a machine learning project, we split the data into a training set and a test set. Remember that the test set may be used only once, and only to evaluate the final, best model. Repeatedly testing on it and picking whichever model scores best is cheating.

Model tuning is inevitable during the modeling process, and that raises the problem of model selection. When I build many candidate models along the way, a question arises: without some form of evaluation, how do I know which model is best?

Typically we can’t decide on which final model to use with the test set before first assessing model performance. There is a gap between our need to measure performance reliably and the data splits (training and testing) we have available.

So think about it: we need an evaluation step before touching the test set, one that helps us decide which model is best and therefore worth taking to the test set.

That step is resampling (repeated sampling)!

Resampling methods, such as cross-validation and the bootstrap, are empirical simulation systems. They create a series of data sets similar to the training/testing split.

First understand overfitting

Before getting into resampling, let's review the concept of overfitting. After splitting the data, we train the model on the training set. How do we evaluate it? The natural idea is to apply the model to the training set itself and compare predicted values with the true values. Some articles do exactly this, but many black-box models can reproduce the training set almost perfectly, so their training-set predictions look essentially unbiased. Does that mean the black-box model is necessarily good?

Bias is the difference between the true pattern or relationships in data and the types of patterns that the model can emulate. Many black-box machine learning models have low bias, meaning they can reproduce complex relationships. Other models (such as linear/logistic regression, discriminant analysis, and others) are not as adaptable and are considered high bias models.

Not necessarily. Let's look at a practical example.

On the same data set, I built two models: a linear regression, lm_fit, and a random forest, rf_fit. Their performance on the training set is as follows:
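To make this concrete, here is a minimal tidymodels sketch of what fitting the two models and checking their training-set metrics could look like. This is not the author's exact code; train_data and the outcome column y are hypothetical placeholders.

library(tidymodels)

# Hypothetical objects: train_data with a numeric outcome y
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(y ~ ., data = train_data)

rf_fit <- rand_forest(trees = 1000) %>%
  set_mode("regression") %>%
  set_engine("ranger") %>%
  fit(y ~ ., data = train_data)

# Training-set performance: predict on the same data the models were fit to
lm_fit %>%
  predict(train_data) %>%
  bind_cols(train_data) %>%
  metrics(truth = y, estimate = .pred)   # reports rmse, rsq, mae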

Looking at the results above, both metrics, rmse and rsq, suggest that the random forest performs better on the training set. By the logic above, I should therefore choose the random forest model.

Convinced that the random forest really is better than the linear regression, I then ran the random forest on the test set for the final performance evaluation. The results are as follows.
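A hedged sketch of that final check, using the same hypothetical test_data and y names as above:

rf_fit %>%
  predict(test_data) %>%
  bind_cols(test_data) %>%
  metrics(truth = y, estimate = .pred)   # test-set rmse, rsq, mae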

Relative to the training set, rmse rose from 0.03 to 0.07, and R-squared also dropped significantly.

By the original plan, my work would stop here: I would conclude that choosing the random forest was indeed right, and that this is simply the predictive ability the model has.

But let's go one step further.

Although I just claimed the linear model is worse than the random forest, I was curious how it would perform on unseen data. So I took the linear model we had discarded and evaluated it on the test set as well:
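Again as a sketch, the discarded linear model can be scored on the same hypothetical test set:

lm_fit %>%
  predict(test_data) %>%
  bind_cols(test_data) %>%
  metrics(truth = y, estimate = .pred)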

The linear model's performance is highly consistent between the training set and the test set, and its test-set performance is actually not far from that of the random forest.

The lesson from this example is that a well-trained model (one that performs well on the training set) is not necessarily good on the test set. Performing well on the training set but poorly on the test set is exactly what overfitting looks like. The way to guard against overfitting during training and keep performance consistent is to train with resampling.

Let's look at resampling

The logic of training with resampling is as follows:

We repeatedly sample the original training set to form many resamples (sub-samples).

Each resample is divided into an analysis set and an assessment set. We train the model on the analysis set and then evaluate it on the assessment set. For example, if I draw 20 resamples, I will fit 20 models; each model is evaluated once, giving 20 evaluations, and whether the model is good overall is judged by the average of those 20 results. This makes the performance estimate far more robust, improves generalization, and helps avoid overfitting.

Common resampling methods include cross-validation and the bootstrap. In tidymodels, for example, cross-validation is set up like this (the bootstrap version appears in its own section below):

folds <- vfold_cv(cell_train, v = 10)  # cross-validation setup
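If you want to peek inside one of these resamples, rsample's analysis() and assessment() accessors pull out the two pieces. A small sketch built on the folds object above:

folds$splits[[1]]               # one resample; prints the analysis/assessment row counts
analysis(folds$splits[[1]])     # ~90% of cell_train, used to fit the model for this fold
assessment(folds$splits[[1]])   # the remaining ~10%, used to evaluate it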

Cross-validation

Cross-validation is one resampling method. Here is a simple example: suppose I have a training set of 30 samples; the diagram below shows 3-fold cross-validation:

The 30 samples are divided equally into 3 folds. Each fold in turn serves as the assessment set, while the remaining 2 folds together form the analysis set used to train the model.

After the data is randomly cut into 3 folds, each fold is used once to evaluate model performance.
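As a quick sanity check, here is a toy sketch of that 30-sample, 3-fold setup; the toy data frame is invented purely for illustration.

library(rsample)
set.seed(1)
toy <- tibble::tibble(id = 1:30, y = rnorm(30))        # 30 made-up samples
cv3 <- vfold_cv(toy, v = 3)
sapply(cv3$splits, function(s) nrow(assessment(s)))    # 10 10 10: each fold assessed once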

Think about it carefully: the cross-validation above involves randomness, namely the initial random split of the data into 3 folds. A single split is just one random draw, so in practice we account for this by repeating the whole procedure many times, for example 10-fold cross-validation repeated 10 times. This is the idea of repeated cross-validation, and it is why the cross-validation function has a repeats parameter (see the sketch below).
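In rsample this is literally just the repeats argument of vfold_cv(); a minimal sketch:

folds_rep <- vfold_cv(cell_train, v = 10, repeats = 10)   # 10-fold CV repeated 10 times = 100 resamples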

Bootstrapping

The bootstrap is, at its core, a method for approximating the sampling distribution of a statistic, as mentioned in a previous article.

Bootstrap resampling was originally invented as a method for approximating the sampling distribution of statistics whose theoretical properties are intractable.

In machine learning, bootstrapping the training set means randomly drawing, with replacement, a sample from the training set that is the same size as the training set. Again, let's look at an example of bootstrap sampling for a training set of 30 samples:

Here we bootstrapped the original 30 training samples 3 times, and each draw of 30 samples contains duplicates. For example, in the first draw sample 8 appears more than once while sample 2 is never drawn. The bootstrap sample is then used for training, and the samples that were never drawn serve as the assessment set. These undrawn samples are called out-of-bag samples; the "out-of-bag validation" you see in papers refers to exactly this.
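In rsample this is handled by bootstraps(); a minimal sketch on the same cell_train data, showing that the analysis set is as large as the training set while the assessment set is the out-of-bag remainder:

set.seed(2023)
boots <- bootstraps(cell_train, times = 3)    # 3 bootstrap resamples
nrow(analysis(boots$splits[[1]]))             # same number of rows as cell_train (drawn with replacement)
nrow(assessment(boots$splits[[1]]))           # the out-of-bag rows; their number varies from draw to draw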

Rolling sampling

For time-dependent data, such as panel data, we must respect the time ordering when resampling. The method used here is called rolling forecast origin resampling; the following diagram illustrates it:

The resamples move forward in time, ensuring that we always train on older data and evaluate on newer data. In the example above, the window advances by one sample at a time and drops the oldest sample; in practice we can advance by several samples at a time, or keep the older samples instead of dropping them (a cumulative window).
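rsample offers rolling_origin() for this. A sketch with purely illustrative numbers and a hypothetical time-ordered data frame time_series_data:

ts_folds <- rolling_origin(
  time_series_data,      # hypothetical data, already ordered by time
  initial    = 20,       # size of the first analysis window
  assess     = 5,        # number of newer points used for assessment
  cumulative = FALSE     # FALSE = slide the window; TRUE = keep accumulating old data
)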

Understanding the Role of Resampling

The different resampling methods were reviewed above. What must always be kept in mind is that resampling is used to select the optimal model and to reduce underfitting and overfitting (many students skip this step when building prediction models; that is not wrong, but it is not ideal either). With resampling, we train and evaluate a model on every resample: if a given scheme draws 20 resamples, the model is trained and evaluated 20 times, and the average of those 20 performance estimates is taken as the model's performance. In this way, we do our best to ensure that the model sent to the test set is the optimal one, so that the test set is used only once, and that single use truly reflects the performance of the best model.

This sequence repeats for every resample. If there are B resamples, there are B replicates of each of the performance metrics. The final resampling estimate is the average of these B statistics. If B = 1, as with a validation set, the individual statistics represent overall performance.

How do we use this in practice? tidymodels provides the corresponding interface:

model_spec %>% fit_resamples(formula,  resamples, ...)
model_spec %>% fit_resamples(recipe,   resamples, ...)
workflow   %>% fit_resamples(          resamples, ...)
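A hypothetical usage sketch, plugging a random forest specification and the 10-fold folds object from earlier into this interface; the outcome y is again just a placeholder name.

rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("regression") %>%
  set_engine("ranger")

rf_res <- rf_spec %>%
  fit_resamples(y ~ ., resamples = folds)   # fits and evaluates the model on each of the 10 resamples

collect_metrics(rf_res)                     # mean rmse and rsq across the resamples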

If the interface above is unfamiliar, don't worry: I will walk through the tidymodels framework in a future post, so stay tuned.


Origin blog.csdn.net/tm_ggplot2/article/details/128975668