In this lesson, we will learn how to use cross-validation to get a better measure of model performance.
1. Introduction
Machine learning is an iterative process. We face choices about which predictive variables to use, which types of models to try, and which parameters to supply to those models.
So far, we have made these choices in a data-driven way by measuring model quality on a validation (or holdout) set.
To see the drawback of this approach, suppose you have a dataset with 5,000 rows. You would typically keep about 20% of the data as a validation set, that is, 1,000 rows. But this leaves some random chance in determining model scores.
In other words, a model might do well on one set of 1,000 rows even though it would be inaccurate on a different set of 1,000 rows. In the extreme case, you could imagine a validation set containing only a single row of data.
In general, the larger the validation set, the less randomness (or "noise") there is in our measure of model quality, and the more reliable that measure becomes.
Unfortunately, we can only get a larger validation set by removing rows from the training data, and a smaller training dataset means worse models.
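To make this point concrete, here is a minimal sketch of holdout randomness. It uses a synthetic 5,000-row regression dataset and a linear model as stand-ins (both are assumptions for illustration, not the Melbourne data used later): the only thing that changes between runs is which random 20% of the rows forms the validation set, yet the score shifts.

```python
# Sketch: how a holdout score varies when only the random split changes.
# make_regression generates a synthetic dataset (a stand-in for real data).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=5, noise=25.0, random_state=0)

maes = []
for seed in range(5):
    # A different random 20% holdout each time
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    maes.append(mean_absolute_error(y_val, model.predict(X_val)))

# The five MAE values differ even though the model and data are the same
print([round(m, 2) for m in maes])
```

The spread of these five numbers is the "noise" in a single-holdout estimate; cross-validation averages it away.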
2. What is cross-validation?
In cross-validation, we run our modeling process on different subsets of the data to obtain multiple measures of model quality.
For example, we could divide the data into five pieces, each containing 20% of the full dataset.
In this case, we say that the data has been divided into five "folds."
Then we run one experiment for each fold:
In Experiment 1, we use the first fold as a validation (or holdout) set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set.
In Experiment 2, we hold out the data from the second fold (and use everything except the second fold to train the model). The holdout set is then used to get a second estimate of model quality.
We repeat this process, using every fold once as the holdout set. Putting this together, 100% of the data is used as a holdout at some point, and we end up with a measure of model quality that is based on all the rows in the dataset (even though we never use all rows at the same time).
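The fold-by-fold procedure above can be sketched with scikit-learn's `KFold` splitter on a toy array of 10 rows (the data here is purely illustrative): each experiment holds out a different fold, and together the holdout folds cover every row exactly once.

```python
# Sketch: 10 rows split into 5 folds; each row is held out exactly once.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy rows, 2 columns
kf = KFold(n_splits=5)

held_out = []
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Experiment {i}: train on rows {train_idx.tolist()}, "
          f"hold out rows {test_idx.tolist()}")
    held_out.extend(test_idx.tolist())

# Across the five experiments, every row appears in a holdout set once
print(sorted(held_out))
```

Note that `cross_val_score`, used in the example below, performs this splitting internally, so you rarely need to write the loop yourself.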
3. When should you use cross-validation?
Cross-validation gives a more accurate measure of model quality, which is especially important when we are making many modeling decisions. However, it can take longer to run, because it estimates multiple models (one per fold).
So, given these trade-offs, when should each approach be used? For small datasets, the extra computational burden is not a big deal, so we should run cross-validation.
For larger datasets, a single validation set is sufficient. Our code will run faster, and we may have enough data that there is little need to reuse some of it as a holdout.
There is no simple threshold separating small datasets from large ones. But if your model takes a couple of minutes or less to run, it is probably worth switching to cross-validation.
Alternatively, you can run cross-validation and see whether the scores from each experiment are close. If every experiment yields similar results, a single validation set is probably sufficient.
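As a sketch of that check, assuming a synthetic dataset and a Ridge model in place of a real workload (both are stand-ins for illustration), we can compare the per-fold scores and look at their spread:

```python
# Sketch: do the per-fold scores agree closely enough that a single
# validation set would suffice? Synthetic data stands in for a real dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2000, n_features=5, noise=10.0, random_state=0)

# Multiply by -1, since sklearn reports *negative* MAE
scores = -1 * cross_val_score(Ridge(), X, y, cv=5,
                              scoring='neg_mean_absolute_error')
spread = scores.max() - scores.min()
print(f"per-fold MAE: {np.round(scores, 2)}, spread: {spread:.2f}")
```

If the spread is small relative to the mean score, a single holdout set is likely to give a trustworthy estimate on its own.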
4. Example
1) Load the data. We work with the same Melbourne housing data as in earlier lessons.
We load the predictors into X and the target into y.
import pandas as pd

data = pd.read_csv('E:/data_handle/melb_data.csv')
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price
2) Create the pipeline
Then we define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions.
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])
3) Obtain the MAE scores
from sklearn.model_selection import cross_val_score

# Multiply by -1, since sklearn computes *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
4) Obtain the average MAE
We set the scoring parameter to choose the measure of model quality to report.
In this example, we chose negative mean absolute error (MAE); the scikit-learn documentation shows a list of options.
print("Average MAE score (across experiments):")
print(scores.mean())
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

data = pd.read_csv('E:/data_handle/melb_data.csv')
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])

# Multiply by -1, since sklearn computes *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
print("Average MAE score (across experiments):")
print(scores.mean())
That wraps up this lesson!
Keep going: it may be difficult, but stick with it and you will get there.