Learning Python 11: Cross-Validation

  In this lesson, we will learn how to use cross-validation to get a better measure of model performance.

  1. Introduction

    Machine learning is an iterative process. We face choices about which variables to use as predictors, what type of model to use, and what parameters to supply to those models.

    So far, we have made these choices in a data-driven way by measuring model quality on a validation (or holdout) set.

    But this approach has a drawback. To see it, suppose you have a dataset with 5,000 rows. You would typically keep about 20% of the data as a validation set, i.e., 1,000 rows. This leaves some random chance in determining model scores.

    In other words, a model might do well on one set of 1,000 rows even if it would be inaccurate on a different set of 1,000 rows. In the extreme case, imagine a validation set with only a single row of data.

    In general, the larger the validation set, the less randomness (or "noise") there is in our measure of model quality, and the more reliable that measure will be.

    Unfortunately, we can only get a larger validation set by removing rows from the training data, and a smaller training dataset means worse models!
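
    To see this variability in action, here is a minimal sketch (not part of the original lesson; the synthetic dataset is an illustrative stand-in for a 5,000-row dataset) that scores the same model on five different random 20% holdout sets:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 5,000-row dataset (illustrative assumption)
X_demo, y_demo = make_regression(n_samples=5000, n_features=5, noise=10.0, random_state=0)

# The same model can score differently depending on which rows land in the holdout set
for seed in range(5):
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)
    print(f"seed={seed}: MAE = {mean_absolute_error(y_valid, model.predict(X_valid)):.1f}")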

  2. What is cross-validation?

    In cross-validation, we run our modeling process on different subsets of the data to obtain multiple measures of model quality.

    For example, we could divide the data into five pieces, each comprising 20% of the full dataset.

    In this case, we say that we have broken the data into five "folds."

    Then we run an experiment for each fold:

    In Experiment 1, we use the first fold as the validation (or holdout) set and everything else as training data. This gives us a measure of model quality based on a 20% holdout set.

    In Experiment 2, we hold out the data from the second fold (and use everything except the second fold to train the model). The holdout set then gives us a second estimate of model quality.

    We repeat this process, using every fold once as the holdout set. Putting this together, 100% of the data is used as a holdout at some point, and we end up with a measure of model quality based on all the rows in the dataset (even though we don't use all the rows at once).
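
    As a rough sketch of the fold mechanics described above (illustrative only, using a tiny toy array rather than the lesson's data), scikit-learn's KFold produces the train/holdout row indices for each experiment:

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # a tiny 10-row dataset

# Each of the 5 experiments holds out a different fold and trains on the rest
kf = KFold(n_splits=5)
for i, (train_idx, holdout_idx) in enumerate(kf.split(X_toy), start=1):
    print(f"Experiment {i}: holdout rows {holdout_idx}, training rows {train_idx}")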

  3. When should you use cross-validation?

    Cross-validation gives a more accurate measure of model quality, which is especially important when we are making a lot of modeling decisions. However, it can take longer to run, because it estimates multiple models (one per fold).

    So, given these trade-offs, when should we use each approach? For small datasets, where the extra computational burden isn't a big deal, we should run cross-validation.

    For larger datasets, a single validation set is sufficient. Our code will run faster, and we may have enough data that there is little need to reuse some of it for holdout.

    There is no simple threshold separating small and large datasets. But if your model takes a couple of minutes or less to run, it is probably worth switching to cross-validation.

    Alternatively, you can run cross-validation and see whether the scores from the individual experiments are close. If each experiment yields roughly the same result, a single validation set is probably sufficient.
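
    To make this check concrete, here is a minimal sketch (not from the original lesson; the score values are placeholders) comparing the spread of the per-fold scores to their mean:

import numpy as np

# `scores` stands in for the per-fold MAE array from cross_val_score (see Section 4);
# these values are placeholders purely for illustration
scores = np.array([250.0, 255.0, 248.0, 260.0, 252.0])

# A small spread relative to the mean suggests the experiments agree,
# in which case a single validation set may be enough
print("fold-to-fold std:", scores.std())
print("relative spread:", scores.std() / scores.mean())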

  4. Example

    1) We will work with the same Melbourne housing data as in earlier lessons.

        We load the predictors into X and the target into y.

import pandas as pd

# Read the Melbourne housing data
data = pd.read_csv('E:/data_handle/melb_data.csv')

cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

y = data.Price

    2) Create a pipeline

        Next, we define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions.

           While it is possible to do cross-validation without a pipeline, it is quite difficult! Using a pipeline keeps the code remarkably simple.
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50, random_state=0))])

    3) Obtain the MAE scores

from sklearn.model_selection import cross_val_score
# Multiply by -1, since sklearn computes *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)

    4) Obtain the average MAE

        The scoring parameter selects the measure of model quality to report:

          In this case, we chose negative mean absolute error (MAE). The scikit-learn docs show a list of options.

            Specifying a *negative* MAE may look a little strange. Scikit-learn follows a convention that all metrics are defined so that a higher number is better.
          Using negatives here keeps MAE consistent with that convention, even though negative MAE is almost unheard of anywhere else.
            We usually want a single measure of model quality to compare alternative models, so we take the average across the experiments.
print("Average MAE score (across experiments):")
print(scores.mean())
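
        The lesson notes that the scikit-learn docs list the available scoring options; in recent scikit-learn versions (1.0 or newer, an assumption about your install) they can also be listed programmatically:

from sklearn.metrics import get_scorer_names

# Prints every string accepted by the scoring parameter,
# including 'neg_mean_absolute_error' (requires scikit-learn >= 1.0)
print(get_scorer_names())
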
    5) Summary
        Using cross-validation gives a much better measure of model quality, with the added benefit of cleaning up our code:
          Note that we no longer need to keep track of separate training and validation sets. So, especially for small datasets, this is a nice improvement!
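
        Finally, to illustrate the "modeling decisions" point from Section 3, here is a hedged sketch (the get_avg_mae helper is hypothetical, not from the lesson) of using the cross-validated average to compare forest sizes:

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def get_avg_mae(n_estimators):
    # Average cross-validated MAE for a given forest size
    # (hypothetical helper, reusing X and y from the example above)
    pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                               ('model', RandomForestRegressor(n_estimators=n_estimators,
                                                               random_state=0))])
    cv_scores = -1 * cross_val_score(pipeline, X, y, cv=5,
                                     scoring='neg_mean_absolute_error')
    return cv_scores.mean()

# Compare a few forest sizes; the setting with the lowest average MAE wins
for n in [50, 100, 200]:
    print(n, get_avg_mae(n))
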
  5. Complete code
      
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.model_selection import cross_val_score

data = pd.read_csv('E:/data_handle/melb_data.csv')

cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

y = data.Price

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50, random_state=0))])
# Multiply by -1, since sklearn computes *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)

print("Average MAE score (across experiments):")
print(scores.mean())

 

That's it for this lesson!

Keep going! It is difficult, but if you stick with it, you can learn this.
