1. Generate a dataset with make_blobs, then preprocess it
```python
# Import the dataset generator
from sklearn.datasets import make_blobs
# Import the dataset splitting tool
from sklearn.model_selection import train_test_split
# Import the preprocessing tool
from sklearn.preprocessing import StandardScaler
# Import the multilayer perceptron neural network
from sklearn.neural_network import MLPClassifier
# Import the plotting tool
import matplotlib.pyplot as plt
```
```python
# Generate a dataset with 200 samples, 2 classes, and a standard deviation of 5
X, y = make_blobs(n_samples=200, centers=2, cluster_std=5)
# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=38)
# Preprocess the data
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Print the shapes of the processed data
print('\n\n\n')
print('Code running results')
print('====================================\n')
print('Training set shape: {}'.format(X_train_scaled.shape))
print('Test set shape: {}'.format(X_test_scaled.shape))
print('\n====================================')
print('\n\n\n')
```
```
Code running results
====================================

Training set shape: (150, 2)
Test set shape: (50, 2)

====================================
```
```python
# Plot the raw training set
plt.scatter(X_train[:, 0], X_train[:, 1])
# Plot the preprocessed training set
plt.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], marker='^', edgecolor='k')
# Add a title to the figure
plt.title('Training set & Scaled training set')
# Show the figure
plt.show()
```
- Here you can see that after StandardScaler, the training data becomes more tightly "clustered" around the origin.
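To see what StandardScaler actually does to each feature, here is a minimal sketch on a tiny made-up array (not the make_blobs data above): after scaling, every column has mean 0 and standard deviation 1, which is why the points draw closer together around the origin.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A tiny, hypothetical 2-feature sample, just for illustration
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Each column now has mean 0 and standard deviation 1
print(X_scaled.mean(axis=0))  # close to [0. 0.]
print(X_scaled.std(axis=0))   # close to [1. 1.]
```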
```python
# Import grid search
from sklearn.model_selection import GridSearchCV
# Define the parameter grid for the model as a dictionary
params = {'hidden_layer_sizes': [(50,), (100,), (100, 100)],
          'alpha': [0.0001, 0.01, 0.1]}
# Build the grid search model
grid = GridSearchCV(MLPClassifier(max_iter=1600, random_state=38),
                    param_grid=params, cv=3)
# Fit the data
grid.fit(X_train_scaled, y_train)
# Print the results
print('\n\n\n')
print('Code running results')
print('====================================\n')
print('Best model score: {:.2f}'.format(grid.best_score_))
print('Best model parameters: {}'.format(grid.best_params_))
print('\n====================================')
print('\n\n\n')
```
```
Code running results
====================================

Best model score: 0.81
Best model parameters: {'alpha': 0.0001, 'hidden_layer_sizes': (50,)}

====================================
```
```python
# Print the model's score on the test set
print('\n\n\n')
print('Code running results')
print('====================================\n')
print('Test set score: {}'.format(grid.score(X_test_scaled, y_test)))
print('\n====================================')
print('\n\n\n')
```
```
Code running results
====================================

Test set score: 0.82

====================================
```
- This way the model gets a fairly high score, but on reflection the approach is wrong: during cross-validation the training set is split again into training folds and validation folds, yet when we preprocessed with StandardScaler we fitted it on the training folds and validation folds together. As a result, the cross-validation score is inaccurate.
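The leak described above can be made concrete with a small sketch (hypothetical data, not the dataset used in this article): a scaler fitted on the whole training set has different statistics from one fitted only on the training fold, so the validation fold has influenced the preprocessing it is later evaluated with.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(38)
X_train = rng.normal(size=(10, 2))  # pretend training set

# "Wrong" way: scaler fitted on the whole training set, so the
# validation fold (here, the last 5 rows) leaks into the statistics
scaler_all = StandardScaler().fit(X_train)

# What should happen inside cross-validation: fit on the training fold only
scaler_fold = StandardScaler().fit(X_train[:5])

# The fitted means differ, i.e. the validation fold affected the first scaler
print(scaler_all.mean_)
print(scaler_fold.mean_)
```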
2. Using the pipeline model (Pipeline)
```python
# Import the pipeline model
from sklearn.pipeline import Pipeline
# Build a pipeline model containing preprocessing and a neural network
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('mlp', MLPClassifier(max_iter=1600, random_state=38))])
# Fit the pipeline model to the training set
pipeline.fit(X_train, y_train)
# Print the pipeline model's score
print('Score of the MLP model using the pipeline: {:.2f}'.format(
    pipeline.score(X_test, y_test)))
```
```
Score of the MLP model using the pipeline: 0.82
```
- The Pipeline model here contains two steps: one is StandardScaler, used for data preprocessing; the other is an MLP multilayer perceptron neural network with a maximum of 1600 iterations.
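Each of the two steps can also be retrieved by its name after the pipeline is built. A minimal sketch (standalone, reusing the same step names as above):

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('mlp', MLPClassifier(max_iter=1600, random_state=38))])

# named_steps maps each step name to the estimator it wraps
print(pipeline.named_steps['mlp'].max_iter)  # 1600
```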
3. Grid search with the pipeline model
- GridSearchCV splits the data into training and validation sets. This is not the training/test split made by train_test_split; rather, the training set produced by train_test_split is split again into folds.
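This second-level split can be illustrated with KFold, which is essentially what GridSearchCV does internally with cv=3 (a sketch on a tiny made-up training set, not the one used in this article): each of the 3 folds serves once as the validation set while the rest is used for training.

```python
import numpy as np
from sklearn.model_selection import KFold

X_train = np.arange(12).reshape(6, 2)  # pretend training set of 6 samples

# cv=3 means the training set is split into 3 folds
kf = KFold(n_splits=3)
for train_idx, valid_idx in kf.split(X_train):
    # 4 samples for the training folds, 2 for the validation fold
    print(len(train_idx), len(valid_idx))
```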
```python
# Define the parameter grid (the mlp__ prefix tells the pipeline
# that these parameters belong to the mlp step)
params = {'mlp__hidden_layer_sizes': [(50,), (100,), (100, 100)],
          'mlp__alpha': [0.0001, 0.001, 0.01, 0.1]}
# Build a pipeline model containing preprocessing and a neural network
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('mlp', MLPClassifier(max_iter=1600, random_state=38))])
# Put the pipeline model into the grid search
grid = GridSearchCV(pipeline, param_grid=params, cv=3)
# Fit the training set
grid.fit(X_train, y_train)
# Print the model's best cross-validation score, best parameters, and test set score
print('\n\n\n')
print('Code running results')
print('====================================\n')
print('Best cross-validation score: {:.2f}'.format(grid.best_score_))
print('Best model parameters: {}'.format(grid.best_params_))
print('Test set score: {}'.format(grid.score(X_test, y_test)))
print('\n====================================')
print('\n\n\n')
```
```
Code running results
====================================

Best cross-validation score: 0.80
Best model parameters: {'mlp__alpha': 0.0001, 'mlp__hidden_layer_sizes': (50,)}
Test set score: 0.82

====================================
```
- hidden_layer_sizes and alpha are given the prefix mlp__ because a pipeline can contain multiple algorithms, and the pipeline needs to know which algorithm each parameter should be passed to.
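The same `<step name>__<parameter name>` convention works anywhere pipeline parameters are addressed, not just in grid search. A minimal sketch (standalone, same step names as above) using get_params and set_params:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('mlp', MLPClassifier(max_iter=1600, random_state=38))])

# Set the alpha of the mlp step through the pipeline
pipeline.set_params(mlp__alpha=0.01)

# Read it back with the same double-underscore key
print(pipeline.get_params()['mlp__alpha'])  # 0.01
```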
```python
# Print the steps of the pipeline model
print('\n\n\n')
print('Code running results')
print('====================================\n')
print(pipeline.steps)
print('\n====================================')
print('\n\n\n')
```
```
Code running results
====================================

[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('mlp', MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=False, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_iter=1600, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=38, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False))]

====================================
```
Summary:
Besides integrating multiple algorithms and keeping the code simple, the pipeline model also avoids the mistake of preprocessing the training and validation sets together. With a pipeline, grid search can first split the data into training and validation folds before each evaluation and then fit the preprocessing on the training fold only, which avoids an over-optimistic estimate of the model's performance.
Quoted from the book "Python Machine Learning in Plain Language".