Simplifying the modeling process with sklearn's Pipeline

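Many frameworks provide a pipeline mechanism: a sequence of operations is wrapped into a single flow, and calling it runs the steps in the planned order. Netty's ChannelPipeline is one example, and TensorFlow's computation graph follows a similar idea.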
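To make the idea concrete, here is a minimal sketch comparing a hand-written "scale, then fit" sequence with the same two steps wrapped in a sklearn Pipeline. The toy arrays X_demo and y_demo, and the choice of StandardScaler and LogisticRegression, are assumptions made purely for illustration; they are not part of the example that follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration only
X_demo = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 240.0],
                   [4.0, 210.0], [5.0, 300.0], [6.0, 310.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

# Manual version: every step has to be called explicitly, in the right order
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
manual_model = LogisticRegression().fit(X_scaled, y_demo)
manual_preds = manual_model.predict(scaler.transform(X_demo))

# Pipeline version: the same steps are declared once and run in order
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_demo, y_demo)           # fit_transform on the scaler, then fit on the model
pipe_preds = pipe.predict(X_demo)  # transform with the scaler, then predict

# Both approaches should agree on the predictions
print((manual_preds == pipe_preds).all())
```

The pipeline version keeps the fitted scaler and the model together in one object, so the same preprocessing is applied automatically at prediction time.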

The following is a brief introduction to using Pipeline in sklearn:

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

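# Preprocessor for categorical features: fill missing values with the
# most frequent category, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])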

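# Preprocessor for numerical features: fill missing values with a constant
# (fill_value defaults to 0 for numeric data)
numerical_transformer = SimpleImputer(strategy='constant')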

# Apply the numerical and categorical preprocessors to their respective columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, ['Age']),
        ('cat', categorical_transformer, ['Embarked'])
    ])

# Define the Pipeline from the preprocessor and the chosen model
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=0))
])

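# Use the pipeline
# X is assumed to be a pandas DataFrame containing the 'Age' and 'Embarked'
# columns, and y the corresponding labels; neither is defined in this snippet
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=0)
# Train. With the default copy=True, SimpleImputer works on a copy,
# so X_train itself is not modified and no explicit .copy() is needed
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)  # Predict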
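The snippet above leaves X and y undefined. As a rough end-to-end sketch, the block below builds the same pipeline on a small hand-made DataFrame so it can be run as-is. The ten rows of 'Age' and 'Embarked' values and the labels in y are hypothetical and exist only to exercise the code; real use would load an actual dataset instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values in both columns
X = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0, 4.0, 58.0],
    'Embarked': ['S', 'C', 'S', np.nan, 'Q', 'S', 'C', np.nan, 'S', 'Q'],
})
y = pd.Series([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])  # hypothetical labels

# Same preprocessing as in the article
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
numerical_transformer = SimpleImputer(strategy='constant')
preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, ['Age']),
    ('cat', categorical_transformer, ['Embarked']),
])
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(n_estimators=100, random_state=0)),
])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

X_train_before = X_train.copy()  # kept only to check that fit() leaves X_train alone
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)

print(preds)
print(X_train.equals(X_train_before))  # the training frame still holds its NaN values
```

Because imputation and encoding happen inside the pipeline, the validation rows go through exactly the transformations that were fitted on the training rows, with no separate preprocessing call.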


Reposted from www.cnblogs.com/lunge-blog/p/11940377.html