Explanation and application of sklearn's pipeline.Pipeline and preprocessing.PolynomialFeatures

pipeline.Pipeline
Recommended blog post: https://blog.csdn.net/lanchunhui/article/details/50521648
1. Using sklearn's Pipeline

(1) Introduction

When we apply preprocessing steps to the training set (feature standardization, principal component analysis, and so on),
we need to reuse the fitted parameters of those steps on the test set.

Pipeline wraps and manages all of these steps as one streamlined unit, making it easy to reuse the fitted parameter set on new data.

A pipeline is useful in the following situations:

Modular feature transforms: very little code is needed to add a new feature to the training set.

Automated grid search: once the model and candidate parameter values are specified, the search runs automatically and records the best model.

Automated ensemble generation: periodically take the best K models found so far and combine them into an ensemble.
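As a sketch of the grid-search use case above: a step's hyperparameters are addressed as the step name plus a double underscore (e.g. `clf__C`); the candidate values below are purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

pipe = Pipeline([('sc', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])

# Hyperparameters of a pipeline step are named <step name>__<param name>
param_grid = {'clf__C': [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)          # scaling is re-fitted inside each CV split
print(search.best_params_)
```

Because the scaler lives inside the pipeline, it is re-fitted on each cross-validation training fold, avoiding leakage from the validation fold.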

(2) Example:

Note that every intermediate step of a pipeline must be a transformer, i.e. it must implement fit and transform (or fit_transform).

The last step must be an estimator: it needs a fit method, but a transform method is optional.
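A minimal sketch of what "has fit and transform" means in practice; the class name and the mean-centering logic are made up for illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Toy transformer: subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        # Learn parameters from the training data only
        self.mean_ = np.asarray(X).mean(axis=0)
        return self  # fit must return self so it can be chained

    def transform(self, X):
        # Reuse the learned parameters on any new data
        return np.asarray(X) - self.mean_

X_train = [[1.0, 10.0], [3.0, 30.0]]
mc = MeanCenterer().fit(X_train)
print(mc.transform([[2.0, 20.0]]))  # centered with the *training* means
```

Any class with this shape can be used as an intermediate pipeline step.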


Then train on the training set with Pipeline.fit: pipe_lr.fit(X_train, y_train),
and predict and score on the test set directly with Pipeline.score: pipe_lr.score(X_test, y_test).

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris 
 
# Load the iris dataset
iris = load_iris()
X_data = iris.data
y_data = iris.target
 
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data,
                                                    test_size=0.25, random_state=1)
 
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
 
# Build the pipeline
pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))
                    ])
pipe_lr.fit(X_train, y_train)
print('Test accuracy: %.3f' % pipe_lr.score(X_test, y_test))
 
Test accuracy: 0.842
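The pipeline score above works because Pipeline replays the scaler and PCA fitted on the training data. Written out by hand, the rough equivalent of the pipeline example is:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=1)

# The three steps by hand: fit each on the training set only,
# then reuse the fitted parameters on the test set
sc = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(sc.transform(X_train))
clf = LogisticRegression(random_state=1).fit(
    pca.transform(sc.transform(X_train)), y_train)

acc = clf.score(pca.transform(sc.transform(X_test)), y_test)
print('Test accuracy: %.3f' % acc)
```

The pipeline version does exactly this bookkeeping for you, which is what "reusing the fitted parameter set" means.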

2. preprocessing.PolynomialFeatures for feature construction
Use sklearn.preprocessing.PolynomialFeatures to construct polynomial features. (Note: the get_feature_names() call used below was replaced by get_feature_names_out() in newer scikit-learn versions.)

It works by taking polynomial combinations of the features: given two features a and b, the degree-2 expansion is (1, a, b, a^2, ab, b^2).

PolynomialFeatures has three main parameters:

degree: the degree of the polynomial.

interaction_only: defaults to False. If True, features are never combined with themselves, so the degree-2 expansion above would not contain a^2 or b^2.

include_bias: defaults to True. If True, the expansion includes the constant 1 term shown above.
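A quick check of the two boolean flags on a single two-feature sample (the column order follows the (1, a, b, a^2, ab, b^2) expansion above):

```python
from sklearn.preprocessing import PolynomialFeatures

X = [[2, 3]]  # a=2, b=3

# Full degree-2 expansion: 1, a, b, a^2, ab, b^2
full = PolynomialFeatures(degree=2).fit_transform(X)
print(full)      # [[1. 2. 3. 4. 6. 9.]]

# interaction_only=True drops a^2 and b^2, keeping 1, a, b, ab
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)
print(inter)     # [[1. 2. 3. 6.]]

# include_bias=False drops the leading constant-1 column
no_bias = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(no_bias)   # [[2. 3. 4. 6. 9.]]
```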

from sklearn.preprocessing import PolynomialFeatures
X_train = [[1],[2],[3],[4]]
quadratic_featurizer_2 = PolynomialFeatures(degree=2)
X_train_quadratic_2 = quadratic_featurizer_2.fit_transform(X_train)
print("feature names")
print(quadratic_featurizer_2.get_feature_names())
print(X_train_quadratic_2)
 
quadratic_featurizer_3 = PolynomialFeatures(degree=3)
X_train_quadratic_3 = quadratic_featurizer_3.fit_transform(X_train)
print("feature names")
print(quadratic_featurizer_3.get_feature_names())
print(X_train_quadratic_3)
 
X_train = [[1,3],[2,6],[3,7],[4,8]]
quadratic_featurizer_2 = PolynomialFeatures(degree=2)
X_train_quadratic_2 = quadratic_featurizer_2.fit_transform(X_train)
print("feature names")
print(quadratic_featurizer_2.get_feature_names())
print(X_train_quadratic_2)
 
quadratic_featurizer_3 = PolynomialFeatures(degree=3)
X_train_quadratic_3 = quadratic_featurizer_3.fit_transform(X_train)
print("feature names")
print(quadratic_featurizer_3.get_feature_names())
print(X_train_quadratic_3)

Output

feature names
['1', 'x0', 'x0^2']
[[  1.   1.   1.]
 [  1.   2.   4.]
 [  1.   3.   9.]
 [  1.   4.  16.]]
feature names
['1', 'x0', 'x0^2', 'x0^3']
[[  1.   1.   1.   1.]
 [  1.   2.   4.   8.]
 [  1.   3.   9.  27.]
 [  1.   4.  16.  64.]]
feature names
['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
[[  1.   1.   3.   1.   3.   9.]
 [  1.   2.   6.   4.  12.  36.]
 [  1.   3.   7.   9.  21.  49.]
 [  1.   4.   8.  16.  32.  64.]]
feature names
['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3']
[[   1.    1.    3.    1.    3.    9.    1.    3.    9.   27.]
 [   1.    2.    6.    4.   12.   36.    8.   24.   72.  216.]
 [   1.    3.    7.    9.   21.   49.   27.   63.  147.  343.]
 [   1.    4.    8.   16.   32.   64.   64.  128.  256.  512.]]
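The two tools combine naturally: a sketch of polynomial regression that chains PolynomialFeatures into a Pipeline ahead of a linear model. The quadratic toy data here is made up for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy data following y = x^2, which a plain linear fit cannot capture
X = np.arange(1, 6).reshape(-1, 1).astype(float)
y = (X ** 2).ravel()

model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                  ('lr', LinearRegression())])
model.fit(X, y)
print(model.predict([[6.0]]))  # close to 36
```

Because the feature construction lives inside the pipeline, the same degree-2 expansion fitted on the training data is applied automatically at prediction time.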



Reposted from blog.csdn.net/weixin_42542536/article/details/90322216