pipeline.Pipeline
Recommended blog post: https://blog.csdn.net/lanchunhui/article/details/50521648
1. Using sklearn Pipeline
(1) Overview
When we apply preprocessing steps to the training set (feature standardization, principal component analysis, and so on), we need to reuse the fitted parameters on the test set.
Pipeline chains all of these steps into a single managed workflow, making it easy to reapply the fitted parameters to new data.
A pipeline is useful in several situations:
Modular feature transforms: new features can be added to the training set with very little code.
Automated grid search: given a model and a set of candidate parameters, the best model can be searched for and recorded automatically.
Automated ensemble generation: periodically take the current best K models and combine them into an ensemble.
(2) Example:
Note that every intermediate step in a pipeline must be a transformer, i.e., it must implement fit and transform (or fit_transform).
The last step must be an estimator: it needs a fit method but does not need a transform method.
The whole pipeline is then trained with a single call, pipe_lr.fit(X_train, y_train),
and the test set can be scored directly with pipe_lr.score(X_test, y_test).
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Load the iris dataset
iris = load_iris()
X_data = iris.data
y_data = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.25, random_state=1)

# Build the pipeline: standardize, reduce to 2 components, then classify
pipe_lr = Pipeline([('sc', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))])
pipe_lr.fit(X_train, y_train)
print('Test accuracy: %.3f' % pipe_lr.score(X_test, y_test))
Output:
Test accuracy: 0.842
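The automated grid search mentioned above can be sketched as follows. In GridSearchCV, pipeline parameters are addressed as step-name plus double underscore plus parameter name (e.g., 'clf__C'); the candidate values below are illustrative choices, not from the original text:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=1)

pipe = Pipeline([('sc', StandardScaler()),
                 ('pca', PCA()),
                 ('clf', LogisticRegression(random_state=1, max_iter=1000))])

# Parameter names follow the '<step name>__<parameter>' convention,
# so the search can tune any step of the pipeline at once.
param_grid = {'pca__n_components': [2, 3],
              'clf__C': [0.1, 1.0, 10.0]}

gs = GridSearchCV(pipe, param_grid, cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_)
print('Best CV accuracy: %.3f' % gs.best_score_)
```

Because the whole pipeline is refit inside each cross-validation fold, the scaler and PCA are never fitted on validation data, which avoids leakage.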
2. preprocessing.PolynomialFeatures for feature construction
sklearn.preprocessing.PolynomialFeatures constructs new features as polynomial combinations of the existing ones.
For two features a and b, the degree-2 expansion is (1, a, b, a^2, ab, b^2).
PolynomialFeatures has three main parameters:
degree: the degree of the polynomial.
interaction_only: defaults to False. If True, a feature is never combined with itself, so the degree-2 expansion above would not contain a^2 or b^2.
include_bias: defaults to True. If True, the expansion includes the constant 1 term shown above.
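A small sketch of how interaction_only and include_bias change the output, using a single made-up sample [a, b] = [2, 3]:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2, 3]])  # one sample with features a=2, b=3

# interaction_only=True drops the pure powers a^2 and b^2,
# keeping only 1, a, b, and the cross term ab.
inter = PolynomialFeatures(degree=2, interaction_only=True)
print(inter.fit_transform(X))      # [[1. 2. 3. 6.]]

# include_bias=False drops the leading constant-1 column,
# keeping a, b, a^2, ab, b^2.
no_bias = PolynomialFeatures(degree=2, include_bias=False)
print(no_bias.fit_transform(X))    # [[2. 3. 4. 6. 9.]]
```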
from sklearn.preprocessing import PolynomialFeatures

# One input feature
X_train = [[1], [2], [3], [4]]
quadratic_featurizer_2 = PolynomialFeatures(degree=2)
X_train_quadratic_2 = quadratic_featurizer_2.fit_transform(X_train)
print("feature names")
# Note: get_feature_names() was renamed get_feature_names_out() in newer scikit-learn
print(quadratic_featurizer_2.get_feature_names())
print(X_train_quadratic_2)

quadratic_featurizer_3 = PolynomialFeatures(degree=3)
X_train_quadratic_3 = quadratic_featurizer_3.fit_transform(X_train)
print("feature names")
print(quadratic_featurizer_3.get_feature_names())
print(X_train_quadratic_3)

# Two input features
X_train = [[1, 3], [2, 6], [3, 7], [4, 8]]
quadratic_featurizer_2 = PolynomialFeatures(degree=2)
X_train_quadratic_2 = quadratic_featurizer_2.fit_transform(X_train)
print("feature names")
print(quadratic_featurizer_2.get_feature_names())
print(X_train_quadratic_2)

quadratic_featurizer_3 = PolynomialFeatures(degree=3)
X_train_quadratic_3 = quadratic_featurizer_3.fit_transform(X_train)
print("feature names")
print(quadratic_featurizer_3.get_feature_names())
print(X_train_quadratic_3)
Output:
feature names
['1', 'x0', 'x0^2']
[[ 1. 1. 1.]
[ 1. 2. 4.]
[ 1. 3. 9.]
[ 1. 4. 16.]]
feature names
['1', 'x0', 'x0^2', 'x0^3']
[[ 1. 1. 1. 1.]
[ 1. 2. 4. 8.]
[ 1. 3. 9. 27.]
[ 1. 4. 16. 64.]]
feature names
['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
[[ 1. 1. 3. 1. 3. 9.]
[ 1. 2. 6. 4. 12. 36.]
[ 1. 3. 7. 9. 21. 49.]
[ 1. 4. 8. 16. 32. 64.]]
feature names
['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2', 'x0^3', 'x0^2 x1', 'x0 x1^2', 'x1^3']
[[ 1. 1. 3. 1. 3. 9. 1. 3. 9. 27.]
[ 1. 2. 6. 4. 12. 36. 8. 24. 72. 216.]
[ 1. 3. 7. 9. 21. 49. 27. 63. 147. 343.]
[ 1. 4. 8. 16. 32. 64. 64. 128. 256. 512.]]
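The two techniques in this document combine naturally: PolynomialFeatures as a transformer step inside a Pipeline gives polynomial regression. A minimal sketch, using synthetic data that follows y = x^2 exactly (the data and step names here are made up for illustration):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data: y is exactly x squared
X = np.arange(1, 11).reshape(-1, 1).astype(float)
y = (X ** 2).ravel()

# Expand features to degree 2, then fit an ordinary linear regression
model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                  ('lr', LinearRegression())])
model.fit(X, y)
print(model.predict([[12.0]]))  # close to 144, since y = x^2
```

Because the expansion is part of the pipeline, model.predict applies the same polynomial transform to new inputs automatically.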
References:
https://blog.csdn.net/tiange_xiao/article/details/79755793
https://www.cnblogs.com/magle/p/5881170.html