Column | Jupyter based on the characteristics of engineering manual: data preprocessing (a)

Click on the top " AI proper way ", select the "top" number of public

Heavy dry goods, the first time served

Author: Yingxiang Chen Yang & Zihan

Editor: Red stone

The importance of engineering characteristics in machine learning is self-evident, appropriate engineering features can significantly improve the performance of machine learning model. Us on Github finishing the preparation of the project features a tutorial system, for your reference study.

project address:

https://github.com/YC-Coder-Chen/feature-engineering-handbook

This article will explore data preprocessing part: explains how to use scikit-learn process static and continuous variables, use Category Encoders handle static class variables and the use of Featuretools handle common time series variables.

table of Contents

Data pre-processing features of the project will be divided into three parts to introduce:

Static continuous variables
Static class variables
Time series variables

This article describes the data preprocessing 1.1 Static continuous variables. Below with Jupyter, using sklearn, for Comments.

1.1 Static continuous variables

1.1.1 Discretization

Discrete continuous variables can make the model more robust. For example, when predicting customer buying behavior, more than 30 times a customer buying behavior may be associated with an existing customer purchase behavior 32 times with a very similar behavior. Sometimes features over accuracy may be noise, which is why LightGBM, the model uses histogram algorithm to prevent over-fitting. There are two methods of discrete continuous variables.

1.1.1.1 Binarization

The binary characteristic value.

# load the sample data
from sklearn.datasets import fetch_california_housing
dataset = fetch_california_housing()
X, y = dataset.data, dataset.target # we will take the first column as the example later

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots()
sns.distplot(X[:,0], hist = True, kde=True)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution

from sklearn.preprocessing import Binarizer


sample_columns = X[0:10,0] # select the top 10 samples
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


model = Binarizer(threshold=6) # set 6 to be the threshold
# if value <= 6, then return 0 else return 1
result = model.fit_transform(sample_columns.reshape(-1,1)).reshape(-1)
# return array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0.])

1.1.1.2 Binning

The numerical characteristic binning.

Box homogeneously:

from sklearn.preprocessing import KBinsDiscretizer


# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform') # set 5 bins
# return oridinal bin number, set all bins to have identical widths


model.fit(train_set.reshape(-1,1))
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([2., 2., 2., 1., 1., 1., 1., 0., 0., 1.])
bin_edge = model.bin_edges_[0]
# return array([ 0.4999 ,  3.39994,  6.29998,  9.20002, 12.10006, 15.0001 ]), the bin edges

# visualiza the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist = True, kde=True)


for edge in bin_edge: # uniform bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Uniform Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);

Quantile binning:

from sklearn.preprocessing import KBinsDiscretizer


# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile') # set 3 bins
# return oridinal bin number, set all bins based on quantile


model.fit(train_set.reshape(-1,1))
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([4., 4., 4., 4., 2., 3., 2., 1., 0., 2.])
bin_edge = model.bin_edges_[0]
# return array([ 0.4999 ,  2.3523 ,  3.1406 ,  3.9667 ,  5.10824, 15.0001 ]), the bin edges
# 2.3523 is the 20% quantile
# 3.1406 is the 40% quantile, etc..

# visualiza the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist = True, kde=True)


for edge in bin_edge: # quantile based bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Quantiles Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);

1.1.2 Zoom

不同尺度的特征之间难以比较，特别是在线性回归和逻辑回归等线性模型中。在基于欧氏距离的 k-means 聚类或 KNN 模型中，就需要进行特征缩放，否则距离的测量是无用的。而对于任何使用梯度下降的算法，缩放也会加快收敛速度。

一些常用的模型：

注：偏度影响 PCA 模型，因此最好使用幂变换来消除偏度。

1.1.2.1 标准缩放（Z 分数标准化）

公式：

其中，X 是变量（特征），???? 是 X 的均值，???? 是 X 的标准差。此方法对异常值非常敏感，因为异常值同时影响到 ???? 和 ????。

from sklearn.preprocessing import StandardScaler


# in order to mimic the operation in real-world, we shall fit the StandardScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = StandardScaler()


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.34539745,  2.33286782,  1.78324852,  0.93339178, -0.0125957 ,
# 0.08774668, -0.11109548, -0.39490751, -0.94221041, -0.09419626])
# result is the same as ((X[0:10,0] - X[10:,0].mean())/X[10:,0].std())

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = StandardScaler()
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()

1.1.2.2 MinMaxScaler（按数值范围缩放）

假设我们要缩放的特征数值范围为 (a, b)。

公式：

其中，Min 是 X 的最小值，Max 是 X 的最大值。此方法也对异常值非常敏感，因为异常值同时影响到 Min 和 Max。

from sklearn.preprocessing import MinMaxScaler


# in order to mimic the operation in real-world, we shall fit the MinMaxScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = MinMaxScaler(feature_range=(0,1)) # set the range to be (0,1)


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([0.53966842, 0.53802706, 0.46602805, 0.35469856, 0.23077613,
# 0.24392077, 0.21787286, 0.18069406, 0.1089985 , 0.22008662])
# result is the same as (X[0:10,0] - X[10:,0].min())/(X[10:,0].max()-X[10:,0].min())

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = MinMaxScaler(feature_range=(0,1))
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout() # now the scale change to [0,1]

1.1.2.3 RobustScaler（抗异常值缩放）

使用对异常值稳健的统计（分位数）来缩放特征。假设我们要将缩放的特征分位数范围为 (a, b)。

公式：

这种方法对异常点鲁棒性更强。

import numpy as np
from sklearn.preprocessing import RobustScaler


# in order to mimic the operation in real-world, we shall fit the RobustScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = RobustScaler(with_centering = True, with_scaling = True, 
                    quantile_range = (25.0, 75.0))
# with_centering = True => recenter the feature by set X' = X - X.median()
# with_scaling = True => rescale the feature by the quantile set by user
# set the quantile to the (25%, 75%)


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 2.19755974,  2.18664281,  1.7077657 ,  0.96729508,  0.14306683,
# 0.23049401,  0.05724508, -0.19003715, -0.66689601,  0.07196918])
# result is the same as (X[0:10,0] - np.quantile(X[10:,0], 0.5))/(np.quantile(X[10:,0],0.75)-np.quantile(X[10:,0], 0.25))

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = RobustScaler(with_centering = True, with_scaling = True, 
                    quantile_range = (25.0, 75.0))
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution is the same, but scales change
fig.tight_layout()

1.1.2.4 幂次变换（非线性变换）

以上介绍的所有缩放方法都保持原来的分布。但正态性是许多统计模型的一个重要假设。我们可以使用幂次变换将原始分布转换为正态分布。

Box-Cox 变换：

Box-Cox 变换只适用于正数，并假设如下分布：

考虑了所有的 λ 值，通过最大似然估计选择稳定方差和最小化偏度的最优值。

from sklearn.preprocessing import PowerTransformer


# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = PowerTransformer(method='box-cox', standardize=True)
# apply box-cox transformation


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.91669292,  1.91009687,  1.60235867,  1.0363095 ,  0.19831579,
# 0.30244247,  0.09143411, -0.24694006, -1.08558469,  0.11011933])

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = PowerTransformer(method='box-cox', standardize=True)
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()

Yeo-Johnson 变换：

Yeo Johnson 变换适用于正数和负数，并假设以下分布：

考虑了所有的 λ 值，通过最大似然估计选择稳定方差和最小化偏度的最优值。

from sklearn.preprocessing import PowerTransformer


# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set


test_set = X[0:10,0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:,0]


model = PowerTransformer(method='yeo-johnson', standardize=True)
# apply box-cox transformation


model.fit(train_set.reshape(-1,1)) # fit on the train set and transform the test set
# top ten numbers for simplification
result = model.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([ 1.90367888,  1.89747091,  1.604735  ,  1.05166306,  0.20617221,
# 0.31245176,  0.09685566, -0.25011726, -1.10512438,  0.11598074])

# visualize the distribution after the scaling
# fit and transform the entire first feature


import seaborn as sns
import matplotlib.pyplot as plt


fig, ax = plt.subplots(2,1, figsize = (13,9))
sns.distplot(X[:,0], hist = True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12); # this feature has long-tail distribution


model = PowerTransformer(method='yeo-johnson', standardize=True)
model.fit(X[:,0].reshape(-1,1)) 
result = model.transform(X[:,0].reshape(-1,1)).reshape(-1)


# show the distribution of the entire feature
sns.distplot(result, hist = True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12); # the distribution now becomes normal
fig.tight_layout()

1.1.3 正则化

以上所有缩放方法都是按列操作的。但正则化在每一行都有效，它试图“缩放”每个样本，使其具有单位范数。由于正则化在每一行都起作用，它会扭曲特征之间的关系，因此不常见。但是正则化方法在文本分类和聚类上下文中是非常有用的。

假设 X[i][j] 表示样本 i 中特征 j 的值。

L1 正则化公式：

L2 正则化公式：

L1 正则化：

from sklearn.preprocessing import Normalizer


# Normalizer performs operation on each row independently
# So train set and test set are processed independently


###### for L1 Norm
sample_columns = X[0:2,0:3] # select the first two samples, and the first three features
# return array([[ 8.3252, 41., 6.98412698],
# [ 8.3014 , 21.,  6.23813708]])


model = Normalizer(norm='l1')
# use L2 Norm to normalize each samples


model.fit(sample_columns) 


result = model.transform(sample_columns) # test set are processed similarly
# return array([[0.14784762, 0.72812094, 0.12403144],
# [0.23358211, 0.59089121, 0.17552668]])
# result = sample_columns/np.sum(np.abs(sample_columns), axis=1).reshape(-1,1)

L2 正则化：

###### for L2 Norm
sample_columns = X[0:2,0:3] # select the first three features
# return array([[ 8.3252, 41., 6.98412698],
# [ 8.3014 , 21.,  6.23813708]])


model = Normalizer(norm='l2')
# use L2 Norm to normalize each samples


model.fit(sample_columns) 


result = model.transform(sample_columns)
# return array([[0.19627663, 0.96662445, 0.16465922],
# [0.35435076, 0.89639892, 0.26627902]])
# result = sample_columns/np.sqrt(np.sum(sample_columns**2, axis=1)).reshape(-1,1)

# visualize the difference in the distribuiton after Normalization
# compare it with the distribuiton after RobustScaling
# fit and transform the entire first & second feature


import seaborn as sns
import matplotlib.pyplot as plt


# RobustScaler
fig, ax = plt.subplots(2,1, figsize = (13,9))


model = RobustScaler(with_centering = True, with_scaling = True, 
                    quantile_range = (25.0, 75.0))
model.fit(X[:,0:2]) 
result = model.transform(X[:,0:2])


sns.scatterplot(result[:,0], result[:,1], ax=ax[0])
ax[0].set_title('Scatter Plot of RobustScaling result', fontsize=12)
ax[0].set_xlabel('Feature 1', fontsize=12)
ax[0].set_ylabel('Feature 2', fontsize=12);


model = Normalizer(norm='l2')


model.fit(X[:,0:2]) 
result = model.transform(X[:,0:2])


sns.scatterplot(result[:,0], result[:,1], ax=ax[1])
ax[1].set_title('Scatter Plot of Normalization result', fontsize=12)
ax[1].set_xlabel('Feature 1', fontsize=12)
ax[1].set_ylabel('Feature 2', fontsize=12);
fig.tight_layout()  # Normalization distort the original distribution

1.1.4 缺失值的估算

在实际操作中，数据集中可能缺少值。然而，这种稀疏的数据集与大多数 scikit 学习模型不兼容，这些模型假设所有特征都是数值的，而没有丢失值。所以在应用 scikit 学习模型之前，我们需要估算缺失的值。

但是一些新的模型，比如在其他包中实现的 XGboost、LightGBM 和 Catboost，为数据集中丢失的值提供了支持。所以在应用这些模型时，我们不再需要填充数据集中丢失的值。

1.1.4.1 单变量特征插补

假设第 i 列中有缺失值，那么我们将用常数或第 i 列的统计数据（平均值、中值或模式）对其进行估算。

from sklearn.impute import SimpleImputer


test_set = X[0:10,0].copy() # no missing values
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manully create some missing values
test_set[3] = np.nan
test_set[6] = np.nan
# now sample_columns becomes 
# array([8.3252, 8.3014, 7.2574,    nan, 3.8462, 4.0368,    nan, 3.12 ,2.0804, 3.6912])


# create the test samples
# in real-world, we should fit the imputer on train set and tranform the test set.
train_set = X[10:,0].copy()
train_set[3] = np.nan
train_set[6] = np.nan


imputer = SimpleImputer(missing_values=np.nan, strategy='mean') # use mean
# we can set the strategy to 'mean', 'median', 'most_frequent', 'constant'
imputer.fit(train_set.reshape(-1,1))
result = imputer.transform(test_set.reshape(-1,1)).reshape(-1)
# return array([8.3252    , 8.3014    , 7.2574    , 3.87023658, 3.8462    ,
# 4.0368    , 3.87023658, 3.12      , 2.0804    , 3.6912    ])
# all missing values are imputed with 3.87023658
# 3.87023658 = np.nanmean(train_set) 
# which is the mean of the trainset ignoring missing values

1.1.4.2 多元特征插补

多元特征插补利用整个数据集的信息来估计和插补缺失值。在 scikit-learn 中，它以循环迭代的方式实现。

在每一步中，一个特征列被指定为输出 y，其他特征列被视为输入 X。一个回归器适用于已知 y 的（X，y）。然后，回归器被用来预测 y 的缺失值。这是以迭代的方式对每个特征进行的，然后对最大值插补回合重复进行。

使用线性模型（以 BayesianRidge 为例）：

from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge


test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manully create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes 
# array([8.3252, 8.3014, 7.2574,    nan, 3.8462, 4.0368,    nan, 3.12 ,2.0804, 3.6912])


# create the test samples
# in real-world, we should fit the imputer on train set and tranform the test set.
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan


impute_estimator = BayesianRidge()
imputer = IterativeImputer(max_iter = 10, 
                           random_state = 0, 
                           estimator = impute_estimator)


imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to revel how it works
# return array([8.3252    , 8.3014    , 7.2574    , 4.6237195 , 3.8462    ,
# 4.0368    , 4.00258149, 3.12      , 2.0804    , 3.6912    ])

使用基于树的模型（以 ExtraTrees 为例）：

from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor


test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manully create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes 
# array([8.3252, 8.3014, 7.2574,    nan, 3.8462, 4.0368,    nan, 3.12 ,2.0804, 3.6912])


# create the test samples
# in real-world, we should fit the imputer on train set and tranform the test set.
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan


impute_estimator = ExtraTreesRegressor(n_estimators=10, random_state=0)
# parameters can be turned in CV though sklearn pipeline
imputer = IterativeImputer(max_iter = 10, 
                           random_state = 0, 
                           estimator = impute_estimator)


imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to revel how it works
# return array([8.3252 , 8.3014 , 7.2574 , 4.63813, 3.8462 , 4.0368 , 3.24721,
# 3.12   , 2.0804 , 3.6912 ])

使用 K 近邻（KNN）：

from sklearn.experimental import enable_iterative_imputer # have to import this to enable
# IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor


test_set = X[0:10,:].copy() # no missing values, select all features
# the first columns is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])


# manully create some missing values
test_set[3,0] = np.nan
test_set[6,0] = np.nan
test_set[3,1] = np.nan
# now the first feature becomes 
# array([8.3252, 8.3014, 7.2574,    nan, 3.8462, 4.0368,    nan, 3.12 ,2.0804, 3.6912])


# create the test samples
# in real-world, we should fit the imputer on train set and tranform the test set.
train_set = X[10:,:].copy()
train_set[3,0] = np.nan
train_set[6,0] = np.nan
train_set[3,1] = np.nan


impute_estimator = KNeighborsRegressor(n_neighbors=10, 
                                       p = 1)  # set p=1 to use manhanttan distance
# use manhanttan distance to reduce effect from outliers


# parameters can be turned in CV though sklearn pipeline
imputer = IterativeImputer(max_iter = 10, 
                           random_state = 0, 
                           estimator = impute_estimator)


imputer.fit(train_set)
result = imputer.transform(test_set)[:,0] # only select the first column to revel how it works
# return array([8.3252, 8.3014, 7.2574, 3.6978, 3.8462, 4.0368, 4.052 , 3.12  ,
# 2.0804, 3.6912])

1.1.4.3 标记估算值

有时，某些缺失值可能是有用的。因此，scikit learn 还提供了将缺少值的数据集转换为相应的二进制矩阵的功能，该矩阵指示数据集中缺少值的存在。

from sklearn.impute import MissingIndicator


# illustrate this function on trainset only
# since the precess is independent in train set and test set
train_set = X[10:,:].copy() # select all features
train_set[3,0] = np.nan # manully create some missing values
train_set[6,0] = np.nan
train_set[3,1] = np.nan


indicator = MissingIndicator(missing_values=np.nan, features='all') 
# show the results on all the features
result = indicator.fit_transform(train_set) # result have the same shape with train_set
# contains only True & False, True corresponds with missing value


result[:,0].sum() # should return 2, the first column has two missing values
result[:,1].sum(); # should return 1, the second column has one missing value

1.1.5 特征变换

1.1.5.1 多项式变换

有时我们希望在模型中引入非线性特征，从而增加模型的复杂度。对于简单的线性模型，这将大大增加模型的复杂度。但是对于更复杂的模型，如基于树的 ML 模型，它们已经在非参数树结构中包含了非线性关系。因此，这种特性转换可能对基于树的 ML 模型没有太大帮助。

例如，如果我们将阶数设置为 3，形式如下：

from sklearn.preprocessing import PolynomialFeatures


# illustrate this function on one synthesized sample
train_set = np.array([2,3]).reshape(1,-1) # shape (1,2)
# return array([[2, 3]])


poly = PolynomialFeatures(degree = 3, interaction_only = False)
# the highest degree is set to 3, and we want more than just intereaction terms


result = poly.fit_transform(train_set) # have shape (1, 10)
# array([[ 1.,  2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.]])

1.1.5.2 自定义变换

from sklearn.preprocessing import FunctionTransformer


# illustrate this function on one synthesized sample
train_set = np.array([2,3]).reshape(1,-1) # shape (1,2)
# return array([[2, 3]])


transformer = FunctionTransformer(func = np.log1p, validate=True)
# perform log transformation, X' = log(1 + x)
# func can be any numpy function such as np.exp
result = transformer.transform(train_set)
# return array([[1.09861229, 1.38629436]]), the same as np.log1p(train_set)

好了，以上就是关于静态连续变量的数据预处理介绍。建议读者结合代码，在 Jupyter 中实操一遍。

推荐阅读

（点击标题可跳转阅读）

干货 | 公众号历史文章精选

我的深度学习入门路线

我的机器学习入门路线图

重磅！

林轩田机器学习完整视频和博主笔记来啦！

扫描下方二维码，添加 AI有道小助手微信，可申请入群，并获得林轩田机器学习完整视频 + 博主红色石头的精炼笔记（一定要备注：入群 + 地点 + 学校/公司。例如：入群+上海+复旦。