Handling Numerical Data
MinMaxScaler
Rescaling features to a common range helps the performance of machine learning algorithms.
# Load libraries
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Create feature
feature = np.array([[-500.5],
[-100.1],
[0],
[100.1],
[900.9]])
# Create scaler
minmax_scale = MinMaxScaler(feature_range=(0, 1))
# Scale feature
scaled_feature = minmax_scale.fit_transform(feature)
# Show feature
scaled_feature
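Min-max scaling maps each value to (x - min) / (max - min), computed per column. As a quick sanity check, the sketch below reproduces the scaler's output by hand; the variable name `manual` is introduced here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same feature as in the notes above
feature = np.array([[-500.5], [-100.1], [0], [100.1], [900.9]])

# Scale with scikit-learn
scaled_feature = MinMaxScaler(feature_range=(0, 1)).fit_transform(feature)

# Same computation by hand: (x - min) / (max - min), per column
col_min = feature.min(axis=0)
col_max = feature.max(axis=0)
manual = (feature - col_min) / (col_max - col_min)

# True if both approaches agree
print(np.allclose(scaled_feature, manual))
```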
Standardizing
StandardScaler
Rescale the feature to have mean 0 and variance 1.
import numpy as np
from sklearn import preprocessing
# Create feature
x = np.array([[-1000.1],
[-200.2],
[500.5],
[600.6],
[9000.9]])
# Create scaler
scaler = preprocessing.StandardScaler()
# Transform the feature
standardized = scaler.fit_transform(x)
# Show feature
standardized
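Standardization computes (x - mean) / std for each column. A minimal sketch to confirm that the result really has mean 0 and standard deviation 1; `manual` is an illustrative name, not part of the notes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same feature as in the notes above
x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

standardized = StandardScaler().fit_transform(x)

# By hand: StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for .std()
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print("Mean:", round(standardized.mean()))
print("Standard deviation:", standardized.std())
```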
If the data contains outliers, use RobustScaler instead.
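With its default settings, RobustScaler centers on the median and divides by the interquartile range, so a single extreme value has much less influence than it does on the mean and standard deviation. A short sketch on the same feature, with a by-hand check (`manual` is an illustrative name):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Same feature as in the standardization example; 9000.9 acts as an outlier
x = np.array([[-1000.1], [-200.2], [500.5], [600.6], [9000.9]])

# Default settings: center on the median, scale by the 25th-75th percentile range
robust = RobustScaler().fit_transform(x)

# By hand: (x - median) / (Q3 - Q1)
q1, median, q3 = np.percentile(x, [25, 50, 75], axis=0)
manual = (x - median) / (q3 - q1)

print(np.allclose(robust, manual))
```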
Normalizing
Normalizer
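Note: norm="l2" selects the L2 (Euclidean) norm, which is simply the square root of the sum of squares.
Each observation (row) is rescaled to have unit norm, i.e. a length of 1.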
# Load libraries
import numpy as np
from sklearn.preprocessing import Normalizer
# Create feature matrix
features = np.array([[0.5, 0.5],
[1.1, 3.4],
[1.5, 20.2],
[1.63, 34.4],
[10.9, 3.3]])
# Create normalizer
normalizer = Normalizer(norm="l2")
# Transform feature matrix
normalizer.transform(features)
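Unlike the scalers above, Normalizer works row by row. A minimal sketch that checks this by dividing each row by its own L2 norm; the names `row_norms`, `manual` and `lengths` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Same feature matrix as in the notes above
features = np.array([[0.5, 0.5],
                     [1.1, 3.4],
                     [1.5, 20.2],
                     [1.63, 34.4],
                     [10.9, 3.3]])

features_l2 = Normalizer(norm="l2").transform(features)

# By hand: divide every row by the square root of its sum of squares
row_norms = np.sqrt((features ** 2).sum(axis=1, keepdims=True))
manual = features / row_norms

# After normalizing, every row has length 1
lengths = np.linalg.norm(features_l2, axis=1)
print(lengths)
```

Passing norm="l1" instead rescales each row so that the absolute values sum to 1.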
Generating Polynomial and Interaction Features
# Load libraries
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# Create feature matrix
features = np.array([[2, 3],
[2, 3],
[2, 3]])
# Create PolynomialFeatures object
polynomial_interaction = PolynomialFeatures(degree=2, include_bias=False)
# Create polynomial features
polynomial_interaction.fit_transform(features)
array([[ 2., 3., 4., 6., 9.],
[ 2., 3., 4., 6., 9.],
[ 2., 3., 4., 6., 9.]])
# We can restrict the features created to only interaction features
# by setting interaction_only to True:
# interaction = PolynomialFeatures(degree=2,
#     interaction_only=True, include_bias=False)
# interaction.fit_transform(features)
Transforming Features
You want to apply your own function to the features.
import numpy as np
from sklearn.preprocessing import FunctionTransformer
# Create feature matrix
features = np.array([[2, 3],
[2, 3],
[2, 3]])
# Define a simple function
def add_ten(x):
    return x + 10
# Create transformer
ten_transformer = FunctionTransformer(add_ten)
# Transform feature matrix
ten_transformer.transform(features)
Alternatively, use pandas.
This is the approach to focus on in practice.
import pandas as pd
# Create DataFrame
df = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Apply function
df.apply(add_ten)
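Detecting Outliers
A common approach is to fit an ellipse around the data: observations inside the ellipse are inliers and are labeled 1,
while observations outside the ellipse are outliers and are labeled -1.
Another approach is the interquartile range (IQR).
Yet another option is the RobustScaler mentioned earlier.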
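The ellipse and IQR approaches can be sketched as follows. This is an illustration under stated assumptions, not a definitive recipe: the data is synthetic, the contamination value of 0.1 is an arbitrary guess at the proportion of outliers, the 1.5 * IQR cutoff is the common convention, and `indices_of_outliers` is a helper name made up for this example. The ellipse approach uses scikit-learn's EllipticEnvelope.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic data (assumption): 100 points around the origin,
# with the first observation replaced by an extreme value
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 2))
features[0] = [10000.0, 10000.0]

# Ellipse approach: contamination is the assumed proportion of outliers
detector = EllipticEnvelope(contamination=0.1, random_state=0)
labels = detector.fit(features).predict(features)  # 1 = inlier, -1 = outlier

# IQR approach for a single feature
def indices_of_outliers(x):
    """Return indices of values more than 1.5 * IQR beyond the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return np.where((x < lower_bound) | (x > upper_bound))[0]

iqr_outliers = indices_of_outliers(features[:, 0])

print(labels[:5])
print(iqr_outliers)
```

The ellipse approach assumes roughly normally distributed data, while the IQR rule looks at one feature at a time.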
Discretizing Features
You have a numerical feature and want to break it up into discrete bins.
Use Binarizer() for a single threshold, or np.digitize() for multiple thresholds.
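import numpy as np
from sklearn.preprocessing import Binarizer
# Create feature (example ages; the original notes do not define this variable)
age = np.array([[6], [12], [20], [36], [65]])
# Create binarizer (recent scikit-learn versions require threshold as a keyword)
binarizer = Binarizer(threshold=18)
# Transform feature: values above 18 become 1, all others 0
binarizer.fit_transform(age)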
Second, we can break up numerical features according to multiple thresholds:
np.digitize(age, bins=[20,30,64])
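How np.digitize treats values that fall exactly on a bin edge is easy to get wrong, so here is a small sketch. The ages are example values assumed for illustration. By default each bin includes its left edge; with right=True it includes its right edge instead.

```python
import numpy as np

# Example ages (assumed values for illustration)
age = np.array([[6], [12], [20], [36], [65]])

# Default: bins[i-1] <= x < bins[i], so 20 falls into bin 1
bins_default = np.digitize(age, bins=[20, 30, 64])

# right=True: bins[i-1] < x <= bins[i], so 20 falls into bin 0
bins_right = np.digitize(age, bins=[20, 30, 64], right=True)

print(bins_default.ravel())  # [0 0 1 2 3]
print(bins_right.ravel())    # [0 0 0 2 3]
```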
Deleting Missing Values
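# Load libraries
import numpy as np
import pandas as pd
# Create feature matrix with one missing value (example values)
features = np.array([[1.1, 11.1], [2.2, 22.2], [3.3, 33.3], [4.4, 44.4], [np.nan, 55]])
# Load data
dataframe = pd.DataFrame(features, columns=["feature_1", "feature_2"])
# Remove observations with missing values
dataframe.dropna()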
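The same row deletion can be done without pandas by masking rows that contain NaN. A minimal sketch, with a feature matrix made up for illustration:

```python
import numpy as np

# Example feature matrix with one missing value (assumed values)
features = np.array([[1.1, 11.1],
                     [2.2, 22.2],
                     [3.3, 33.3],
                     [4.4, 44.4],
                     [np.nan, 55.0]])

# Keep only the rows in which no value is NaN
complete_rows = features[~np.isnan(features).any(axis=1)]

print(complete_rows)
```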
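Imputing Missing Values
Import SimpleImputer from sklearn.impute (the older sklearn.preprocessing.Imputer was removed in scikit-learn 0.22).
# Load library
from sklearn.impute import SimpleImputer
# Create imputer (fills each column's missing values with that column's mean)
mean_imputer = SimpleImputer(strategy="mean")
# Impute values
features_mean_imputed = mean_imputer.fit_transform(features)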
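A small self-contained sketch of mean imputation with SimpleImputer from sklearn.impute, which replaced the older Imputer class. The tiny feature matrix is made up so the result can be checked by eye: the missing value is filled with the mean of the other values in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Example feature matrix with one missing value (assumed values)
features = np.array([[1.0, 10.0],
                     [2.0, np.nan],
                     [3.0, 30.0]])

# Replace each missing value with the mean of its column
mean_imputer = SimpleImputer(strategy="mean")
features_mean_imputed = mean_imputer.fit_transform(features)

# The NaN in the second column becomes (10 + 30) / 2 = 20
print(features_mean_imputed)
```

Other strategies such as "median" and "most_frequent" are passed the same way.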
In summary, this chapter mainly covers handling numerical data.