Machine learning with scikit-learn in Python: data preprocessing
Referenced from http://lib.csdn.net/article/machinelearning/1119 and http://scikit-learn.org/stable/modules/preprocessing.html
(1) Data standardization (Standardization, or Mean Removal and Variance Scaling)
After standardization, each feature (column) has zero mean and unit variance.
# The scale function provides a quick way to standardize an array-like dataset:
from sklearn import preprocessing  # import the data preprocessing module
X=[[1.,-1.,2.],
[2.,0.,0.],
[0.,1.,-1.]]
X_scaled=preprocessing.scale(X)
X_scaled
array([[ 0. , -1.22474487, 1.33630621],
[ 1.22474487, 0. , -0.26726124],
[-1.22474487, 1.22474487, -1.06904497]])
X_scaled.mean(axis=0)  # preprocessing.scale() works along axis 0 by default, i.e. each column (feature) is scaled independently
array([ 0., 0., 0.])
X_scaled.std(axis=0)
array([ 1., 1., 1.])
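Similarly, the same operation is available through the Scaler utility class (StandardScaler in version 0.15 and later) provided by the preprocessing module. It stores the mean and scale learned from the training data, so the same transformation can be reapplied to new data: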
scaler=preprocessing.StandardScaler().fit(X)
scaler
StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.mean_
array([ 1. , 0. , 0.33333333])
scaler.scale_  # scaler.std_ will be removed in 0.19; use scale_ instead
array([ 0.81649658, 0.81649658, 1.24721913])
scaler.transform(X)
array([[ 0. , -1.22474487, 1.33630621],
[ 1.22474487, 0. , -0.26726124],
[-1.22474487, 1.22474487, -1.06904497]])
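To make the arithmetic concrete, here is a minimal NumPy-only sketch of the same standardization. It assumes the population standard deviation (ddof=0), which matches the scale_ values printed above. The new sample X_new is made up for illustration.

```python
import numpy as np

# Training data from the example above.
X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Per-column statistics; np.std defaults to the population std (ddof=0).
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Standardize the training data column by column.
X_train_scaled = (X_train - mean) / std

# Reuse the training statistics on a new sample (hypothetical values).
X_new = np.array([[-1., 1., 0.]])
X_new_scaled = (X_new - mean) / std
print(X_new_scaled)
```

The point of keeping mean and std around is that new data must be scaled with the training statistics, not with its own.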
(2) Data normalization (Normalization)
MinMaxScaler rescales each feature (column) of the matrix to the interval [0, 1]:
import numpy as np
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
min_max_scaler=preprocessing.MinMaxScaler()
X_train_minmax=min_max_scaler.fit_transform(X_train)  # per column: subtract the column minimum, then divide by (column maximum - column minimum)
X_train_minmax
array([[ 0.5 , 0. , 1. ],
[ 1. , 0.5 , 0.33333333],
[ 0. , 1. , 0. ]])
# The fitted min_max_scaler can then be applied to new data:
X_test=np.array([[-3.,-1.,4.]])
X_test_minmax=min_max_scaler.transform(X_test)
X_test_minmax
array([[-1.5 , 0. , 1.66666667]])
min_max_scaler.scale_
array([ 0.5 , 0.5 , 0.33333333])
min_max_scaler.min_
array([ 0. , 0.5 , 0.33333333])
If MinMaxScaler is given an explicit feature_range = (min, max), the full formula is:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
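As a quick check of the two-step rule (map each column to [0, 1], then multiply by (max - min) and add min), the sketch below compares a manual computation with MinMaxScaler. The target interval (-1, 1) and the names lo and hi are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Target interval; (-1, 1) is an arbitrary choice for this sketch.
lo, hi = -1.0, 1.0

# Step 1: map every column to [0, 1].
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_std = (X_train - col_min) / (col_max - col_min)

# Step 2: stretch and shift to [lo, hi].
X_manual = X_std * (hi - lo) + lo

# The same result from the library class.
X_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X_train)
print(np.allclose(X_manual, X_sklearn))
```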
MaxAbsScaler works in a very similar way, but it divides each feature by its largest absolute value, so the training data lies in the range [-1, 1]. It does not shift the data, so it is meant for data that is already centered at zero or for sparse data.
X_train = np.array([[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]])
max_abs_scaler=preprocessing.MaxAbsScaler()  # divides each column by its maximum absolute value
X_train_maxabs=max_abs_scaler.fit_transform(X_train)
X_train_maxabs
array([[ 0.5, -1. , 1. ],
[ 1. , 0. , 0. ],
[ 0. , 1. , -0.5]])
X_test=np.array([[-3.,-1.,4.]])
X_test_maxabs=max_abs_scaler.transform(X_test)
X_test_maxabs
array([[-1.5, -1. , 2. ]])
max_abs_scaler.scale_
array([ 2., 1., 2.])
Scaling with the mean and variance of the data may not work well if the data contains many outliers. In these cases, robust_scale and RobustScaler can be used instead. They use more robust estimates of the center and spread of the data (by default the median and the interquartile range).
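The original text gives no example for RobustScaler, so here is a small sketch with made-up data: one feature containing a single large outlier. It relies on the default settings, which center on the median and scale by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with a single large outlier (made-up values).
X = np.array([[1.], [2.], [3.], [4.], [100.]])

# Defaults: center on the median, scale by the 25th-75th percentile range.
robust = RobustScaler().fit(X)
print(robust.center_)  # median of the column
print(robust.scale_)   # interquartile range of the column
X_robust = robust.transform(X)

# For comparison: mean/variance scaling of the same data.
X_standard = StandardScaler().fit_transform(X)
print(X_robust.ravel())
print(X_standard.ravel())
```

With StandardScaler the outlier inflates the standard deviation, so the four ordinary values are squeezed into a narrow band. With RobustScaler they keep a spacing of 0.5.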
Normalization, in contrast, scales each sample (row) to unit norm, so every value in a normalized row lies in [-1, 1].
X=[[1.,-1.,2.],
[2.,0.,0.],
[0.,1.,-1.]]
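X_normalized=preprocessing.normalize(X,norm='l2')  # norm can be 'l1', 'l2' or 'max'; the default axis=1 normalizes each row
# norm='l2': divide each element by the square root of the sum of squares of its row
# norm='l1': divide each element by the sum of absolute values of its row, np.abs(X).sum(axis=1)
# norm='max': divide each element by the largest value of its row (the largest absolute value in recent releases)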
X_normalized
array([[ 0.40824829, -0.40824829, 0.81649658],
[ 1. , 0. , 0. ],
[ 0. , 0.70710678, -0.70710678]])
normalizer=preprocessing.Normalizer().fit(X)#fit does nothing
normalizer
Normalizer(copy=True, norm='l2')
normalizer.transform(X)
array([[ 0.40824829, -0.40824829, 0.81649658],
[ 1. , 0. , 0. ],
[ 0. , 0.70710678, -0.70710678]])
normalizer.transform([[-1.,1.,0.]])
array([[-0.70710678, 0.70710678, 0. ]])
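The row-wise norms described in the comments above can be reproduced with plain NumPy. This is only a sketch: it assumes no row is all zeros, since that would divide by zero.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Row-wise norms; keepdims=True keeps shape (3, 1) so the division
# broadcasts across each row.
l2_norms = np.sqrt((X ** 2).sum(axis=1, keepdims=True))
l1_norms = np.abs(X).sum(axis=1, keepdims=True)

X_l2 = X / l2_norms  # every row has Euclidean length 1
X_l1 = X / l1_norms  # absolute values in every row sum to 1
print(X_l2)
print(X_l1)
```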
(3) Binarization
Binarization converts numeric features to Boolean (0/1) values: values greater than the threshold map to 1, all others to 0. The cut-off is set with the threshold parameter.
X=[[1.,-1.,2.],
[2.,0.,0.],
[0.,1.,-1.]]
binarizer=preprocessing.Binarizer().fit(X)#fit does nothing
binarizer  # the default threshold is 0.0
Binarizer(copy=True, threshold=0.0)
binarizer.transform(X)
array([[ 1., 0., 1.],
[ 1., 0., 0.],
[ 0., 1., 0.]])
binarizer=preprocessing.Binarizer(threshold=1.1)  # set the threshold to 1.1
binarizer.transform(X)
array([[ 0., 0., 1.],
[ 1., 0., 0.],
[ 0., 0., 0.]])
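The thresholding rule is a strict comparison, which a one-line NumPy sketch makes visible. The threshold of 1.0 is chosen here only because the data contains entries exactly equal to it.

```python
import numpy as np

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

threshold = 1.0  # illustrative value, equal to some entries of X

# Strict comparison: entries equal to the threshold become 0.
X_binary = (X > threshold).astype(float)
print(X_binary)
```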
(4) Label preprocessing
4.1) Label binarization
LabelBinarizer is typically used to create a label indicator matrix from a list of multi-class labels:
lb=preprocessing.LabelBinarizer()
lb.fit([1,2,6,4,2])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
lb.classes_
array([1, 2, 4, 6])
lb.transform([1,6])
array([[1, 0, 0, 0],
[0, 0, 0, 1]])
4.2) Label encoding
le=preprocessing.LabelEncoder()
le.fit([1,2,2,6])
LabelEncoder()
le.classes_
array([1, 2, 6])
le.transform([1,1,2,6])  # encode
array([0, 0, 1, 2], dtype=int64)
le.inverse_transform([0,0,1,2])  # decode
array([1, 1, 2, 6])
It can also be used to convert non-numeric labels to numeric labels:
le=preprocessing.LabelEncoder()
le.fit(["paris","paris","tokyo","amsterdam"])
LabelEncoder()
list(le.classes_)
['amsterdam', 'paris', 'tokyo']
le.transform(["tokyo","tokyo","paris"])
array([2, 2, 1], dtype=int64)
le.inverse_transform([0,2,1])
array(['amsterdam', 'tokyo', 'paris'],
dtype='<U9')
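Conceptually, label encoding maps each label to its position in the sorted list of distinct labels. The NumPy sketch below illustrates that idea with np.unique. It is not a description of how LabelEncoder is implemented internally.

```python
import numpy as np

labels = np.array(["paris", "paris", "tokyo", "amsterdam"])

# classes: sorted distinct labels; codes: position of each label in classes.
classes, codes = np.unique(labels, return_inverse=True)
print(classes)
print(codes)

# Decoding is plain indexing into the classes array.
decoded = classes[codes]
print(decoded)
```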
(5) Encoding categorical features (OneHotEncoder)
Features are often categorical rather than continuous, for example gender: ["male", "female"], region: ["from Europe", "from US", "from Asia"], browser: ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features can be efficiently encoded as integers:
["male", "from US", "uses Internet Explorer"] is encoded as [0, 1, 3]
["female", "from Asia", "uses Chrome"] is encoded as [1, 2, 1]
But such integer codes cannot be fed directly into most machine learning algorithms, because they would be interpreted as ordered, continuous values.
To address this, note that gender takes two values, region three, and browser four. One-hot encoding represents each feature with one binary column per possible value. For the sample ["male", "from US", "uses Internet Explorer"], "male" corresponds to [1, 0], "from US" to [0, 1, 0], and "uses Internet Explorer" to [0, 0, 0, 1]. The complete encoded sample is therefore [1, 0, 0, 1, 0, 0, 0, 0, 1]. One consequence is that the data becomes very sparse.
enc=preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
My understanding of the code above is that each column of the matrix is treated as one feature to encode. The first column is [0 1 0 1], which has two distinct values, 0 and 1, so two dimensions are needed; 0 and 1 are encoded as [1 0] and [0 1], respectively.
Similarly, the second column is [0 1 2 0], which has three distinct values, 0, 1 and 2, so three dimensions are needed; they are encoded as [1 0 0], [0 1 0] and [0 0 1].
The third column is [3 0 1 2], which has four distinct values, 0, 1, 2 and 3, so four dimensions are needed; they are encoded as [1 0 0 0], [0 1 0 0], [0 0 1 0] and [0 0 0 1].
enc.transform([[1,2,0]]).toarray()
array([[ 0., 1., 0., 0., 1., 1., 0., 0., 0.]])
enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
# Note that there are missing categorical values for the 2nd and 3rd
# features
enc.fit([[1, 2, 3], [0, 2, 0]])
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
handle_unknown='error', n_values=[2, 3, 4], sparse=True)
enc.transform([[1, 0, 0]]).toarray()
array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
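The column-by-column encoding described for the first example can be sketched in plain NumPy. The helper one_hot is hypothetical and written only for illustration. Unlike OneHotEncoder with handle_unknown='error', it silently produces an all-zero block for a value it has not seen.

```python
import numpy as np

# Same fit data as the first OneHotEncoder example.
X_fit = np.array([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One sorted array of distinct values per column, learned from the fit data.
categories = [np.unique(X_fit[:, j]) for j in range(X_fit.shape[1])]

def one_hot(row):
    """Concatenate one indicator vector per feature (illustrative helper)."""
    parts = []
    for value, cats in zip(row, categories):
        # 1.0 at the position of the matching category, 0.0 elsewhere.
        parts.append((cats == value).astype(float))
    return np.concatenate(parts)

encoded = one_hot([1, 2, 0])
print(encoded)
```

The output has 2 + 3 + 4 = 9 columns, one block per feature, which is why one-hot encoded data grows wide and sparse.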
(6) Imputing missing values
The following code snippet demonstrates how to replace missing values coded as np.nan with the mean of the column (axis 0) containing missing values:
imp=preprocessing.Imputer(missing_values='NaN',strategy='mean',axis=0)  # Imputer was removed in scikit-learn 0.22; newer releases provide sklearn.impute.SimpleImputer
imp.fit([[1,2],[np.nan,3],[7,6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
X=[[np.nan,2],[6,np.nan],[7,6]]
imp.transform(X)
array([[ 4. , 2. ],
[ 6. , 3.66666667],
[ 7. , 6. ]])
The Imputer class also supports sparse matrices. Here missing values are encoded as 0, so they are stored implicitly in the sparse matrix:
import scipy.sparse as sp
X=sp.csc_matrix([[1,2],[0,3],[7,6]])
imp=preprocessing.Imputer(missing_values=0,strategy='mean',axis=0)
imp.fit(X)
Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
X_test=sp.csc_matrix([[0,2],[6,0],[7,6]])
imp.transform(X_test)
array([[ 4. , 2. ],
[ 6. , 3.66666667],
[ 7. , 6. ]])
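The Imputer class used above is not available in recent scikit-learn releases. Assuming a version that ships sklearn.impute.SimpleImputer, the dense mean-imputation example could be written roughly as follows. SimpleImputer always works column-wise, so there is no axis argument.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Learn the column means from data that contains a missing value.
imp = SimpleImputer(missing_values=np.nan, strategy="mean")
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
print(imp.statistics_)  # per-column means used as fill values

# Fill the gaps in new data with the learned means.
X = [[np.nan, 2], [6, np.nan], [7, 6]]
X_imputed = imp.transform(X)
print(X_imputed)
```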
(7) Generating polynomial features
X=np.arange(6).reshape(3,2)
X
array([[0, 1],
[2, 3],
[4, 5]])
poly=preprocessing.PolynomialFeatures(degree=2)
poly.fit_transform(X)
# The features of X are transformed from (X_1, X_2) to (1, X_1, X_2, X_1^2, X_1X_2, X_2^2).
array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])
With interaction_only=True, only the interaction terms between distinct features are kept, and pure powers such as squares and cubes are dropped:
X = np.arange(9).reshape(3, 3)
X
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
poly=preprocessing.PolynomialFeatures(degree=3,interaction_only=True)
poly.fit_transform(X)
# The features are transformed from (X_1, X_2, X_3) to (1, X_1, X_2, X_3, X_1X_2, X_1X_3, X_2X_3, X_1X_2X_3).
array([[ 1., 0., 1., 2., 0., 0., 2., 0.],
[ 1., 3., 4., 5., 12., 15., 20., 60.],
[ 1., 6., 7., 8., 42., 48., 56., 336.]])
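To show where the output columns come from, here is a sketch that builds the degree-2 features by hand with itertools. The helper poly_features is hypothetical and only mirrors the column order shown above. It is not how PolynomialFeatures is implemented.

```python
from itertools import combinations_with_replacement

import numpy as np

X = np.arange(6).reshape(3, 2)

def poly_features(X, degree):
    """Build all monomials of the columns of X up to `degree` (illustrative helper)."""
    n_samples, n_features = X.shape
    columns = []
    for d in range(degree + 1):
        # Each index tuple names the columns to multiply together;
        # the empty tuple (d = 0) yields the bias column of ones.
        for idx in combinations_with_replacement(range(n_features), d):
            col = np.ones(n_samples)
            for j in idx:
                col = col * X[:, j]
            columns.append(col)
    return np.column_stack(columns)

X_poly = poly_features(X, degree=2)
print(X_poly)
```

Replacing combinations_with_replacement with itertools.combinations would keep only products of distinct features, which corresponds to the interaction_only=True behavior.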
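(8) Custom transformers
FunctionTransformer wraps an arbitrary function as a transformer, for example to apply a log transformation: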
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
X = np.array([[0, 1], [2, 3]])
transformer.transform(X)
array([[ 0. , 0.69314718],
[ 1.09861229, 1.38629436]])
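FunctionTransformer also accepts an inverse_func, which makes the transformation reversible. The sketch below pairs np.log1p with np.expm1. The pairing is chosen for illustration and does not appear in the original text.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x); expm1(y) = exp(y) - 1 undoes it.
transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X = np.array([[0., 1.], [2., 3.]])
X_log = transformer.fit_transform(X)
X_back = transformer.inverse_transform(X_log)
print(np.allclose(X_back, X))
```

This is handy when a model is trained on a transformed target or feature and the predictions need to be mapped back to the original scale.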