Scikit-learn Data Preprocessing

This article walks through several common preprocessing methods in scikit-learn's preprocessing module, with simple code examples: standardization, min-max scaling, normalization, feature binarization, and missing-value imputation.

Mathematical basis

Mean formula:

$$\bar{x}=\frac{1}{n} \sum_{i=1}^{n} x_{i}$$

Variance formula:

$$s^{2}=\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\bar{x}\right)^{2}$$

0-norm: the number of non-zero elements of a vector.

1-norm:

$$\|X\|_{1}=\sum_{i=1}^{n}\left|x_{i}\right|$$

2-norm:

$$\|X\|_{2}=\left(\sum_{i=1}^{n} x_{i}^{2}\right)^{\frac{1}{2}}$$

p-norm:

$$\|X\|_{p}=\left(\left|x_{1}\right|^{p}+\left|x_{2}\right|^{p}+\ldots+\left|x_{n}\right|^{p}\right)^{\frac{1}{p}}$$
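As a quick sanity check, these norms can be computed directly with NumPy (`np.linalg.norm` implements the general p-norm):

```python
import numpy as np

x = np.array([1., -1., 2.])

norm0 = np.count_nonzero(x)                # 0-"norm": number of non-zero entries
norm1 = np.abs(x).sum()                    # 1-norm: sum of absolute values
norm2 = np.sqrt((x ** 2).sum())            # 2-norm: Euclidean length
norm3 = (np.abs(x) ** 3).sum() ** (1 / 3)  # p-norm with p = 3

print(norm0, norm1, norm2, norm3)
```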

1. Standardization

In practice we often ignore the shape of a feature's distribution and simply transform the data toward its center: remove each feature's mean, then divide by its standard deviation (the same centering trick is routinely used in probability theory).

However, when individual feature values differ greatly in magnitude, or the data do not follow a Gaussian (normal) distribution, standardization tends to perform poorly.

Formula: (X - X_mean) / X_std, computed separately for each attribute (each column).

Each column has its mean subtracted and is then divided by its standard deviation. The end result is that, for every attribute/column, the data are centered around 0 with unit variance.

Method 1: the sklearn.preprocessing.scale() function

from sklearn import preprocessing
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
X_mean = X.mean(axis=0)           # column means
X_std = X.std(axis=0)             # column standard deviations
X1 = (X - X_mean) / X_std         # standardize X manually
X_scale = preprocessing.scale(X)  # the same standardization via preprocessing.scale

In the end, X_scale and X1 hold the same values.
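A quick check with np.allclose confirms that the manual computation and preprocessing.scale agree (a self-contained sketch reusing the same X):

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

X1 = (X - X.mean(axis=0)) / X.std(axis=0)  # manual standardization
X_scale = preprocessing.scale(X)           # library standardization

print(np.allclose(X1, X_scale))
```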

Method 2: the sklearn.preprocessing.StandardScaler class

from sklearn import preprocessing
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler()
X_scaled = scaler.fit_transform(X)

Both methods produce the same result.
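The practical advantage of StandardScaler over the one-off scale() function is that it stores the training statistics (in its mean_ and scale_ attributes), so the same transformation can later be applied to new data. A minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

scaler = StandardScaler().fit(X_train)
print(scaler.mean_)   # per-column means learned from X_train
print(scaler.scale_)  # per-column standard deviations

# new data is transformed using the *training* statistics
X_new = np.array([[-1., 1., 0.]])
print(scaler.transform(X_new))
```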

2. Min-Max Scaling (MinMaxScaler)

Another common approach is to scale features to lie between a given minimum and maximum value, typically [0, 1]; this can be done with the preprocessing.MinMaxScaler class.

The reasons for using this method include:

  • 1. it improves the stability of features with very small variance;
  • 2. it preserves zero entries in sparse matrices.

Scaling the data into [0, 1] with MinMaxScaler:

from sklearn import preprocessing
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_minMax = min_max_scaler.fit_transform(X)
X_minMax
array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
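Under the hood, MinMaxScaler computes (X - X.min) / (X.max - X.min) per column, and the target interval can be changed with the feature_range parameter. A sketch verifying both points:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# manual computation for the default range [0, 1]
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(np.allclose(MinMaxScaler().fit_transform(X), X_manual))

# scale into [-1, 1] instead
X_sym = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
print(X_sym.min(axis=0), X_sym.max(axis=0))
```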

3. Normalization

Normalization scales each individual sample so that it has unit norm. This is useful when a quadratic form such as the dot product, or any other kernel, is used to measure the similarity between pairs of samples.

This assumption is the basis of the Vector Space Model often used in text classification and clustering.

The main idea is to compute the p-norm of each sample and then divide every element of the sample by that norm, so that after the transformation each sample's p-norm (l1-norm or l2-norm) equals 1.

Method 1: the sklearn.preprocessing.normalize() function

from sklearn import preprocessing
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized

Method 2: the sklearn.preprocessing.Normalizer class

from sklearn import preprocessing
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
normalizer = preprocessing.Normalizer()
normalizer.transform(X)

Both methods give the same result:

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
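Equivalently, l2 normalization just divides every row by its own Euclidean norm, so afterwards each sample has unit length:

```python
import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

# divide each row (sample) by its own l2 norm
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / row_norms

print(np.allclose(X_manual, normalize(X, norm='l2')))
print(np.linalg.norm(X_manual, axis=1))  # every row now has norm 1
```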

4. Binarization

Feature binarization converts numerical features into boolean values.

from sklearn import preprocessing
import numpy as np

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
binarizer = preprocessing.Binarizer()
binarizer.transform(X)

Binarizer accepts a threshold parameter: values greater than the threshold become 1, and values less than or equal to it become 0. Sample code:

# just pass the threshold parameter
binarizer = preprocessing.Binarizer(threshold=1.1)
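Putting it together, a complete run with threshold=1.1 (only the values strictly greater than 1.1, here the 2s, become 1):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])

binarizer = Binarizer(threshold=1.1)
print(binarizer.transform(X))
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```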

5. Missing Values

Real-world datasets often contain missing values, recorded as blanks, NaNs, or other placeholders.

Such data cannot be fed directly into scikit-learn estimators for training, so it needs to be processed first.

Fortunately, scikit-learn provides a basic strategy for handling missing values: replace them with the mean, the median, or the most frequent value of the column they appear in. The old Imputer class was removed in scikit-learn 0.22; its modern replacement is sklearn.impute.SimpleImputer.

import numpy as np
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit([[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))
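Besides the mean, SimpleImputer (the modern replacement for the removed Imputer class) also supports the 'median' and 'most_frequent' strategies; a small sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = [[1, 2], [np.nan, 3], [7, 6], [np.nan, 3]]

# median imputation is more robust to outliers than the mean
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
print(imp_median.fit_transform(data))

# most_frequent also works for categorical-like columns
imp_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
print(imp_mode.fit_transform(data))
```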

SimpleImputer also supports sparse matrices. Unlike the removed Imputer, it cannot treat 0 as missing in sparse input (zeros are stored implicitly), so missing entries need an explicit sentinel value such as -1:

import scipy.sparse as sp
from sklearn.impute import SimpleImputer

X = sp.csc_matrix([[1, 2], [-1, 3], [7, 6]])
imp = SimpleImputer(missing_values=-1, strategy='mean')
imp.fit(X)

X_test = sp.csc_matrix([[-1, 2], [6, -1], [7, 6]])
print(imp.transform(X_test).toarray())

For more, see the official scikit-learn documentation.

Reference: https://blog.csdn.net/Dream_angel_Z/article/details/49406573

Origin: www.cnblogs.com/lfri/p/11774704.html