[Reprint] Nine Common Data Preprocessing Methods in Python

Original link: http://www.jb51.net/article/92408.htm

This article summarizes data preprocessing methods commonly used in Python, presented through sklearn's preprocessing module.

1. Standardization (or Mean Removal and Variance Scaling)

After transformation, each feature dimension has zero mean and unit variance. This is also called z-score normalization (zero-mean normalization). It is computed by subtracting the mean from each feature value and dividing by the standard deviation: x' = (x − mean) / std.

sklearn.preprocessing.scale(X)

Generally, either the training and test sets are standardized together, or a scaler is fit on the training set and the same scaler is then used to standardize the test set. The StandardScaler supports this:

scaler = sklearn.preprocessing.StandardScaler().fit(train)
scaler.transform(train)
scaler.transform(test)

In practice, a common scenario that requires feature standardization is SVM.
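
A minimal end-to-end sketch (toy values assumed), showing that the scaler fitted on the training set is reused for the test set:

import numpy as np
from sklearn import preprocessing

train = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])
test = np.array([[-1., 1., 0.]])

scaler = preprocessing.StandardScaler().fit(train)  # learn mean and std from train only
train_scaled = scaler.transform(train)              # each column now has mean 0 and std 1
test_scaled = scaler.transform(test)                # reuse the training statistics

print(train_scaled.mean(axis=0))  # ~[0. 0. 0.]
print(train_scaled.std(axis=0))   # [1. 1. 1.]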

2. Min-Max Normalization

Min-max normalization linearly transforms the original data into the [0, 1] interval (or another fixed interval): x' = (x − min) / (max − min).

min_max_scaler = sklearn.preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(X_train)

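A minimal sketch with toy data (values assumed); each column is rescaled to [0, 1]:

from sklearn import preprocessing

X_train = [[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]]
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(X_train)
# array([[0.5       , 0.        , 1.        ],
#        [1.        , 0.5       , 0.33333333],
#        [0.        , 1.        , 0.        ]])
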
3. Normalization

Normalization maps values with different ranges into the same fixed range, commonly [0, 1].

Here it means transforming each sample to have unit norm.

X = [[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]]
sklearn.preprocessing.normalize(X, norm='l2')

which yields:

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

Note that for each sample, 0.408² + 0.408² + 0.816² ≈ 1. This is the L2 norm: after the transformation, the squared feature values of each sample sum to 1. Similarly, with the L1 norm, the absolute feature values of each sample sum to 1 after the transformation. There is also the max norm, which divides each feature of a sample by the sample's maximum absolute feature value.
When measuring similarity between samples, e.g., with a quadratic kernel, normalization is needed.
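
To illustrate the L1 and max norms mentioned above, a small sketch on the same X (outputs computed by hand):

from sklearn import preprocessing

X = [[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]]
preprocessing.normalize(X, norm='l1')
# array([[ 0.25, -0.25,  0.5 ],
#        [ 1.  ,  0.  ,  0.  ],
#        [ 0.  ,  0.5 , -0.5 ]])
preprocessing.normalize(X, norm='max')
# array([[ 0.5, -0.5,  1. ],
#        [ 1. ,  0. ,  0. ],
#        [ 0. ,  1. , -1. ]])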

4. Binarization

Convert features to 0/1 given a threshold

binarizer = sklearn.preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
# with the X above: array([[ 0., 0., 1.], [ 1., 0., 0.], [ 0., 0., 0.]])

5. Label binarization

lb = sklearn.preprocessing.LabelBinarizer()
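
A minimal usage sketch (label values assumed for illustration): fit learns the set of classes, and transform maps each label to a one-hot row.

from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit([1, 2, 6, 4, 2])
lb.classes_  # array([1, 2, 4, 6])
lb.transform([1, 6])
# array([[1, 0, 0, 0],
#        [0, 0, 0, 1]])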

6. Categorical feature encoding

Sometimes features are categorical, while the input to some algorithms must be numeric, so such features need to be encoded.

enc = sklearn.preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[0, 1, 3]]).toarray()  # array([[1., 0., 0., 1., 0., 0., 0., 0., 1.]])

In the example above, the first feature takes two values (0 and 1), so it is encoded with two bits; the second feature takes three values and uses three bits; and the third feature takes four values and uses four bits.

Another way of encoding is pandas' get_dummies:

newdf = pd.get_dummies(df, columns=["gender", "title"], dummy_na=True)

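A hedged sketch with a toy DataFrame (column names and values assumed); dummy_na=True adds an extra indicator column for missing values:

import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", None], "age": [23, 31, 45]})
pd.get_dummies(df, columns=["gender"], dummy_na=True)
# produces columns: age, gender_F, gender_M, gender_nan (0/1 indicators)
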
7. Label encoding

le = sklearn.preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
le.transform([1, 1, 2, 6])  # array([0, 0, 1, 2])
# non-numeric labels can also be converted to numeric ones
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])  # array([2, 2, 1])

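The fitted encoder can also map encoded values back with inverse_transform:

list(le.inverse_transform([2, 2, 1]))  # ['tokyo', 'tokyo', 'paris']
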
8. When features contain outliers

Centering and scaling with the mean and variance are sensitive to outliers; robust_scale uses the median and the interquartile range instead:

sklearn.preprocessing.robust_scale(X)

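A minimal sketch (toy data with an artificial outlier); unlike scale, the outlier barely affects the statistics used:

from sklearn import preprocessing

X = [[1., 0.], [2., 1.], [3., 2.], [100., 3.]]  # 100. is an outlier
preprocessing.robust_scale(X)  # each column is centered on its median and scaled by its IQR
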
9. Generating polynomial features

This is really feature engineering: generating polynomial and interaction (cross) features.

poly = sklearn.preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)

Original features (X1, X2) are transformed into (1, X1, X2, X1², X1·X2, X2²).
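
A minimal sketch (toy data assumed) matching the mapping above:

import numpy as np
from sklearn import preprocessing

X = np.arange(6).reshape(3, 2)  # [[0, 1], [2, 3], [4, 5]]
poly = preprocessing.PolynomialFeatures(2)
poly.fit_transform(X)
# array([[ 1.,  0.,  1.,  0.,  0.,  1.],
#        [ 1.,  2.,  3.,  4.,  6.,  9.],
#        [ 1.,  4.,  5., 16., 20., 25.]])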

Summary

The above is a summary of nine common data preprocessing methods in Python. I hope this article helps you learn or use Python. If you have any questions, feel free to leave a comment.
