Python data preprocessing (introductory)

Data preprocessing is the first step in data analysis; getting clean data is a prerequisite for a good analysis result.

Today I learned several entry-level data preprocessing methods, so here are my notes. Tools: Python with sklearn.

1, Row normalization / regularization: Normalizer

Scales each row so that its sum of squares is 1 (unit L2 norm). Used in text classification and clustering.

import pandas as pd
z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0, 2.]})

from sklearn.preprocessing import Normalizer
Normalizer().fit_transform(z)
# equivalent to the usual three sklearn steps
a = Normalizer()  # instantiate
a.fit(z)  # fit the model
a.transform(z)  # transform

After normalization, every row of z has unit length; for example, the first row [2, 3] becomes approximately [0.555, 0.832].
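As a quick check of what Normalizer does, here is a minimal sketch that reproduces the result by hand with NumPy, assuming the default L2 norm. The names `manual` and `row_norms` are only for illustration and do not come from the original notes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0, 2.]})

# Normalizer defaults to norm="l2": each row is divided by its Euclidean length
normalized = Normalizer().fit_transform(z)

# The same thing by hand: divide every row by the square root of its sum of squares
values = z.to_numpy()
row_norms = np.sqrt((values ** 2).sum(axis=1, keepdims=True))
manual = values / row_norms

print(normalized)
print(np.allclose(normalized, manual))   # the two results agree
print((normalized ** 2).sum(axis=1))     # sum of squares of each row
```

Note that the scaling is done per row, so the result for one sample does not depend on the other samples.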

2, Column normalization / standardization / dimensionless scaling: StandardScaler

This method assumes the data is approximately Gaussian. Each column is rescaled to mean 0 and variance 1.

from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(z)
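To make "mean 0 and variance 1" concrete, the sketch below compares StandardScaler with the formula (x - mean) / std applied per column. It assumes the default settings; `manual` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0, 2.]})

scaled = StandardScaler().fit_transform(z)

# By hand: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), as does ndarray.std().
values = z.to_numpy()
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0))   # close to 0 for every column
print(scaled.std(axis=0))    # close to 1 for every column
```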

3, Range scaling / min-max transformation / dimensionless scaling

Maps each column to [0, 1]. However, when new data is added, the maximum / minimum values may change and the scaling has to be redefined, so this method is not well suited to machine learning methods that rely on a distance metric.

# Range scaling / min-max transformation / dimensionless scaling
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler().fit_transform(z)
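The point about new data can be seen in a small sketch: the scaler remembers the minimum and maximum of the data it was fitted on, so a later sample outside that range lands outside [0, 1]. The sample `new_row` is hypothetical and chosen only to illustrate this.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0, 2.]})

scaler = MinMaxScaler().fit(z)   # learns the per-column min and max from z
print(scaler.data_min_, scaler.data_max_)

# Hypothetical new sample whose "a" value exceeds the maximum seen during fit
new_row = pd.DataFrame({"a": [11.], "b": [1.5]})
scaled_new = scaler.transform(new_row)
print(scaled_new)   # the "a" entry is above 1, the "b" entry is inside [0, 1]
```

Refitting the scaler on the enlarged data would bring everything back into [0, 1], but it would also change the scaled values of the old samples.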

4, Feature binarization

Set a threshold: values greater than the threshold become 1, and values less than or equal to the threshold become 0. This can be used to process the target vector of a binary classification problem.

# Feature binarization
from sklearn.preprocessing import Binarizer
Binarizer(threshold=1).fit_transform(z)  # threshold set to 1
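The threshold rule is simple enough to write by hand. The sketch below compares Binarizer with a plain comparison; note that a value equal to the threshold maps to 0. The name `manual` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0, 2.]})

binary = Binarizer(threshold=1).fit_transform(z)

# By hand: strictly greater than the threshold -> 1, otherwise -> 0
manual = (z.to_numpy() > 1).astype(float)

print(binary)
print(np.array_equal(binary, manual))
```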

5, One-hot encoding

Classifiers often assume by default that the data is continuous and ordered, but many features are discrete. One-hot encoding therefore gives every distinct value of a discrete feature its own column, where 1 means the sample takes that value and 0 means it does not.

# One-hot encoding
z3 = pd.DataFrame({"A": ["M", "F", "M", "F"], "B": ["first year", "second year", "third year", "first year"]})
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(categories="auto")
enc.fit(z3)
ans = enc.transform([["M", "first year"]]).toarray()  # toarray() converts the sparse result to an array for display
OneHotEncoder(categories="auto").fit_transform(z3).toarray()

In this example the result is a 4 x 5 matrix: the first two columns correspond to the values of A ("F", "M") and the last three to the three distinct values of B, each group in sorted order.
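To see which output column belongs to which value, the fitted encoder's `categories_` attribute can be printed next to the encoded matrix. This is a minimal sketch; the English labels in column B are illustrative stand-ins for the school years used in the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

z3 = pd.DataFrame({
    "A": ["M", "F", "M", "F"],
    "B": ["first year", "second year", "third year", "first year"],
})

enc = OneHotEncoder(categories="auto")
encoded = enc.fit_transform(z3).toarray()   # sparse matrix -> dense array

# categories_ lists, per input column, the sorted values behind each output column
print(enc.categories_)
print(encoded.shape)
print(encoded)
```

Each row contains exactly two ones, one for the value of A and one for the value of B.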

6, Missing value imputation

This is mainly about filling in missing values.

import numpy as np
z5 = pd.DataFrame({"A": [1, 5, np.nan], "B": [np.nan, 3, 5], "C": [1, 2, 3]})
from sklearn.impute import SimpleImputer
SimpleImputer().fit_transform(z5)  # by default, missing values are replaced with the column mean
SimpleImputer(strategy='constant').fit_transform(z5)  # missing values are replaced with fill_value, which defaults to 0 for numeric data
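The two strategies can be compared side by side. In this sketch, `fill_value=-1` is an arbitrary value chosen only to make the filled cells easy to spot.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

z5 = pd.DataFrame({"A": [1, 5, np.nan], "B": [np.nan, 3, 5], "C": [1, 2, 3]})

# strategy="mean" (the default): each NaN becomes the mean of its column
mean_filled = SimpleImputer(strategy="mean").fit_transform(z5)

# strategy="constant": each NaN becomes fill_value (-1 here, an arbitrary choice)
const_filled = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(z5)

print(mean_filled)    # column A is filled with 3.0, column B with 4.0
print(const_filled)   # every missing cell is filled with -1.0
```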

7, Creating polynomial features

Given two features a and b, the degree-2 polynomial features are 1, a, b, a^2, ab, b^2.

# Construct polynomial features
from sklearn.preprocessing import PolynomialFeatures
p1 = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)  # produce squared terms and cross terms of the features
p2 = p1.fit_transform(z)
# add column names to p2; scikit-learn older than 1.0 uses get_feature_names() and names the columns x0, x1, ...
p2_df = pd.DataFrame(p2, columns=p1.get_feature_names_out())
p2_df

     a    b   a^2   a b  b^2
0  2.0  3.0   4.0   6.0  9.0
1  1.0  0.0   1.0   0.0  0.0
2  6.0  2.0  36.0  12.0  4.0
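The effect of the two switches `include_bias` and `interaction_only` is easiest to see by flipping them. This is a small sketch on the same data; the names `full` and `cross` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0, 2.]})

# include_bias=True (the default) adds the constant column 1 in front
full = PolynomialFeatures(degree=2).fit_transform(z)

# interaction_only=True keeps the cross term a*b but drops a^2 and b^2
cross = PolynomialFeatures(
    degree=2, include_bias=False, interaction_only=True
).fit_transform(z)

print(full)    # columns: 1, a, b, a^2, a*b, b^2
print(cross)   # columns: a, b, a*b
```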

Fighting!

 


Origin www.cnblogs.com/dahongbao/p/11072057.html