Data preprocessing is the first step of data analysis; obtaining clean data is a prerequisite for good analysis results. Today I learned several entry-level data preprocessing methods, so here are my notes. Tool: Python's scikit-learn (sklearn).
1. Row normalization / regularization: Normalizer

Scales each row so that the sum of its squared values is 1 (unit L2 norm). Commonly used in text classification and clustering.
import numpy as np
import pandas as pd
z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0., 2.]})
from sklearn.preprocessing import Normalizer
Normalizer().fit_transform(z)
# the equivalent three-step sklearn call:
a = Normalizer()  # instantiate
a.fit(z)          # fit the model
a.transform(z)    # transform the data
After normalization, every row of z has unit L2 norm.
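As a quick check, the following sketch (re-creating the sample frame z from above) normalizes the rows and verifies that each row's L2 norm is 1:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

# same sample frame as above
z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0., 2.]})

z_norm = Normalizer().fit_transform(z)

# every row now has unit L2 norm
row_norms = np.linalg.norm(z_norm, axis=1)
print(row_norms)  # each entry is 1.0
```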
2. Column normalization / standardization: StandardScaler

This method assumes the data is approximately Gaussian; it standardizes each column to mean 0 and variance 1.
from sklearn.preprocessing import StandardScaler
StandardScaler().fit_transform(z)
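To confirm the "mean 0, variance 1" claim, this sketch (again re-creating z) standardizes the columns and inspects the column statistics:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0., 2.]})
z_std = StandardScaler().fit_transform(z)

# each column now has mean 0 and (population) standard deviation 1
print(z_std.mean(axis=0))  # approximately [0, 0]
print(z_std.std(axis=0))   # approximately [1, 1]
```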
3. Interval scaling / range transformation: MinMaxScaler

Maps the data to [0, 1]. However, newly added data can change the maximum/minimum values, forcing the scaling to be redone, so this method is not well suited to machine learning methods whose distance metric depends on a fixed scale.
# interval scaling / range transformation
from sklearn.preprocessing import MinMaxScaler
MinMaxScaler().fit_transform(z)
4. Feature binarization

Set a threshold: values greater than the threshold become 1, values less than or equal to it become 0. This can be used to turn a target vector into a binary classification target.
# feature binarization
from sklearn.preprocessing import Binarizer
Binarizer(threshold=1).fit_transform(z)  # threshold set to 1
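For the sample frame z, a threshold of 1 should map 2, 6, and 3 to 1, and 1 and 0 to 0 (1 is not strictly greater than the threshold). A sketch to verify:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0., 2.]})
binary = Binarizer(threshold=1).fit_transform(z)

# values strictly greater than 1 become 1; values <= 1 become 0
print(binary)
```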
5. One-hot encoding

Classifiers often assume by default that data is continuous and ordered, but many features are discrete. One-hot encoding therefore gives every distinct value of a discrete feature its own column: a 1 means the sample takes that column's value, a 0 means it does not.
# one-hot encoding
z3 = pd.DataFrame({"A": ["M", "F", "M", "F"],
                   "B": ["first year", "second year", "third year", "first year"]})
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(categories="auto")
enc.fit(z3)
ans = enc.transform([["M", "first year"]]).toarray()  # toarray() converts the sparse result to an array for display
OneHotEncoder(categories="auto").fit_transform(z3).toarray()
In this example, the four samples are encoded into a 4 x 5 matrix whose columns correspond to ["F", "M", "first year", "second year", "third year"].
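A sketch (re-creating z3 from above) that confirms the encoded shape and the column categories the encoder learned:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

z3 = pd.DataFrame({"A": ["M", "F", "M", "F"],
                   "B": ["first year", "second year", "third year", "first year"]})
enc = OneHotEncoder(categories="auto")
codes = enc.fit_transform(z3).toarray()

print(codes.shape)      # (4, 5): 2 gender columns + 3 year columns
print(enc.categories_)  # learned categories for each input column
```

The first sample ("M", "first year") becomes [0, 1, 1, 0, 0]: a 1 in the "M" column and a 1 in the "first year" column.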
6. Missing-value imputation

The main approach is to fill in (impute) the missing values.
z5 = pd.DataFrame({"A": [1, 5, np.nan], "B": [np.nan, 3, 5], "C": [1, 2, 3]})
from sklearn.impute import SimpleImputer
SimpleImputer().fit_transform(z5)  # by default, missing values are replaced with the column mean
SimpleImputer(strategy='constant').fit_transform(z5)  # strategy='constant' fills numeric missing values with 0 (the default fill_value)
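With the default mean strategy, the NaN in column A should become (1+5)/2 = 3 and the NaN in column B should become (3+5)/2 = 4. A sketch re-creating z5 to check:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

z5 = pd.DataFrame({"A": [1, 5, np.nan], "B": [np.nan, 3, 5], "C": [1, 2, 3]})

# default strategy='mean': each NaN is replaced by its column's mean
filled = SimpleImputer().fit_transform(z5)
print(filled)
```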
7. Polynomial feature construction

For two features a and b, the degree-2 polynomial features are 1, a, b, a^2, ab, b^2.
# polynomial feature construction
from sklearn.preprocessing import PolynomialFeatures
p1 = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)  # produce squared terms and cross terms
p2 = p1.fit_transform(z)
p2_df = pd.DataFrame(p2, columns=p1.get_feature_names())  # add column names to p2

p2_df
    x0   x1  x0^2  x0 x1  x1^2
0  2.0  3.0   4.0    6.0   9.0
1  1.0  0.0   1.0    0.0   0.0
2  6.0  2.0  36.0   12.0   4.0
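With include_bias=False, the constant column 1 is dropped and the output columns are a, b, a^2, ab, b^2. A sketch (re-creating z) verifying the first row, (a, b) = (2, 3):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

z = pd.DataFrame({"a": [2., 1., 6.], "b": [3., 0., 2.]})
p1 = PolynomialFeatures(degree=2, include_bias=False)
p2 = p1.fit_transform(z)

# columns: a, b, a^2, a*b, b^2 -> first row is [2, 3, 4, 6, 9]
print(p2[0])
print(p2.shape)  # (3, 5)
```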
Fighting!