Data preprocessing and simple feature construction

This article draws on an article from the "Data Scientist Union".

First, making data dimensionless: min-max normalization, mean-variance normalization, and the Scaler in sklearn

When features have different scales (dimensions), the raw sample values do not reflect the relative importance of each feature, so a normalization method is needed.

The general solution is to map all of the data onto the same scale.

1. There are two commonly used data normalization methods (a numpy sketch of both follows):

Min-max normalization (normalization):

 Maps all data into the range [0, 1]. Min-max normalization is suitable for distributions with clear boundaries (e.g., scores from 0 to 100, grayscale values from 0 to 255); it is strongly affected by outliers.

Mean-variance normalization (standardization):

 Normalizes all data into a distribution with mean 0 and variance 1. This applies to data without clear boundaries, where extreme values may be present.
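A minimal numpy sketch of both methods, on a made-up array:

import numpy as np

x = np.array([1., 2., 3., 4., 100.])   # made-up data with one extreme value

# min-max normalization: x -> (x - min) / (max - min), results lie in [0, 1]
x_minmax = (x - np.min(x)) / (np.max(x) - np.min(x))

# mean-variance normalization: x -> (x - mean) / std, result has mean 0 and variance 1
x_standard = (x - np.mean(x)) / np.std(x)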


2. The Scaler in sklearn

For modeling, the data set is split into a training set and a test set.

To normalize the training set, we need to compute the training set's mean, mean_train, and its standard deviation, std_train.

The question is: when normalizing the test set, should we compute the mean and variance of the test data itself?

The answer is no. When normalizing the test set, we still have to use the training set's mean (mean_train) and standard deviation (std_train). This is because the test data simulates the real environment, and in the real environment we may not be able to obtain the mean and variance of incoming data in order to normalize it. We can only use the formula (x_test - mean_train) / std_train. Normalization is itself part of the algorithm: all data that arrives later should go through the same process.

So we need to save the mean and standard deviation obtained from the training set.

sklearn provides a class dedicated to this kind of data normalization: StandardScaler.

# ## load the dataset

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=666)

# ## normalization
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
# fitting the scaler works just like training a model
standardScaler.fit(X_train)
standardScaler.mean_
standardScaler.scale_    # describes the spread of the data; replaces the deprecated std_ attribute

# use transform
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)

 

Second, handling missing values

1. Determine the extent of missing values

Compute the proportion of missing values for each field; then, based on the missing proportion and the importance of each field, formulate a handling strategy for each (the original article summarizes the strategies in a figure, omitted here).
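A minimal sketch of this step, with a made-up pandas DataFrame standing in for the real data:

import numpy as np
import pandas as pd

# toy data for illustration
df = pd.DataFrame({'Feature1': [1.0, np.nan, 3.0],
                   'Feature2': ['A', 'B', None]})

# proportion of missing values in each field, highest first
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio)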

 

2. Remove unneeded fields

It is recommended to back up the data before each cleaning step, or to try the step on a small sample first and process the full data set only after the trial succeeds.
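A small sketch of both ideas with pandas (df as in the sketch above):

# keep an untouched copy before a destructive cleaning step
df_backup = df.copy()

# or trial-run the step on a small sample before processing the full data
sample = df.sample(frac=0.1, random_state=0)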

3. Fill in missing values

1) Manual filling (filling missing values by hand)

Fill in the values by hand, based on business knowledge.

2) Special-value filling (treating missing attribute values as special values)

Treat the null value as a special attribute value of its own, distinct from every other value. For example, fill all null values with "unknown". This is generally used as an intermediate or temporary filling step.

df['Feature'].fillna('unknown', inplace=True)

3) Filling with statistics

If the missing rate is low (less than 95%) and the field is of low importance, fill in values according to the data distribution.

Common statistics used for filling:
  • Mean:

    For data consistent with a uniform distribution, fill missing values with the mean of the variable.

  • Median:

    When the data distribution is skewed, fill missing values with the median.

  • Mode:

    For discrete features, missing values can be filled with the mode.

Mean filling method:

Split the attributes of the initial data set into numerical and non-numerical attributes and process them separately.

# example of the operation with the pandas library
display(df.head(10))
# data before filling
   Feature1 Feature2  Label
0       1.0        A      1
1       2.0        A      1
2       3.0        A      1
3       4.0        C      1
4       NaN        A      1
5       2.0     None      0
6       3.0        B      0
7       3.0     None      0
8       NaN        B      0
9       NaN        B      0

# mean filling (numerical attribute)
df['Feature1'].fillna(df['Feature1'].mean(), inplace=True)

# mode filling (non-numerical attribute)
df['Feature2'].fillna(df['Feature2'].mode().iloc[0], inplace=True)

display(df.head(10))
# data after filling
   Feature1 Feature2  Label
0  1.000000        A      1
1  2.000000        A      1
2  3.000000        A      1
3  4.000000        C      1
4  2.571429        A      1
5  2.000000        A      0
6  3.000000        B      0
7  3.000000        A      0
8  2.571429        B      0
9  2.571429        B      0
Conditional mean filling method (Conditional Mean Completer):

With this method, the mean / mode / median is not taken over all objects in the data set, but over the objects that have the same decision attribute value (i.e., the same label) as the object being filled.

# conditional mean filling
import pandas as pd

def condition_mean_fillna(df, label_name, feature_name):
    # per-label mean of the feature to be filled
    mean_feature_name = '{}Mean'.format(feature_name)
    group_df = df.groupby(label_name)[feature_name].mean().reset_index().rename(columns={feature_name: mean_feature_name})

    # attach each row's group mean, then use it where the feature is missing
    df = pd.merge(df, group_df, on=label_name, how='left')
    df.loc[df[feature_name].isnull(), feature_name] = df.loc[df[feature_name].isnull(), mean_feature_name]
    df.drop(mean_feature_name, inplace=True, axis=1)
    return df

# fill the numerical field 'Feature1' with per-label means
df = condition_mean_fillna(df, 'Label', 'Feature1')

4) Filling with model predictions

Use the field to be filled as the label and the rows without missing values as training data; build a classification / regression model; then predict the missing values of the field and fill them in.

K-nearest neighbors (KNN)

First determine, by Euclidean distance or correlation analysis, the K samples closest to the sample with missing data; then estimate the missing value as a weighted average (for continuous fields) or a vote (for discrete fields) over these K neighbors.

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

def knn_missing_filled(x_train, y_train, test, k=3, dispersed=True):
    '''
    @param x_train: feature columns of the rows with no missing values
    @param y_train: the field to be filled, taken from the rows where it is present
    @param test: feature columns of the rows whose values are to be filled
    '''
    if dispersed:
        # discrete field: classify, weighting neighbors by inverse distance
        clf = KNeighborsClassifier(n_neighbors=k, weights="distance")
    else:
        # continuous field: regress, weighting neighbors by inverse distance
        clf = KNeighborsRegressor(n_neighbors=k, weights="distance")

    clf.fit(x_train, y_train)
    return test.index, clf.predict(test)
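
A hypothetical usage sketch with the toy DataFrame from the filling examples above (in a real task the predictors would be all complete feature columns, not just 'Label'):

# fill missing values of the continuous field 'Feature1' by KNN regression
known = df[df['Feature1'].notnull()]
unknown = df[df['Feature1'].isnull()]
idx, values = knn_missing_filled(known[['Label']], known['Feature1'],
                                 unknown[['Label']], k=3, dispersed=False)
df.loc[idx, 'Feature1'] = values
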
Regression

Fit a regression equation based on the complete data. For an object containing a null value, substitute the known attribute values into the equation to estimate the unknown attribute value, and fill with that estimate. When the variables are not linearly related, this leads to biased estimates. Linear regression is the most common choice.
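
A minimal sketch of regression filling under the same toy assumptions (the column names 'Feature1' and 'Label' come from the hypothetical DataFrame above):

from sklearn.linear_model import LinearRegression

# fit on rows where the field is present, predict where it is missing
known = df[df['Feature1'].notnull()]
unknown = df[df['Feature1'].isnull()]
reg = LinearRegression().fit(known[['Label']], known['Feature1'])
df.loc[unknown.index, 'Feature1'] = reg.predict(unknown[['Label']])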

4. Re-acquire the data

If an indicator is very important but its missing rate is high, you need to talk to the people in charge of data collection or the business staff to find out whether the relevant data can be obtained through other channels.

 

Third, handling categorical features: encoding and dummy variables

Reference Links: https://www.cnblogs.com/juanjiang/archive/2019/05/30/10948849.html

In machine learning, most algorithms, such as logistic regression, support vector machines (SVM), and k-nearest neighbors, can only process numerical data and cannot handle text. In sklearn, apart from the algorithms dedicated to text processing, all other algorithms require an array or matrix as input when fitting and cannot accept character data types (in fact, a hand-written naive Bayes or decision tree could process text, but sklearn requires numerical input). In reality, however, many labels and features are not collected as numbers. For example, education level may take the values ["primary school", "junior high school", "high school", "university"], and payment method may include ["Alipay", "cash", "WeChat"], and so on. In such cases, to make the data fit the algorithms and the library, we must encode the data, that is, convert text into a numerical data type.

preprocessing.LabelEncoder: for labels; converts categories into numerical values

preprocessing.OrdinalEncoder: for features; converts categorical features into numerical values

preprocessing.OneHotEncoder: one-hot encoding; creates dummy variables
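
A minimal sketch of the three classes (the example values reuse the education / payment examples above):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

# LabelEncoder: 1-D input, intended for the target column
y = ["high school", "university", "primary school", "university"]
y_encoded = LabelEncoder().fit_transform(y)      # one integer per class

# OrdinalEncoder: 2-D input, intended for feature columns
X = np.array([["Alipay"], ["cash"], ["WeChat"], ["cash"]])
X_ordinal = OrdinalEncoder().fit_transform(X)    # one integer per category

# OneHotEncoder: one 0/1 dummy column per category
X_onehot = OneHotEncoder().fit_transform(X).toarray()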

 

Fourth, handling continuous features: binarization and segmentation

  • sklearn.preprocessing.Binarizer

  Binarizes data according to a threshold (setting feature values to 0 or 1); used for processing continuous variables. Values greater than the threshold map to 1, and values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values are mapped to 1. Binarization is a common operation on text count data, where the analyst may decide to consider only the presence or absence of a phenomenon. It can also be used as a preprocessing step for estimators that assume Boolean random variables (e.g., modeling with the Bernoulli distribution in a Bayesian setting).
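
A minimal sketch of Binarizer on a made-up array:

import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[1.0, -1.0, 2.0],
              [0.0,  1.5, -0.5]])

# values greater than the threshold map to 1, all others to 0
binarizer = Binarizer(threshold=1.0)
print(binarizer.fit_transform(X))
# [[0. 0. 1.]
#  [0. 1. 0.]]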

 
