Learn feature processing methods such as feature preprocessing, missing-value filling, outlier handling, and data binning
Learn the corresponding methods for feature interaction, feature encoding, and feature selection
Introduction
Data preprocessing:
Filling in missing values
Handling time formats
Converting object-type features to numeric values
Outlier handling:
Based on the 3-sigma rule
Based on box plots
Data binning:
Fixed-width binning
Quantile binning
Binning discrete numeric data
Binning continuous numeric data
Chi-square binning
Feature interaction:
Combining features with other features
Deriving new features from pairs of features
Other attempts at feature derivation
Feature encoding:
One-hot encoding
Label encoding
Feature selection:
Filter
Wrapper
Embedded
Code example
Import the packages and read the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
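A minimal sketch of loading the data and filling missing values; the file names train.csv and testA.csv are assumptions, while numerical_fea and the label column isDefault are names reused further below in this section.
# Assumed file names for the competition data
train = pd.read_csv('train.csv')
testA = pd.read_csv('testA.csv')
# Split columns into numeric and categorical feature lists;
# numerical_fea is reused by the outlier-detection loop below
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(train.select_dtypes(include=['object']).columns)
numerical_fea.remove('isDefault')  # isDefault is the label, not a feature
# Fill numeric features with the column median, categorical ones with the mode
train[numerical_fea] = train[numerical_fea].fillna(train[numerical_fea].median())
for f in category_fea:
    train[f] = train[f].fillna(train[f].mode()[0])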
Time format handling
for data in [train, testA]:
    # earliesCreditLine ends with a four-digit year; keep only that year
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))
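For a full date column, pandas' datetime parsing works as well. A sketch assuming a hypothetical issueDate column in YYYY-MM-DD format, with an arbitrarily chosen reference date:
for data in [train, testA]:
    # Parse the date string (issueDate is a hypothetical column name),
    # then measure days elapsed since a reference date
    data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
    start_date = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
    data['issueDateDT'] = (data['issueDate'] - start_date).dt.days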
Categorical feature processing
cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership',
                 'verificationStatus', 'purpose', 'postCode', 'regionCode',
                 'applicationType', 'initialListStatus', 'title', 'policyCode']
for f in cate_features:
    print(f, 'number of unique values:', data[f].nunique())
Outlier handling
Method for detecting outliers: the 3-sigma rule (flag values more than three standard deviations from the mean)
def find_outliers_by_3segama(data, fea):
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    # Flag values outside mean +/- 3 standard deviations
    data[fea + '_outliers'] = data[fea].apply(
        lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data
data_train = data_train.copy()
for fea in numerical_fea:
    data_train = find_outliers_by_3segama(data_train, fea)
    print(data_train[fea + '_outliers'].value_counts())
    print(data_train.groupby(fea + '_outliers')['isDefault'].sum())
    print('*' * 10)
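The outline also lists box-plot-based detection. A minimal sketch using the interquartile range, with the conventional 1.5 multiplier (an assumption, not taken from the notes above):
def find_outliers_by_boxplot(data, fea):
    # Box-plot rule: flag points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
    q1 = data[fea].quantile(0.25)
    q3 = data[fea].quantile(0.75)
    iqr = q3 - q1
    lower_rule = q1 - 1.5 * iqr
    upper_rule = q3 + 1.5 * iqr
    data[fea + '_outliers'] = data[fea].apply(
        lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data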
Data binning
From the perspective of model performance, feature binning mainly reduces the complexity of variables and the impact of noise on the model, and strengthens the correlation between the independent and dependent variables, making the model more stable.
What data binning applies to:
Discretizing continuous variables
Merging discrete variables with many states into fewer states
Reasons for binning:
The values within a feature can span a very wide range. Supervised and unsupervised methods that use Euclidean distance as a similarity function, such as k-means clustering, are disproportionately influenced by very large and very small values. One remedy is to discretize the values into intervals, i.e. data binning (also called data bucketing), and then work with the binned results.
Advantages of binning:
Handling missing values: when the data source contains missing values, null can be treated as a separate bin.
Handling outliers: outliers can be absorbed through binning, which improves robustness (resistance to interference). For example, an abnormal age value of 200 can be placed in the "age > 60" bin, removing its influence.
Business interpretability: we intuitively expect variables to act linearly (as x grows, y grows), but x and y often have a nonlinear relationship, which binning followed by a WOE transformation can capture.
Pay special attention to the basic principles of binning when applying it in practice.
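A minimal sketch of fixed-width and quantile binning with pandas, assuming a numeric loan-amount column named loanAmnt (the column name is an assumption):
# Fixed-width binning: integer-divide by 1000 so each bin spans 1000 units
data_train['loanAmnt_bin1'] = np.floor_divide(data_train['loanAmnt'], 1000)
# Fixed-width binning on a log scale, for long-tailed value distributions
data_train['loanAmnt_bin2'] = np.floor(np.log10(data_train['loanAmnt'] + 1))
# Quantile binning: 10 bins holding roughly equal numbers of samples
data_train['loanAmnt_bin3'] = pd.qcut(data_train['loanAmnt'], 10, labels=False)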
Feature encoding
# label-encode: subGrade, postCode, title
# High-cardinality categorical features need to be converted
for col in tqdm(['employmentTitle', 'postCode', 'title', 'subGrade']):
    le = LabelEncoder()
    le.fit(list(data_train[col].astype(str).values) + list(data_test_a[col].astype(str).values))
    data_train[col] = le.transform(list(data_train[col].astype(str).values))
    data_test_a[col] = le.transform(list(data_test_a[col].astype(str).values))
print('Label Encoding done')
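The outline also lists one-hot encoding, which suits the lower-cardinality categorical features. A sketch with pd.get_dummies; which columns are low-cardinality enough is a judgment call, so the list below is an assumption:
# One-hot encode low-cardinality categorical features; the high-cardinality
# ones were label-encoded above, since one-hot would explode the feature space
one_hot_cols = ['grade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode']
data_train = pd.get_dummies(data_train, columns=one_hot_cols, drop_first=True)
data_test_a = pd.get_dummies(data_test_a, columns=one_hot_cols, drop_first=True)
Here drop_first drops one dummy column per feature, which avoids perfect collinearity when the encoded features feed a linear model.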
Feature selection
Filter
a. Variance selection method
b. Correlation coefficient method (Pearson correlation coefficient)
c. Chi-square test
d. Mutual information method
Wrapper (RFE)
a. Recursive feature elimination
Embedded
a. Feature selection based on penalty terms
b. Feature selection based on tree models
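As a concrete example of the filter approach, the chi-square test can rank features with the SelectKBest, chi2, and MinMaxScaler utilities imported at the top; chi2 requires non-negative inputs, hence the scaling. A sketch assuming an id column and the isDefault label (the id name and k=10 are assumptions):
# Chi-square filter selection on the numeric columns, scaled to [0, 1]
features = [f for f in data_train.select_dtypes(exclude=['object']).columns
            if f not in ['id', 'isDefault']]
X = MinMaxScaler().fit_transform(data_train[features].fillna(0))
selector = SelectKBest(chi2, k=10).fit(X, data_train['isDefault'])
selected = [f for f, keep in zip(features, selector.get_support()) if keep]
print('Top 10 features by chi-square score:', selected)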