Insight Trend Series (2) Feature Engineering

The work to be done in feature engineering includes:

  • Data preprocessing: missing value imputation, categorical feature encoding, etc.;
  • Feature construction: building new features;
  • Feature selection: selecting the most important features or reducing feature dimensionality. (Feature selection is not covered in this article.)

1. Data preprocessing

1.1. Missing value filling

Missing value filling strategy:

  • When less than 20% of a feature's values are missing, fill numeric features with the mean or median, and fill categorical features with the mode or treat the missing values as a category of their own;
  • When 20%-80% of the values are missing, fill as above, but numeric features should additionally get an indicator dummy variable that participates in the subsequent modeling;
  • When more than 80% of the values are missing, generate only an indicator dummy variable for the feature and stop using the original variable.

In this article:

  • Missing values are filled with pd.DataFrame's built-in fillna, which can fill with a statistic (such as the mean, median, or mode) or with a specified constant;
  • Numeric features are filled with the median;
  • Categorical features treat missing values as a single category.
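
The code below assumes a missing_values helper and a missing_values_result summary table, presumably carried over from the previous article in this series; their exact names and behavior are assumptions. A minimal sketch consistent with how they are used here:

import numpy as np
import pandas as pd

def missing_values(df):
    # Number of missing values per column (a Series indexed by column name)
    return df.isnull().sum()

# Summary table with the missing percentage per column,
# keeping only the columns that actually have missing values
counts=missing_values(df_train)
missing_values_result=pd.DataFrame({'Missing Values':counts,'% of Total Values':100*counts/len(df_train)})
missing_values_result=missing_values_result[missing_values_result['Missing Values']>0]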

1.1.1. Filter features with missing values

# Names of features with at most 20% missing values
missing_columns_under_20=missing_values_result[missing_values_result['% of Total Values']<=20].index
# Names of features with 20%-80% missing values
missing_columns_over_20=missing_values_result[missing_values_result['% of Total Values']>20].index

1.1.2. Filter out the numeric features among those with missing values

# Numeric features with at most 20% missing values
numeric_missing_columns_under_20=df_train[missing_columns_under_20].select_dtypes(exclude='object').columns
# Numeric features with 20%-80% missing values
numeric_missing_columns_over_20=df_train[missing_columns_over_20].select_dtypes(exclude='object').columns

1.1.3. Create the indicator dummy variable names

# Build the indicator dummy variable names
numeric_missing_columns_over_20_isnan=list(map(lambda x: x+'_ISNAN',numeric_missing_columns_over_20))
# For features with 20%-80% missing values:
# get the medians (computed on the training set)
median_values_over_20=df_train[numeric_missing_columns_over_20].median()
# set the indicator dummy variables
df_train[numeric_missing_columns_over_20_isnan]=df_train[numeric_missing_columns_over_20].isnull().astype(int)
df_test[numeric_missing_columns_over_20_isnan]=df_test[numeric_missing_columns_over_20].isnull().astype(int)

1.1.4. Use the median to fill in missing values

# Fill the training set and the test set with the training-set medians
df_train[numeric_missing_columns_over_20]=df_train[numeric_missing_columns_over_20].fillna(median_values_over_20)
df_test[numeric_missing_columns_over_20]=df_test[numeric_missing_columns_over_20].fillna(median_values_over_20)
# For features with less than 20% missing values:
# get the medians
median_values_under_20=df_train[numeric_missing_columns_under_20].median()
# fill the training set and the test set with the training-set medians
df_train[numeric_missing_columns_under_20]=df_train[numeric_missing_columns_under_20].fillna(median_values_under_20)
df_test[numeric_missing_columns_under_20]=df_test[numeric_missing_columns_under_20].fillna(median_values_under_20)

At this point, the missing values of the numeric features have all been filled.
Next, fill in the missing values of the categorical features.

1.1.5. Get the names of all features with missing values

missing_value_columns=missing_values_result.index

1.1.6. Filter out the categorical features among them

categorical_missing_columns=df_train[missing_value_columns].select_dtypes(include='object').columns

1.1.7. Fill missing values uniformly with the string 'NaN'

df_train[categorical_missing_columns]=df_train[categorical_missing_columns].replace({np.nan:'NaN'})
df_test[categorical_missing_columns]=df_test[categorical_missing_columns].replace({np.nan:'NaN'})

1.1.8. Check that no missing values remain

missing_values(df_train)
# Output:
SK_ID_CURR                            0
TARGET                                0
NAME_CONTRACT_TYPE                    0
CODE_GENDER                           0
FLAG_OWN_CAR                          0
                                     ..
FLOORSMAX_MODE_ISNAN                  0
YEARS_BEGINEXPLUATATION_AVG_ISNAN     0
YEARS_BEGINEXPLUATATION_MEDI_ISNAN    0
YEARS_BEGINEXPLUATATION_MODE_ISNAN    0
TOTALAREA_MODE_ISNAN                  0
Length: 167, dtype: int64
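
A quick programmatic version of the same check (a sketch; the zeros above already confirm it for the training set):

# The total count of missing values in the training set should now be 0
assert df_train.isnull().sum().sum()==0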

1.2. Categorical feature encoding

1.2.1. Natural number encoding

Natural number encoding (label encoding) assigns an integer to each category of a feature and can be implemented with sklearn's preprocessing.LabelEncoder. Its disadvantage: the assigned integers impose an arbitrary order on the categories, which the model may wrongly interpret as meaningful.
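
A minimal sketch of label encoding on a toy column (the two values are illustrative, borrowed from the NAME_CONTRACT_TYPE feature):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

s=pd.Series(['Cash loans','Revolving loans','Cash loans'])
le=LabelEncoder()
encoded=le.fit_transform(s)  # array([0, 1, 0])
# The mapping is alphabetical and arbitrary; the model may still read it as ordinal
print(dict(zip(le.classes_,le.transform(le.classes_))))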

1.2.2. One-hot encoding

One-hot encoding creates a new column for each category (value) of the feature.

  • The commonly used function for one-hot encoding is get_dummies() in pandas.
  • Disadvantage of one-hot encoding: a feature with too many distinct values leads to a sharp increase in dimensionality after encoding.
  • One-hot encoding is the recommended choice for encoding categorical features.
  • One-hot encode the training set and the test set:
df_train=pd.get_dummies(df_train)
df_test=pd.get_dummies(df_test)
# Output:
print('Training Features shape:',df_train.shape)
print('Testing Features shape:',df_test.shape)

After one-hot encoding, the training set and the test set may end up with different features, because some values of a categorical feature in the training set may not appear in the test set. A data alignment step is therefore needed: remove from the encoded training set the features that are not in the test set, so that both sets share the same features.

  • Data alignment uses the align function in pandas, which returns the aligned pair of DataFrames.
# Alignment would drop the training set's TARGET column, so save it first
train_labels=df_train['TARGET']
# join='inner' aligns on the intersection of the two indexes
# axis=1 aligns on the column index
df_train,df_test=df_train.align(df_test,join='inner',axis=1)
df_train['TARGET']=train_labels
# Output:
print('Training Features shape:',df_train.shape)
print('Testing Features shape:',df_test.shape)
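
A quick sketch to verify the alignment: apart from TARGET, the two sets should now share exactly the same columns.

# The feature columns (excluding TARGET) should match after alignment
assert set(df_train.columns)-{'TARGET'}==set(df_test.columns)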

2. Feature construction

2.1. Polynomial features

A polynomial feature is a new feature built by combining existing features: commonly their products (such as EXT_SOURCE_1 × EXT_SOURCE_2), higher-order terms (such as EXT_SOURCE_1^2), or a combination of the two (such as EXT_SOURCE_1 × EXT_SOURCE_2^2).
You can try some of these combined features and see whether they help the model predict the user's repayment.
Here, sklearn's PolynomialFeatures is used to build polynomial features from the EXT_SOURCE features and DAYS_BIRTH.
The degree of the combined features is capped at 3 (too high a degree produces too many combined features and makes the model prone to overfitting).

2.1.1. Construct polynomial features up to degree 3

from sklearn.preprocessing import PolynomialFeatures
# Select the desired features
columns_select=['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH']
df_train_select=df_train[columns_select]
df_test_select=df_test[columns_select]
# Construct a PolynomialFeatures object of degree 3
poly_transformer=PolynomialFeatures(degree=3,include_bias=False)
# Fit the polynomial features on the training set
poly_transformer.fit(df_train_select)
# Build the polynomial features for the training set and the test set
poly_features_train=poly_transformer.transform(df_train_select)
poly_features_test=poly_transformer.transform(df_test_select)
print('Polynomial Features shape:',poly_features_train.shape)
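
To inspect which combinations were generated, the transformer can list the new feature names (get_feature_names_out requires sklearn >= 1.0; older versions used get_feature_names):

# Names of the generated polynomial features, e.g. 'EXT_SOURCE_1 EXT_SOURCE_2'
print(poly_transformer.get_feature_names_out(columns_select)[:10])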

2.2. Domain knowledge features

If you are familiar with the credit business, you can also construct domain knowledge features based on it, such as:

  • CREDIT_INCOME_PERCENT: loan amount as a percentage of the user's income
  • ANNUITY_INCOME_PERCENT: loan annuity (the repayment amount per period) as a percentage of the user's income
  • CREDIT_TERM: the number of repayment periods
  • DAYS_EMPLOYED_PERCENT: the ratio of days employed to the user's age in days

The domain knowledge features are constructed as follows:

  • CREDIT_INCOME_PERCENT=AMT_CREDIT/AMT_INCOME_TOTAL
  • ANNUITY_INCOME_PERCENT=AMT_ANNUITY/AMT_INCOME_TOTAL
  • CREDIT_TERM=AMT_CREDIT/AMT_ANNUITY
  • DAYS_EMPLOYED_PERCENT=DAYS_EMPLOYED/DAYS_BIRTH

The meanings of the features are as follows:

  • AMT_CREDIT: loan amount
  • AMT_INCOME_TOTAL: User income
  • AMT_ANNUITY: Annuity

df_train_domain=df_train.copy()
df_test_domain=df_test.copy()

For the training set:

df_train_domain['CREDIT_INCOME_PERCENT']=df_train_domain['AMT_CREDIT']/df_train_domain['AMT_INCOME_TOTAL']
df_train_domain['ANNUITY_INCOME_PERCENT']=df_train_domain['AMT_ANNUITY']/df_train_domain['AMT_INCOME_TOTAL']
df_train_domain['CREDIT_TERM']=df_train_domain['AMT_CREDIT']/df_train_domain['AMT_ANNUITY']
df_train_domain['DAYS_EMPLOYED_PERCENT']=df_train_domain['DAYS_EMPLOYED']/df_train_domain['DAYS_BIRTH']

For the test set:

df_test_domain['CREDIT_INCOME_PERCENT']=df_test_domain['AMT_CREDIT']/df_test_domain['AMT_INCOME_TOTAL']
df_test_domain['ANNUITY_INCOME_PERCENT']=df_test_domain['AMT_ANNUITY']/df_test_domain['AMT_INCOME_TOTAL']
df_test_domain['CREDIT_TERM']=df_test_domain['AMT_CREDIT']/df_test_domain['AMT_ANNUITY']
df_test_domain['DAYS_EMPLOYED_PERCENT']=df_test_domain['DAYS_EMPLOYED']/df_test_domain['DAYS_BIRTH']

Visualization of the new features:
plot the probability density distributions of the domain knowledge features under the two labels.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,20))
# Loop over the domain knowledge features
for i,feature in enumerate(['CREDIT_INCOME_PERCENT','ANNUITY_INCOME_PERCENT','CREDIT_TERM','DAYS_EMPLOYED_PERCENT']):
	# Create a subplot
	plt.subplot(4,1,i+1)
	# Density of the feature for users who repaid
	sns.kdeplot(df_train_domain.loc[df_train_domain['TARGET']==0,feature],label='TARGET==0')
	# Density of the feature for users who defaulted
	sns.kdeplot(df_train_domain.loc[df_train_domain['TARGET']==1,feature],label='TARGET==1')
	# Set the title and axis labels
	plt.title('Distribution of %s by Target Value' %feature)
	plt.xlabel('%s' %feature)
	plt.ylabel('Density')
plt.tight_layout(h_pad=2.5)

[Figure: probability density distributions of the four domain knowledge features, split by TARGET value]

These features do seem to separate the two labels; of course, their real value depends on the results once they are fed into a model.

  • The whole process involves a great deal of feature manipulation, which is a real test of brainpower, mental stamina, and physical stamina alike.
  • The Insight Trend series is not over yet: the next article will build the model.

Origin: blog.csdn.net/weixin_42961082/article/details/113856434