Financial Risk Control Task 3: Feature Engineering

Learning targets

  • Learn feature-processing methods such as feature preprocessing, missing-value and outlier handling, and data bucketing
  • Learn the corresponding methods for feature interaction, feature encoding, and feature selection

Introduction

  • Data preprocessing:
    • Filling missing values
    • Time format handling
    • Converting object-type features to numeric
  • Outlier handling:
    • Based on the 3-sigma rule
    • Based on box plots
  • Data binning:
    • Fixed-width binning
    • Quantile binning
    • Binning discrete numerical data
    • Binning continuous numerical data
    • Chi-square binning
  • Feature interaction:
    • Combinations of features
    • Derivations between features
    • Other attempts at feature derivation
  • Feature encoding:
    • One-hot encoding
    • Label encoding
  • Feature selection:
    • Filter
    • Wrapper
    • Embedded

Code examples

Import packages and read the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor
import warnings
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
warnings.filterwarnings('ignore')
train = pd.read_csv('./data/train.csv')
testA = pd.read_csv('./data/testA.csv')

Feature preprocessing

numerical_fea = list(train.select_dtypes(exclude=['object']).columns)
category_fea = list(filter(lambda x: x not in numerical_fea,list(train.columns)))
label = 'isDefault'
numerical_fea.remove(label)
Missing value filling

# Check for missing values
train.isnull().sum()


# Fill numerical features with the median of the training set
train[numerical_fea] = train[numerical_fea].fillna(train[numerical_fea].median())
testA[numerical_fea] = testA[numerical_fea].fillna(train[numerical_fea].median())

# Fill categorical features with the mode of the training set
train[category_fea] = train[category_fea].fillna(train[category_fea].mode().iloc[0])
testA[category_fea] = testA[category_fea].fillna(train[category_fea].mode().iloc[0])
train.isnull().sum()


Time Format Handling

for data in [train, testA]:
    data['issueDate'] = pd.to_datetime(data['issueDate'],format='%Y-%m-%d')
    startdate = datetime.datetime.strptime('2007-06-01','%Y-%m-%d')
    data['issueDateDT'] = data['issueDate'].apply(lambda x: x-startdate).dt.days

train['employmentLength'].value_counts(dropna=False).sort_index()


Converting object-type features to numeric features

def employmentLength_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])

for data in [train,testA]:
    data['employmentLength'].replace('10+ years','10 years', inplace=True)
    data['employmentLength'].replace('< 1 year','0 year', inplace=True)
    data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
data['employmentLength'].value_counts(dropna=False).sort_index()


train['earliesCreditLine'].sample(5)


for data in [train,testA]:
    data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))

Categorical feature processing

cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership','verificationStatus', 'purpose', 'postCode', 'regionCode', 'applicationType', 'initialListStatus', 'title', 'policyCode']
for f in cate_features:
    print(f, 'number of unique values:', data[f].nunique())


Outlier handling

Detect outliers with the 3-sigma rule: values outside mean ± 3 standard deviations are flagged as outliers.

def find_outliers_by_3segama(data, fea):
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    data[fea + '_outliers'] = data[fea].apply(
        lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data

# Work on copies; the following sections use data_train / data_test_a
data_train = train.copy()
data_test_a = testA.copy()
for fea in numerical_fea:
    data_train = find_outliers_by_3segama(data_train, fea)
    print(data_train[fea + '_outliers'].value_counts())
    print(data_train.groupby(fea + '_outliers')['isDefault'].sum())
    print('*' * 10)


# Delete the outliers
for fea in numerical_fea:
    data_train = data_train[data_train[fea + '_outliers'] == 'normal']
    data_train = data_train.reset_index(drop=True)
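
The introduction also mentions box-plot based outlier detection. A minimal sketch using the interquartile-range (IQR) rule that box plots are built on; the helper name and the conventional 1.5 multiplier are illustrative choices, not code from the original post:

def find_outliers_by_boxplot(data, fea, k=1.5):
    # Box-plot (IQR) rule: flag values outside [Q1 - k*IQR, Q3 + k*IQR]
    q1 = data[fea].quantile(0.25)
    q3 = data[fea].quantile(0.75)
    iqr = q3 - q1
    lower_rule = q1 - k * iqr
    upper_rule = q3 + k * iqr
    data[fea + '_box_outliers'] = data[fea].apply(
        lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data

# Example: flag box-plot outliers for a single numerical feature
data_train = find_outliers_by_boxplot(data_train, 'loanAmnt')
print(data_train['loanAmnt_box_outliers'].value_counts())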

Data bucketing

  • Purpose of feature binning:
    • From a modelling perspective, binning mainly reduces the complexity of a variable and the impact of its noise on the model, and strengthens the relationship between the independent and dependent variables, making the model more stable.
  • What gets bucketed:
    • Continuous variables are discretized
    • Discrete variables with many levels are merged into a few levels
  • Reasons for binning:
    • The values of a feature may span several orders of magnitude. Methods such as k-means clustering that use Euclidean distance as the similarity measure are then dominated by large-valued features and barely influenced by small-valued ones. One remedy is to map the values into intervals, i.e. data bucketing (also called data binning), and work with the binned result instead.
  • Advantages of binning:
    • Handling missing values: when the source data contains missing values, null can be treated as a separate bin.
    • Handling outliers: outliers can be absorbed by binning, improving robustness (resistance to noise). For example, an abnormal age of 200 can be placed into the "age > 60" bin, removing its influence.
    • Business interpretation: we tend to reason about variables linearly (as x grows, y grows), but the relationship between x and y is often nonlinear; this can be handled with a WOE transformation (a minimal WOE sketch follows this list).
  • Basic principles to pay special attention to when binning:
    • Each bin should contain at least 5% of the observations
    • No bin should contain only good customers
    • For continuous variables, the bad rate should be monotonic across adjacent bins
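
As referenced above, here is a minimal Weight-of-Evidence (WOE) sketch. It is not code from the original post; the helper name, the smoothing constant eps, and the use of quantile bins of loanAmnt are illustrative assumptions, and the formula follows the common convention WOE = ln(bad distribution / good distribution):

def woe_by_bin(df, bin_col, target_col='isDefault', eps=1e-6):
    # WOE per bin: ln( (% of all bads in the bin) / (% of all goods in the bin) ),
    # with a small eps to avoid division by zero / log(0)
    grouped = df.groupby(bin_col)[target_col].agg(['sum', 'count'])
    grouped['bad'] = grouped['sum']
    grouped['good'] = grouped['count'] - grouped['sum']
    bad_dist = (grouped['bad'] + eps) / (grouped['bad'].sum() + eps)
    good_dist = (grouped['good'] + eps) / (grouped['good'].sum() + eps)
    grouped['woe'] = np.log(bad_dist / good_dist)
    return grouped['woe']

# Example: WOE of 10 quantile bins of loanAmnt against the default label
data_train['loanAmnt_qbin'] = pd.qcut(data_train['loanAmnt'], 10, labels=False)
print(woe_by_bin(data_train, 'loanAmnt_qbin'))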
Fixed-width binning

When values span several orders of magnitude, it is best to group by powers of 10 (or powers of any constant): 0-9, 10-99, 100-999, 1000-9999, and so on. Fixed-width binning is very easy to compute, but if there are large gaps in the values it produces many empty bins containing no data.

# Map into evenly spaced bins by integer division; each bin covers a loanAmnt range of width 1000
data['loanAmnt_bin1'] = np.floor_divide(data['loanAmnt'], 1000)

# Map into exponentially wide bins with a log function
data['loanAmnt_bin2'] = np.floor(np.log10(data['loanAmnt']))

Quantile binning

data['loanAmnt_bin3'] = pd.qcut(data['loanAmnt'], 10, labels=False)
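
The introduction also lists chi-square binning, for which no code is given above. A minimal ChiMerge-style sketch, under the assumption that we start from quantile bins and repeatedly merge the adjacent pair with the smallest chi-square statistic until a target number of bins remains; the helper names and the init_bins / max_bins values are illustrative:

def chi2_of_adjacent(bin_a, bin_b):
    # Chi-square statistic for a 2x2 table of (bin, target) counts
    table = np.array([bin_a, bin_b], dtype=float)  # rows: bins, cols: [good, bad]
    total = table.sum()
    row_sum = table.sum(axis=1, keepdims=True)
    col_sum = table.sum(axis=0, keepdims=True)
    expected = row_sum * col_sum / total
    expected[expected == 0] = 1e-6  # avoid division by zero
    return ((table - expected) ** 2 / expected).sum()

def chimerge_binning(df, fea, target='isDefault', init_bins=20, max_bins=5):
    # Start from quantile bins, then merge the adjacent pair with the smallest chi-square
    tmp = df[[fea, target]].copy()
    tmp['bin'] = pd.qcut(tmp[fea], init_bins, labels=False, duplicates='drop')
    counts = tmp.groupby('bin')[target].agg(['count', 'sum'])
    bins = [[c - s, s] for c, s in zip(counts['count'], counts['sum'])]  # [good, bad] per bin
    edges = [tmp.loc[tmp['bin'] == b, fea].max() for b in counts.index]  # upper edge per bin
    while len(bins) > max_bins:
        chis = [chi2_of_adjacent(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(chis))
        bins[i] = [bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1]]
        del bins[i + 1]
        del edges[i]
    return edges  # upper edges of the merged bins

# Example: 5 chi-square-merged bins for loanAmnt
print(chimerge_binning(data_train, 'loanAmnt'))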
Feature interaction

# Target mean encoding of grade and subGrade
for col in ['grade', 'subGrade']:
    temp_dict = data_train.groupby([col])['isDefault'].agg(['mean']).reset_index().rename(
        columns={'mean': col + '_target_mean'})
    temp_dict.index = temp_dict[col].values
    temp_dict = temp_dict[col + '_target_mean'].to_dict()
    data_train[col + '_target_mean'] = data_train[col].map(temp_dict)
    data_test_a[col + '_target_mean'] = data_test_a[col].map(temp_dict)

# Other derived variables: mean and std
for df in [data_train, data_test_a]:
    for item in ['n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8',
                 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']:
        df['grade_to_mean_' + item] = df['grade'] / df.groupby([item])['grade'].transform('mean')
        df['grade_to_std_' + item] = df['grade'] / df.groupby([item])['grade'].transform('std')

Feature encoding

labelEncode: fed directly into tree models

# label-encode: subGrade, postCode, title
# High-cardinality categorical features need to be converted
for col in tqdm(['employmentTitle', 'postCode', 'title', 'subGrade']):
    le = LabelEncoder()
    le.fit(list(data_train[col].astype(str).values) +
           list(data_test_a[col].astype(str).values))
    data_train[col] = le.transform(list(data_train[col].astype(str).values))
    data_test_a[col] = le.transform(list(data_test_a[col].astype(str).values))
print('Label Encoding done')
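
The introduction also lists one-hot encoding, which is not demonstrated above. A minimal sketch using pandas.get_dummies; the choice of low-cardinality columns to encode is an illustrative assumption:

# One-hot encode a few low-cardinality categorical features (illustrative column choice)
onehot_cols = ['homeOwnership', 'verificationStatus', 'purpose']
data_train = pd.get_dummies(data_train, columns=onehot_cols, drop_first=True)
data_test_a = pd.get_dummies(data_test_a, columns=onehot_cols, drop_first=True)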

Feature selection

  • Filter
    a. Variance selection
    b. Correlation coefficient method (Pearson correlation coefficient)
    c. Chi-square test
    d. Mutual information method
  • Wrapper (RFE)
    a. Recursive feature elimination
  • Embedded
    a. Feature selection based on a penalty term (e.g. L1 regularization)
    b. Feature selection based on tree-model feature importance
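
None of these methods is shown in code above. A minimal sketch with one example from each family, using scikit-learn and the lightgbm module already imported; the parameter choices (k=20 for chi-square, 10 features for RFE) are illustrative assumptions:

from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

# Keep only numeric columns; clean up infinities/NaNs left over from the derived ratio features
X = data_train.select_dtypes(include=[np.number]).drop(columns=['isDefault'])
X = X.replace([np.inf, -np.inf], np.nan).fillna(0)
y = data_train['isDefault']

# Filter: drop zero-variance features, then keep the top-k by chi-square
# (chi2 requires non-negative inputs, hence the MinMaxScaler)
X_scaled = MinMaxScaler().fit_transform(X)
X_var = VarianceThreshold(threshold=0.0).fit_transform(X_scaled)
X_chi2 = SelectKBest(chi2, k=20).fit_transform(X_scaled, y)

# Wrapper: recursive feature elimination around a simple base estimator
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X_scaled, y)

# Embedded: keep features whose importance in a tree model exceeds the default threshold
sfm = SelectFromModel(lgb.LGBMClassifier(n_estimators=100))
X_embed = sfm.fit_transform(X, y)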

Origin: blog.csdn.net/BigCabbageFy/article/details/108719488