1. Overall understanding of the data
Basic summary information
train.head()      # View the first rows of the data (the default is 5)
train.tail()      # View the last rows of the data (the default is 5)
train.info()      # Overview of each column: dtype and non-null count
train.describe()  # Summary statistics (count, mean, std, quartiles)
train.shape       # Number of samples and feature dimension
train.columns     # View the feature names
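As a quick sketch of how these calls fit together, using a tiny made-up frame in place of the real `train` data (the column names here are illustrative):

```python
import pandas as pd

# Tiny frame standing in for the real `train` data
train = pd.DataFrame({
    'loanAmnt': [1000, 2000, 1500, 3000],
    'grade': ['A', 'B', 'A', 'C'],
})

print(train.head(2))        # first 2 rows (the default is 5)
print(train.shape)          # (4, 2): 4 samples, 2 features
print(list(train.columns))  # ['loanAmnt', 'grade']
train.info()                # dtype and non-null count per column
print(train.describe())     # count, mean, std, quartiles of numerical columns
```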
When the data has too many columns, pandas omits the middle ones in the displayed output. Transposing with .T makes a long list of features easier to read:
For example: train.head().T
Display all rows/columns and set the display width of values
# Set the display width of values to 200 (the default is 50)
pd.set_option('display.max_colwidth', 200)
# Show all columns (no truncation)
pd.set_option('display.max_columns', None)
# Show all rows (no truncation)
pd.set_option('display.max_rows', None)
2. Understanding of data types
Features are generally divided into categorical and numerical features, and numerical features are further divided into continuous and discrete types.
Numerical features can be fed into a model directly, but risk-control practitioners often bin them, convert the bins into WOE (weight of evidence) encodings, and then build a standard scorecard. From the model's perspective, binning mainly reduces the complexity of a variable, reduces the impact of noise on the model, and improves the correlation between the independent and dependent variables, making the model more stable.
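To illustrate the binning-plus-WOE idea, here is a minimal sketch. The `woe_encode` helper and its equal-frequency binning via `pd.qcut` are assumptions for demonstration, not the author's exact procedure; WOE per bin is the log of the bad-sample share over the good-sample share:

```python
import numpy as np
import pandas as pd

def woe_encode(feature, target, n_bins=5):
    """Bin a numerical feature and compute WOE per bin.

    WOE = ln(share of bads in bin / share of goods in bin),
    assuming a binary target where 1 marks a default ("bad").
    """
    df = pd.DataFrame({'x': feature, 'y': target})
    df['bin'] = pd.qcut(df['x'], q=n_bins, duplicates='drop')
    grouped = df.groupby('bin', observed=True)['y'].agg(['sum', 'count'])
    grouped['bad'] = grouped['sum']
    grouped['good'] = grouped['count'] - grouped['sum']
    bad_share = grouped['bad'] / grouped['bad'].sum()
    good_share = grouped['good'] / grouped['good'].sum()
    # Small constant avoids log(0) for empty classes in a bin
    grouped['woe'] = np.log((bad_share + 1e-6) / (good_share + 1e-6))
    return grouped[['bad', 'good', 'woe']]
```

A monotonic WOE trend across bins is a common sign that the binned feature relates cleanly to the target.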
Numerical features
# Numerical columns
numerical_feature = list(train.select_dtypes(exclude=['object']).columns)
numerical_feature
Numerical features can be further split into continuous, discrete, and single-valued variables:
# Continuous variables
serial_feature = []
# Discrete variables
discrete_feature = []
# Single-valued variables
unique_feature = []
for fea in numerical_feature:
    temp = train[fea].nunique()  # number of unique values
    if temp == 1:
        unique_feature.append(fea)
    # Custom rule: a feature with 10 or fewer distinct values is treated as discrete
    elif temp <= 10:
        discrete_feature.append(fea)
    else:
        serial_feature.append(fea)
Continuous variables
Check the distribution of each numerical variable to see whether it is approximately normal. If a variable is not normal, you can apply a log transform and then check the distribution again.
Why normality matters: approximately normal data can help some models converge faster; some models assume normally distributed inputs (e.g. GMM, KNN); and heavily skewed data may distort model predictions.
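A small sketch of the log-and-recheck step, using a synthetic right-skewed series instead of a real column; the `scipy.stats.skew` check is an added assumption, not part of the original walkthrough:

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed values (lognormal), standing in for a real feature
rng = np.random.default_rng(42)
values = rng.lognormal(mean=10, sigma=1, size=1000)

print('skew before:', skew(values))     # strongly positive
log_values = np.log1p(values)           # log(1 + x), which also handles zeros
print('skew after:', skew(log_values))  # close to 0, i.e. roughly normal
```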
Visually check for normality.
# Visualize the distribution of each numerical feature
f = pd.melt(train, value_vars=serial_feature)
g = sns.FacetGrid(f, col="variable", col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
View the distribution of a single continuous variable:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(1, figsize=(8, 5))
sns.distplot(train['feature_name'], bins=40)  # replace 'feature_name' with the column to inspect
plt.xlabel('feature_name')
View a variable's distribution conditioned on the label:
# Distribution of loanAmnt for each label value
# (`label` is defined below as label = train.isDefault)
import seaborn as sns
sns.kdeplot(train.loanAmnt[label[label == 1].index], label='1', shade=True)
sns.kdeplot(train.loanAmnt[label[label == 0].index], label='0', shade=True)
plt.xlabel('loanAmnt')
plt.ylabel('Density')
Check the distribution of annualIncome:
plt.figure(1, figsize=(8, 5))
sns.distplot(train['annualIncome'])
plt.xlabel('annualIncome')
Discrete variables
Number of distinct values for each discrete variable:
for f in discrete_feature:
    print(f, 'number of categories:', train[f].nunique())
Visualization
import seaborn as sns
import matplotlib.pyplot as plt

df_ = train[discrete_feature]  # discrete variables
sns.set_style("whitegrid")     # use the whitegrid theme
plt.figure(figsize=(8, 10))
for i, item in enumerate(df_):
    plt.subplot(4, 2, i + 1)   # assumes at most 8 discrete features (a 4 x 2 grid)
    sns.countplot(x=item, data=df_, palette="Pastel1")
    plt.xlabel(str(item), fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(fontsize=13)
    plt.yticks(fontsize=13)
plt.tight_layout()
plt.show()
Single-valued variables
A single-valued variable has only one distinct value across all samples. Such a feature carries no information, so you can usually drop it directly.
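A minimal sketch of dropping single-valued columns, on a made-up frame (`policyCode` here is just an illustrative column name):

```python
import pandas as pd

# Made-up frame standing in for `train`
train = pd.DataFrame({
    'loanAmnt': [1000, 2000, 1500],
    'policyCode': [1, 1, 1],        # same value in every row
    'grade': ['A', 'B', 'A'],
})

# Collect and drop columns with a single unique value
unique_feature = [c for c in train.columns if train[c].nunique() == 1]
train = train.drop(columns=unique_feature)
print(unique_feature)        # ['policyCode']
print(list(train.columns))   # ['loanAmnt', 'grade']
```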
Categorical features
# Categorical columns (everything that is not numerical)
category_feature = list(filter(lambda x: x not in numerical_feature, list(train.columns)))
category_feature
Visualizing the categorical features
df_category = train[category_feature]
sns.set_style("whitegrid")  # use the whitegrid theme
plt.figure(figsize=(10, 10))
for i, item in enumerate(df_category):
    plt.subplot(len(category_feature), 1, i + 1)
    sns.countplot(x=item, data=df_category)
    plt.xlabel(str(item), fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(fontsize=13)
    plt.yticks(fontsize=13)
plt.tight_layout()
plt.show()
Count the number of samples in each category:
for col in category_feature:
    print(train[col].value_counts())
    print()
3. Distribution of the label
See whether the label classes are balanced.
If the number of samples per class differs too much, the dataset is imbalanced. Imbalanced samples make it hard to train a correct model and to evaluate it reasonably.
label = train.isDefault
label.value_counts() / len(label)
Visualization
sns.countplot(x=label)
If the class proportions differ greatly and the samples are imbalanced, consider resampling (e.g. under- or over-sampling) in later steps.
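One simple resampling option is random undersampling of the majority class. This is a sketch on a made-up imbalanced frame, not a recommendation over alternatives such as oversampling or SMOTE:

```python
import pandas as pd

# Made-up imbalanced frame standing in for `train` (10 positives, 90 negatives)
train = pd.DataFrame({
    'loanAmnt': range(100),
    'isDefault': [1] * 10 + [0] * 90,
})

pos = train[train['isDefault'] == 1]
neg = train[train['isDefault'] == 0]

# Shrink the majority class to the minority size, then shuffle the result
neg_down = neg.sample(n=len(pos), random_state=0)
balanced = pd.concat([pos, neg_down]).sample(frac=1, random_state=0)
print(balanced['isDefault'].value_counts())  # 10 of each class
```

Undersampling discards data, so it suits large datasets; for small ones, oversampling the minority class is usually preferred.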
Distribution of categorical features by label
train_loan_fr = train.loc[train['isDefault'] == 1]
train_loan_nofr = train.loc[train['isDefault'] == 0]
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
# Distribution of grade when the target is 1
train_loan_fr.groupby("grade").size().plot.bar(ax=ax1)
# Distribution of grade when the target is 0
train_loan_nofr.groupby("grade")["grade"].count().plot.bar(ax=ax2)
# Distribution of employmentLength when the target is 1
train_loan_fr.groupby("employmentLength").size().plot.bar(ax=ax3)
# Distribution of employmentLength when the target is 0
train_loan_nofr.groupby("employmentLength")["employmentLength"].count().plot.bar(ax=ax4)
plt.xticks(rotation=90)
View differences between positive and negative samples
Split the data set into positive and negative samples to compare the distribution of each variable:
train_positive = train[train['isDefault'] == 1]
train_negative = train[train['isDefault'] != 1]
f, ax = plt.subplots(len(numerical_feature), 2, figsize=(10, 80))
for i, col in enumerate(numerical_feature):
    sns.distplot(train_positive[col], ax=ax[i, 0], color="blue")
    ax[i, 0].set_title("positive")
    sns.distplot(train_negative[col], ax=ax[i, 1], color='red')
    ax[i, 1].set_title("negative")
plt.subplots_adjust(hspace=1)
4. Missing values
Too many missing values can hurt the overall model results, so check for missing values before each modeling run. Any gaps found here will need to be filled during later feature engineering.
# Drop the label column
X_missing = train.drop(['isDefault'], axis=1)
# Count missing values per feature
missing = X_missing.isna().sum()
missing = pd.DataFrame(data={'feature': missing.index, 'missing_count': missing.values})
# Keep only features that actually have missing values (missing_count != 0)
missing = missing[~missing['missing_count'].isin([0])]
# Missing ratio
missing['missing_ratio'] = missing['missing_count'] / X_missing.shape[0]
missing
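When it comes to actually filling the gaps later, a common simple strategy (an assumption here, not the author's prescribed method) is the median for numerical columns and the mode for categorical ones:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps standing in for `train`
df = pd.DataFrame({
    'annualIncome': [50000.0, np.nan, 80000.0, 60000.0],
    'grade': ['A', 'B', None, 'A'],
})

# Median is robust to skew; mode keeps the most frequent category
df['annualIncome'] = df['annualIncome'].fillna(df['annualIncome'].median())  # median = 60000
df['grade'] = df['grade'].fillna(df['grade'].mode()[0])                      # mode = 'A'
print(df.isna().sum().sum())  # 0
```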
Visualization
# Missing ratio per column
(train.isnull().sum() / len(train)).plot.bar(figsize=(20, 6), color=['#d6ecf0', '#a3d900', '#88ada6', '#ffb3a7', '#cca4e3', '#a1afc9'])
5. Outliers
In statistics, if a data distribution is approximately normal, about 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
def find_outliers_by_3sigma(data, fea):
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    data[fea + '_outliers'] = data[fea].apply(
        lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data

data_train = train.copy()
for fea in numerical_feature:
    data_train = find_outliers_by_3sigma(data_train, fea)
    print(data_train[fea + '_outliers'].value_counts())
    print(data_train.groupby(fea + '_outliers')['isDefault'].sum())
    print('*' * 10)
Visualization
import matplotlib.pyplot as plt
train[numerical_feature].boxplot(figsize=(20, 6))
6. Correlations
View the correlation coefficient between each feature and the target:
train[numerical_feature].corr()["isDefault"].sort_values()
Visualization
f, ax = plt.subplots(1, 1, figsize=(20, 20))
cor = train[numerical_feature].corr()
sns.heatmap(cor, annot=True, linewidths=0.2, linecolor="white", ax=ax, fmt=".1g")
Highly correlated features
Filter feature pairs whose pairwise correlation exceeds 0.6:
# Show feature pairs with correlation above 0.6
def getHighRelatedFeatureDf(corr_matrix, corr_threshold):
    highRelatedFeatureDf = pd.DataFrame(corr_matrix[corr_matrix > corr_threshold].stack().reset_index())
    highRelatedFeatureDf.rename({'level_0': 'feature_x', 'level_1': 'feature_y', 0: 'corr'}, axis=1, inplace=True)
    # Drop self-correlations (feature paired with itself)
    highRelatedFeatureDf = highRelatedFeatureDf[highRelatedFeatureDf.feature_x != highRelatedFeatureDf.feature_y]
    # Deduplicate symmetric pairs (a, b) and (b, a) via a sorted key
    highRelatedFeatureDf['feature_pair_key'] = highRelatedFeatureDf.loc[:, ['feature_x', 'feature_y']].apply(
        lambda r: '#'.join(np.sort(r.values)), axis=1)
    highRelatedFeatureDf.drop_duplicates(subset=['feature_pair_key'], inplace=True)
    highRelatedFeatureDf.drop(['feature_pair_key'], axis=1, inplace=True)
    return highRelatedFeatureDf

getHighRelatedFeatureDf(train[numerical_feature].corr(), 0.6)
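Once high-correlation pairs are found, a common follow-up (an added sketch, not part of the original walkthrough) is to drop one feature from each pair, e.g. via the upper triangle of the absolute correlation matrix:

```python
import numpy as np
import pandas as pd

# Synthetic frame: 'b' nearly duplicates 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    'a': a,
    'b': a * 2 + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.6).any()]
print(to_drop)  # ['b']
```

Dropping only one member of each pair keeps the information while removing the redundancy that can destabilize linear models.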