Exploratory Data Analysis (EDA)

1. Overall understanding of the data

Basic summary information

head()      # View the first rows of the data (defaults to 5)

tail()      # View the last rows of the data (defaults to 5)

info()      # View basic information about each column (dtype, non-null count, memory)

describe()  # View summary statistics (count, mean, std, quartiles)

shape       # Number of samples and feature dimension, e.g. train.shape

columns     # View the feature (column) names
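A minimal sketch of running these checks end to end (the file name train.csv is a placeholder for wherever the training data actually lives):

import pandas as pd

train = pd.read_csv('train.csv')  # placeholder path to the training data
print(train.shape)      # number of samples and number of features
print(train.columns)    # feature names
train.info()            # dtype and non-null count of every column
train.describe()        # summary statistics of the numerical columns
train.head()            # first 5 rows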

When the data has too many columns to display, pandas omits the middle ones.

Adding a transpose (.T) makes a wide frame easier to read when its columns are truncated:

Such as: head().T

Display all rows/columns and set the length of value display

# Set the display width of values to 200 (default is 50)
pd.set_option('display.max_colwidth', 200)
# Show all columns (no column limit)
pd.set_option('display.max_columns', None)
# Show all rows (no row limit)
pd.set_option('display.max_rows', None)

 

2. Understanding of data types

Features are generally divided into categorical features and numerical features, and numerical types are further divided into continuous and discrete types.

Numerical features can be fed into a model directly, but risk-control practitioners often bin them, convert the bins into WOE (Weight of Evidence) codes, and then build a standard scorecard on top. From the model's point of view, binning mainly reduces variable complexity, dampens the impact of noise in the variables, and strengthens the relationship between the independent variables and the target, which makes the model more stable.
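As a hedged illustration of the binning-plus-WOE idea (a minimal sketch only; the choice of loanAmnt, the five quantile bins, and the WOE sign convention are assumptions, not part of the original post):

import numpy as np
import pandas as pd

# Bin a numeric column into quantiles and compute WOE per bin
# (assumes `train` is loaded and `isDefault` is the 0/1 label)
bins = pd.qcut(train['loanAmnt'], q=5, duplicates='drop')
grouped = train.groupby(bins)['isDefault'].agg(['sum', 'count'])
grouped['bad'] = grouped['sum']                       # defaults in the bin
grouped['good'] = grouped['count'] - grouped['sum']   # non-defaults in the bin
# WOE = ln( (bad_i / total_bad) / (good_i / total_good) ); the sign convention varies
grouped['woe'] = np.log((grouped['bad'] / grouped['bad'].sum()) /
                        (grouped['good'] / grouped['good'].sum()))
print(grouped[['good', 'bad', 'woe']])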
 

Numerical features

# Numerical columns (everything that is not an object dtype)
numerical_feature = list(train.select_dtypes(exclude=['object']).columns)
numerical_feature

Numerical features can be further divided into continuous variables, discrete variables, and single-valued variables:

# Continuous variables
serial_feature = []
# Discrete variables
discrete_feature = []
# Single-valued variables
unique_feature = []

for fea in numerical_feature:
    temp = train[fea].nunique()  # number of unique values
    if temp == 1:
        unique_feature.append(fea)
    # treat a feature with 10 or fewer distinct values as discrete
    elif temp <= 10:
        discrete_feature.append(fea)
    else:
        serial_feature.append(fea)

Continuous variables

Check the distribution of each continuous variable to see whether it is approximately normal. If it is not, apply a log transform and check the distribution again.

Why this matters: transforming the data toward normality can help some models converge faster, some models assume roughly normal inputs (e.g. GMM, KNN), and heavily skewed data may distort the model's predictions.
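For example (a minimal sketch, assuming loanAmnt is a right-skewed numeric column in train), compare skewness before and after a log transform:

import numpy as np

# Skewness closer to 0 means a more symmetric, more normal-looking distribution
print('raw skew:  ', train['loanAmnt'].skew())
print('log1p skew:', np.log1p(train['loanAmnt']).skew())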

Visually check for normality.

# Visualize the distribution of every continuous feature
f = pd.melt(train, value_vars=serial_feature)
g = sns.FacetGrid(f, col="variable", col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

View the distribution of a single continuous variable

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(1, figsize=(8, 5))
sns.distplot(train['feature_name'], bins=40)  # replace 'feature_name' with the column to plot
plt.xlabel('feature_name')

View the distribution of a continuous variable split by the label:

# Distribution of loanAmnt for each value of the label
import seaborn as sns
label = train.isDefault  # the 0/1 target, also defined in the label section below
sns.kdeplot(train.loanAmnt[label[label==1].index], label='1', shade=True)
sns.kdeplot(train.loanAmnt[label[label==0].index], label='0', shade=True)
plt.xlabel('loanAmnt')
plt.ylabel('Density');

 Check the distribution of annualIncome:

plt.figure(1 , figsize = (8 , 5))
sns.distplot(train['annualIncome'])
plt.xlabel('annualIncome')

Discrete variables

Number of distinct values of each discrete variable

for f in discrete_feature:
    print(f, 'number of distinct values:', train[f].nunique())

Visualization

import seaborn as sns
import matplotlib.pyplot as plt

df_ = train[discrete_feature]  # discrete variables
sns.set_style("whitegrid")     # use the whitegrid theme
plt.figure(figsize=(8, 10))    # a 4x2 grid of subplots (assumes up to 8 discrete features)
for i, item in enumerate(df_):
    plt.subplot(4, 2, i + 1)
    ax = sns.countplot(x=item, data=df_, palette="Pastel1")
    plt.xlabel(str(item), fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(fontsize=13)
    plt.yticks(fontsize=13)
    plt.tight_layout()
plt.show()

Single-valued variables

A single-valued variable has only one distinct value. Since a feature whose values are all identical carries no information, it can simply be dropped.
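A minimal sketch of dropping them (kept in a new DataFrame here so the later examples, which still use train, keep working):

# Drop columns that contain only a single distinct value
train_reduced = train.drop(columns=unique_feature)
train_reduced.shape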

Categorical features

# Categorical (non-numerical) features
category_feature = list(filter(lambda x: x not in numerical_feature, list(train.columns)))
category_feature

Visualizing the categorical features

df_category = train[['label']]

sns.set_style("whitegrid")    # use the whitegrid theme
color = sns.color_palette()
plt.figure(figsize=(10, 10))  # a 2x1 grid of subplots
for i, item in enumerate(df_category):
    plt.subplot(2, 1, i + 1)
    ax = sns.countplot(x=item, data=df_category)
    plt.xlabel(str(item), fontsize=14)
    plt.ylabel('Count', fontsize=14)
    plt.xticks(fontsize=13)
    plt.yticks(fontsize=13)
    plt.tight_layout()
plt.show()

Count the number of each category

for i in train[['label']]:
    print(train[i].value_counts())
    print()

3. Distribution of the label

See if the labels are balanced

If the number of samples in each class differs too much, the dataset is imbalanced. Imbalanced samples make it hard to build and train a sound model and to evaluate it reasonably.

label=train.isDefault             
label.value_counts()/len(label)

Visualization

sns.countplot(x=label)

If the class proportions differ greatly, the sample is imbalanced; in that case, consider resampling (over- or under-sampling) in later steps, as in the sketch below.
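A minimal sketch of random under-sampling (an illustration only; the 1:1 ratio and random_state are assumptions, not part of the original post):

import pandas as pd

# Randomly under-sample the majority class down to the size of the minority class
minority = train[train['isDefault'] == 1]
majority = train[train['isDefault'] == 0].sample(n=len(minority), random_state=42)
train_balanced = pd.concat([minority, majority]).sample(frac=1, random_state=42)  # shuffle
print(train_balanced['isDefault'].value_counts())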

Distribution of categorical features by label

train_loan_fr = train.loc[train['isDefault'] == 1]
train_loan_nofr = train.loc[train['isDefault'] == 0]

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 8))
# Distribution of grade when the target is 1
train_loan_fr.groupby("grade").size().plot.bar(ax=ax1)
# Distribution of grade when the target is 0
train_loan_nofr.groupby("grade")["grade"].count().plot.bar(ax=ax2)
# Distribution of employmentLength when the target is 1
train_loan_fr.groupby("employmentLength").size().plot.bar(ax=ax3)
# Distribution of employmentLength when the target is 0
train_loan_nofr.groupby("employmentLength")["employmentLength"].count().plot.bar(ax=ax4)
plt.xticks(rotation=90);

View the differences between positive and negative samples

Split the dataset into two parts by positive and negative samples and compare the distribution of each variable

train_positive = train[train['isDefault'] == 1]
train_negative = train[train['isDefault'] != 1]
f, ax = plt.subplots(len(numerical_feature), 2, figsize=(10, 80))
for i, col in enumerate(numerical_feature):
    sns.distplot(train_positive[col], ax=ax[i, 0], color="blue")
    ax[i, 0].set_title("positive")
    sns.distplot(train_negative[col], ax=ax[i, 1], color='red')
    ax[i, 1].set_title("negative")
plt.subplots_adjust(hspace=1)

4. Missing values

Too many missing values will affect the overall model results to some degree, so check for missing values before every modeling run. Any missing values found here need to be imputed during the subsequent feature engineering.

# Drop the label column
X_missing = train.drop(['isDefault'], axis=1)

# Count the missing values
missing = X_missing.isna().sum()
missing = pd.DataFrame(data={'feature': missing.index, 'missing_count': missing.values})
# Use ~ to negate: keep only the rows whose missing count is not 0
missing = missing[~missing['missing_count'].isin([0])]
# Missing ratio
missing['missing_ratio'] = missing['missing_count'] / X_missing.shape[0]
missing

Visualization

# Missing ratio of every column as a bar chart
(train.isnull().sum()/len(train)).plot.bar(figsize = (20,6),color=['#d6ecf0','#a3d900','#88ada6','#ffb3a7','#cca4e3','#a1afc9'])
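A minimal sketch of the imputation deferred to feature engineering (median for numeric columns and mode for categorical ones is an assumption, not the post's prescribed method):

# Fill numeric NaNs with the column median and categorical NaNs with the most frequent value
for col in train.columns:
    if train[col].isna().any():
        if train[col].dtype == 'object':
            train[col] = train[col].fillna(train[col].mode()[0])
        else:
            train[col] = train[col].fillna(train[col].median())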

 

5. Outliers

In statistics, if a distribution is approximately normal, about 68% of the values fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations (the 3-sigma rule).

import numpy as np

def find_outliers_by_3sigma(data, fea):
    data_std = np.std(data[fea])
    data_mean = np.mean(data[fea])
    outliers_cut_off = data_std * 3
    lower_rule = data_mean - outliers_cut_off
    upper_rule = data_mean + outliers_cut_off
    # flag values more than 3 standard deviations away from the mean
    data[fea + '_outliers'] = data[fea].apply(
        lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
    return data

data_train = train.copy()
for fea in numerical_feature:
    data_train = find_outliers_by_3sigma(data_train, fea)
    print(data_train[fea + '_outliers'].value_counts())
    print(data_train.groupby(fea + '_outliers')['isDefault'].sum())
    print('*' * 10)

Visualization

import matplotlib.pyplot as plt
# Box plots of the numerical features (one box per column)
train[numerical_feature].plot(kind='box', figsize=(20, 6), rot=90)
plt.show()

6. Correlation analysis

View the correlation coefficient of each feature with the target.

train[numerical_feature].corr()["isDefault"].sort_values()

Visualization

f, ax = plt.subplots(1, 1, figsize=(20, 20))
cor = train[numerical_feature].corr()
sns.heatmap(cor, annot=True, linewidths=0.2, linecolor="white", ax=ax, fmt=".1g")

Highly correlated feature pairs

Filter feature pairs whose pairwise correlation is above 0.6

# Show variable pairs whose correlation is above the threshold
def getHighRelatedFeatureDf(corr_matrix, corr_threshold):
    highRelatedFeatureDf = pd.DataFrame(corr_matrix[corr_matrix > corr_threshold].stack().reset_index())
    highRelatedFeatureDf.rename({'level_0': 'feature_x', 'level_1': 'feature_y', 0: 'corr'}, axis=1, inplace=True)
    highRelatedFeatureDf = highRelatedFeatureDf[highRelatedFeatureDf.feature_x != highRelatedFeatureDf.feature_y]
    highRelatedFeatureDf['feature_pair_key'] = highRelatedFeatureDf.loc[:, ['feature_x', 'feature_y']].apply(lambda r: '#'.join(np.sort(r.values)), axis=1)
    highRelatedFeatureDf.drop_duplicates(subset=['feature_pair_key'], inplace=True)
    highRelatedFeatureDf.drop(['feature_pair_key'], axis=1, inplace=True)
    return highRelatedFeatureDf

getHighRelatedFeatureDf(train[numerical_feature].corr(), 0.6)


Reprinted from: blog.csdn.net/qq_21402983/article/details/126071556