Loan Default Prediction (Tianchi Learning Competition): Task 2 Data Analysis

Because of a scheduling conflict with a mathematical modeling competition, this post mainly follows the official tutorial. I plan to improve it later!

1. Content introduction

  • Overall understanding of data:
    • Read the data set and check its size and original feature dimensions;
    • Check the data types with info();
    • Take a rough look at the basic statistics of each feature in the data set;
  • Missing and unique values:
    • View data missing values
    • View unique value characteristics
  • A closer look at data types
    • Categorical data
    • Numerical data
      • Discrete numeric data
      • Continuous numeric data
  • Correlation between data
    • Relationships among features
    • Relationships between features and the target variable
  • Use pandas_profiling to generate data reports
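
As a rough illustration of the type breakdown and the feature-to-target relationship listed in the outline, here is a minimal sketch on a toy frame. The column names, the `target` label, and the `DISCRETE_MAX_UNIQUE` cutoff are all hypothetical and chosen only for the example; they are not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "grade": ["A", "B", "A", "C", "B", "A"],                      # text, categorical
    "term": [3, 5, 3, 5, 3, 5],                                   # numeric, few distinct values
    "amount": [1000.0, 2500.0, 1200.0, 4000.0, 1800.0, 3300.0],   # numeric, many distinct values
    "target": [0, 1, 0, 1, 0, 1],                                 # hypothetical label column
})

features = toy.drop(columns="target")

# Split features into numeric and non-numeric (categorical) columns.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in features.columns if c not in numeric_cols]

# Assumed rule of thumb: a numeric column with very few distinct values is treated as discrete.
DISCRETE_MAX_UNIQUE = 3
discrete_cols = [c for c in numeric_cols if features[c].nunique() <= DISCRETE_MAX_UNIQUE]
continuous_cols = [c for c in numeric_cols if c not in discrete_cols]

# Pearson correlation of each numeric feature with the label.
corr_with_target = toy[numeric_cols + ["target"]].corr()["target"].drop("target")

print(categorical_cols, discrete_cols, continuous_cols)
print(corr_with_target)
```

The cutoff for "discrete" is a judgment call; on real data it is usually worth inspecting the value counts of borderline columns before settling on one.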

2. Code examples

  • Import the libraries needed for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')
  • Read file
data_train = pd.read_csv('./train.csv')
data_test_a = pd.read_csv('./testA.csv')
  • General understanding

View the number of samples and original feature dimensions of the data set

data_test_a.shape
data_train.shape
data_train.columns
data_train.info()
data_train.describe()
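# DataFrame.append was removed in pandas 2.0; pd.concat shows the first and last three rows together
pd.concat([data_train.head(3), data_train.tail(3)])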
  • View feature missing values, unique values, etc. in the data set
  1. View missing values
print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
There are 22 columns in train dataset with missing values.

As the output above shows, 22 feature columns in the training set contain missing values. Next, take a closer look at the features whose missing rate is greater than 50%.

have_null_fea_dict = (data_train.isnull().sum()/len(data_train)).to_dict()
fea_null_moreThanHalf = {}
for key,value in have_null_fea_dict.items():
    if value > 0.5:
        fea_null_moreThanHalf[key] = value
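
The same filter can be written more compactly. The sketch below uses a hypothetical toy frame (columns `a`, `b`, `c` are made up for illustration) so the result is easy to check by hand; it is one possible formulation, not the tutorial's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],         # 25% missing
    "c": [1.0, 2.0, 3.0, 4.0],            # no missing values
})

# isnull().mean() is equivalent to isnull().sum() / len(df): the NaN fraction per column.
missing_ratio = toy.isnull().mean()

# Keep only the columns whose missing ratio exceeds the 50% threshold.
more_than_half = missing_ratio[missing_ratio > 0.5].to_dict()

print(more_than_half)
```

On this toy frame only column `a` passes the threshold.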

Inspect the features with missing values and their missing rates

# Visualize the NaN ratio of each feature
missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
  • Looking column by column shows which columns contain NaN and how many. The main purpose is to check whether a column's NaN count is really large. If a column is mostly NaN, it likely has almost no effect on the label, and you can consider deleting it. If only a few values are missing, you can generally fill them in.
  • You can also compare row by row. If most columns are missing for some samples and the data set is large enough, consider deleting those samples.
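
The column-wise and row-wise checks described above could be sketched as follows. The toy frame, the two 0.5 thresholds, and the choice of median filling are assumptions made for illustration; the right cutoffs and fill strategy depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "mostly_nan": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "x": [1.0, 2.0, np.nan, 4.0, np.nan],
    "y": [10.0, 20.0, 30.0, np.nan, np.nan],
})

COL_THRESHOLD = 0.5   # assumed cutoff: drop columns with more than 50% NaN
ROW_THRESHOLD = 0.5   # assumed cutoff: drop rows with more than 50% NaN

# Column-wise: fraction of NaN per column, then keep the columns under the cutoff.
col_ratio = toy.isnull().mean()
kept_cols = toy.loc[:, col_ratio <= COL_THRESHOLD]

# Row-wise: fraction of NaN per sample across the remaining columns.
row_ratio = kept_cols.isnull().mean(axis=1)
kept_rows = kept_cols.loc[row_ratio <= ROW_THRESHOLD]

# Fill the few remaining gaps; the median is just one common choice.
filled = kept_rows.fillna(kept_rows.median())

print(filled)
```

Dropping columns before rows means a nearly empty column does not inflate every sample's missing ratio.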

Tip: LightGBM (lgb), a go-to model in competitions, can handle missing values automatically. Task 4, the modeling task, looks at the model in more detail!

Find the features that take only one value in the training set and in the test set

one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
one_value_fea
['policyCode']
one_value_fea_test
['policyCode']
print(f'There are {len(one_value_fea)} columns in train dataset with one unique value.')
print(f'There are {len(one_value_fea_test)} columns in test dataset with one unique value.')
There are 1 columns in train dataset with one unique value.
There are 1 columns in test dataset with one unique value.
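
One detail worth knowing: `nunique()` ignores NaN by default, so a column holding a single value plus missing entries also counts as "one unique value". The sketch below shows the difference on a hypothetical toy frame; only `policyCode` is a name from the data set, and the other columns are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only policyCode is a real column name from the data set.
train = pd.DataFrame({
    "policyCode": [1.0, 1.0, 1.0, 1.0],            # truly constant
    "const_with_nan": [5.0, np.nan, 5.0, np.nan],  # one value plus NaN
    "amount": [100.0, 200.0, 300.0, 400.0],        # varies
})

# Default behaviour: NaN is not counted as a value.
one_value = [c for c in train.columns if train[c].nunique() <= 1]

# Counting NaN as its own value keeps only the truly constant columns.
one_value_strict = [c for c in train.columns if train[c].nunique(dropna=False) <= 1]

# A truly constant column carries no signal, so it can be dropped.
reduced = train.drop(columns=one_value_strict)

print(one_value)
print(one_value_strict)
print(list(reduced.columns))
```

Whether a "value or missing" column is worth keeping is a separate question, since the missingness itself may be informative.
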
  • Use pandas_profiling to generate data reports
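# Note: newer releases of this package are published as ydata-profiling,
# where the import is `import ydata_profiling` instead.
import pandas_profiling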
pfr = pandas_profiling.ProfileReport(data_train)
pfr.to_file("./example.html")

Origin blog.csdn.net/xylbill97/article/details/108668454