1.3 View summary statistics for each feature in the dataset
train.describe()
2 Missing and unique values
2.1 View data missing values
# Number of columns that contain missing values
train.isnull().any().sum()
# Check whether any column has more than half of its values missing
have_null_fea_dict = (train.isnull().sum() / len(train)).to_dict()
fea_null = {}
for k, v in have_null_fea_dict.items():
    if v > 0.5:
        fea_null[k] = v
print(fea_null)
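The same check can be written more compactly with `isnull().mean()`, since the mean of a boolean column is the fraction of `True` entries. The following is a minimal sketch on toy data; the `demo` frame and the `missing_rate_above` helper are hypothetical names used only for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_rate_above(df, threshold=0.5):
    """Return {column: missing rate} for columns whose missing rate exceeds threshold."""
    rates = df.isnull().mean()  # mean of booleans = fraction of missing values
    return rates[rates > threshold].to_dict()

# Toy data (hypothetical), standing in for `train`
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],  # 3 of 4 values missing
    "b": [1.0, 2.0, np.nan, 4.0],        # 1 of 4 values missing
    "c": [1, 2, 3, 4],                   # no missing values
})

result = missing_rate_above(demo)
print(result)
```

Lowering `threshold` (for example to 0.2) would also report column `b`, which can help when deciding which features to drop and which to impute.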
2.2 View missing features and missing rate
missing = train.isnull().sum() / len(train)
miss = missing[missing > 0]  # keep only features that have missing values
miss = miss.sort_values(ascending=True)
miss.plot.bar()
2.3 View features that take only one value in the training set and test set
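One way to find such features is to count the distinct values in each column with `nunique()`. The sketch below is an assumption about how this step could look, not the original code; the `one_value_features` helper and the toy `demo_train` / `demo_test` frames are hypothetical.

```python
import pandas as pd

def one_value_features(df):
    """List columns that hold at most one distinct non-missing value."""
    return [col for col in df.columns if df[col].nunique() <= 1]

# Toy frames (hypothetical), standing in for the training and test sets
demo_train = pd.DataFrame({"id": [1, 2, 3], "flag": [0, 0, 0], "grade": ["A", "B", "A"]})
demo_test = pd.DataFrame({"id": [4, 5], "flag": [0, 0], "grade": ["C", "C"]})

one_value_train = one_value_features(demo_train)
one_value_test = one_value_features(demo_test)
print(one_value_train)
print(one_value_test)
```

A column that is constant in one set but not in the other (like `grade` in the toy test set) is worth a closer look before dropping it.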
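2.4 View which features are numerical and which are object (categorical) types
numerical_fea = list(train.select_dtypes(exclude=['object']).columns)  # numerical features
category_fea = list(filter(lambda x: x not in numerical_fea, list(train.columns)))  # object-type features
print(numerical_fea)
print(category_fea)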
2.5 Analysis of numerical variables, including continuous variables and discrete variables
2.5.1 Split numerical variables into continuous and discrete variables
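A common heuristic for this split is to treat a numerical feature as discrete when it has only a few distinct values. The sketch below assumes a cutoff of 10 distinct values; the cutoff, the `split_numerical_features` helper, the toy `demo` frame, and the name `numerical_noserial_fea` are assumptions for illustration. Only `numerical_serial_fea` appears in the code that follows.

```python
import pandas as pd

def split_numerical_features(df, features, max_discrete_values=10):
    """Split numerical features into (continuous, discrete) by distinct-value count."""
    continuous, discrete = [], []
    for fea in features:
        # Few distinct values -> treat as discrete (assumed cutoff, tune per dataset)
        if df[fea].nunique() <= max_discrete_values:
            discrete.append(fea)
        else:
            continuous.append(fea)
    return continuous, discrete

# Toy data (hypothetical), standing in for `train` and `numerical_fea`
demo = pd.DataFrame({
    "amount": [i * 1.5 for i in range(20)],  # 20 distinct values
    "term": [3, 5] * 10,                     # 2 distinct values
})
numerical_serial_fea, numerical_noserial_fea = split_numerical_features(
    demo, ["amount", "term"]
)
print(numerical_serial_fea)
print(numerical_noserial_fea)
```

On the real data this would be called as `split_numerical_features(train, numerical_fea)`.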
2.5.3 Numerical continuous variable analysis - distribution visualization for each numerical feature
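f = pd.melt(train, value_vars=numerical_serial_fea)
g = sns.FacetGrid(f, col='variable', col_wrap=4, sharex=False, sharey=False)
# sns.distplot is deprecated since seaborn 0.11; histplot with a KDE overlay is the closest replacement
g = g.map(sns.histplot, 'value', kde=True, stat='density')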
2.6 Analysis of non-numerical categorical variables
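For categorical features, a simple starting point is the frequency of each category. The sketch below is one possible approach under stated assumptions: the `category_summary` helper, the toy `demo` frame, and its column names are hypothetical and not taken from the original notebook.

```python
import pandas as pd

def category_summary(df, category_features):
    """Return {feature: value counts, including missing values} per categorical feature."""
    return {fea: df[fea].value_counts(dropna=False) for fea in category_features}

# Toy data (hypothetical), standing in for `train` and `category_fea`
demo = pd.DataFrame({
    "grade": ["A", "B", "A", None],
    "amount": [1.0, 2.0, 3.0, 4.0],
})
summary = category_summary(demo, ["grade"])
print(summary["grade"])
```

On the real data this would be called as `category_summary(train, category_fea)`; passing `dropna=False` keeps missing values visible as their own row in the counts.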