Introduction
When a dataset is several gigabytes or larger, how do you check where structured data has missing values? Scanning an Excel sheet by eye is not practical. This is where the missingno library comes in.
The missingno library is very powerful!
The following code shows how simple and useful missingno is.
Code
Data set extraction code: 1234
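# This dataset is 6.19 GB
import missingno as mg
import pandas as pd
import numpy as np

try:
    # Loading from the cached parquet file is very fast
    dtrain = pd.read_parquet('dtrain.parquet', engine='auto')
except FileNotFoundError:
    # No parquet cache yet: fall back to the original CSV
    dtrain = pd.read_csv('../input/jane-street-market-prediction/train.csv', index_col='ts_id')
    # Reduce memory usage by casting float64 columns to float32
    dtrain = dtrain.astype({
        c: np.float32 for c, t in dtrain.dtypes.items() if t == np.float64})
    # Save in parquet format so the next run loads quickly
    dtrain.to_parquet('dtrain.parquet')
print('Data loading complete')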
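To see what the float64 to float32 cast buys, here is a minimal sketch on a small synthetic table. The column names and the helper `downcast_floats` are made up for illustration and are not part of the original notebook. Keep in mind that float32 keeps only about 7 significant digits, so the cast trades precision for memory.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical column names).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["f0", "f1", "f2", "f3"])
demo["label"] = rng.integers(0, 2, size=1000)  # integer column, left untouched


def downcast_floats(df):
    """Return a copy of df with every float64 column cast to float32."""
    mapping = {c: np.float32 for c, t in df.dtypes.items() if t == np.float64}
    return df.astype(mapping)


small = downcast_floats(demo)

# Compare the memory used by the column data before and after the cast.
before = demo.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(f"before: {before} bytes, after: {after} bytes")
```

The float columns shrink to half their size, while non-float columns keep their original dtype.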
# Visualize missing values on a subset of the data
# dtrain is a pandas DataFrame; sample(5000) draws 5000 random rows from it
mg.matrix(dtrain.sample(5000))
In the matrix plot, white marks a missing value: the more white lines a column has, the more values it is missing!
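The matrix plot gives a visual impression; if you also want numbers, the fraction of missing values per column can be computed with plain pandas. This is a small sketch on a toy DataFrame with made-up columns, not the Jane Street data:

```python
import numpy as np
import pandas as pd

# Toy table with hypothetical columns: "b" is complete, "c" is mostly missing.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# isna() marks missing cells as True; the column mean is the missing fraction.
missing_ratio = demo.isna().mean().sort_values(ascending=False)
print(missing_ratio)
```

Columns near the top of this ranking are the ones that would show the most white in the matrix plot.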
Generate a heatmap to show how the missingness of one feature relates to the missingness of another
mg.heatmap(dtrain,figsize=(16,16))
When the correlation value is 1, a missing value in one feature column means the other feature column is missing in the same row as well. A value of -1 means the opposite: when one column is missing, the other is present.
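The values in the heatmap can be understood as correlations between the missing/present indicators of each pair of columns. The sketch below reproduces that idea with plain pandas on a toy DataFrame (the column names are hypothetical); it is meant to illustrate how the numbers arise, not to mirror missingno's internal code.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    # "y" is missing in exactly the same rows as "x"
    "y": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],
    # "z" is missing exactly where "x" is present
    "z": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
})

# Turn each cell into 1 (missing) or 0 (present), then correlate the columns.
nullity = demo.isnull().astype(int)
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

Here `x` and `y` are always missing together, so their correlation is 1, while `x` and `z` are never missing together, so their correlation is -1.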
# Draw a dendrogram that groups columns with similar missingness (first 5000 rows)
mg.dendrogram(dtrain.iloc[:5000, :])