Python: Visualizing missing values with the missingno library

Introduction

When you are faced with several gigabytes of structured data, or even more, how do you check where values are missing? You can't realistically scan an Excel sheet by eye. This is where the missingno library comes in: it is a powerful tool for visualizing missing data.
The following shows how simple and useful the missingno library is.

Code

Data set extraction code: 1234

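# This dataset is 6.19 GB
import missingno as mg
import pandas as pd
import numpy as np

try:
    # Loading from Parquet is very fast
    dtrain = pd.read_parquet('dtrain.parquet', engine='auto')
except Exception:
    # No Parquet copy yet (or it could not be read): fall back to the original CSV
    dtrain = pd.read_csv('../input/jane-street-market-prediction/train.csv', index_col='ts_id')
    # Reduce memory usage by downcasting float64 columns to float32
    dtrain = dtrain.astype({c: np.float32 for c, t in dtrain.dtypes.items() if t == np.float64})
    # Save in Parquet format so later runs load quickly
    dtrain.to_parquet('dtrain.parquet')
print('Data loading complete')
# Visualize missing values on a subset of the data
# dtrain is a DataFrame; sample(5000) draws 5000 random rows from it
mg.matrix(dtrain.sample(5000))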
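
The float64-to-float32 downcast in the loading code is what shrinks the memory footprint. A minimal sketch of its effect, using a small random DataFrame as a hypothetical stand-in for dtrain (the column names f0, f1, f2 are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dtrain: 1000 rows of float64 features
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])

# Bytes used by the column data (index excluded)
bytes_before = df.memory_usage(index=False).sum()

# Same downcast as in the loading code: float64 columns become float32
df32 = df.astype({c: np.float32 for c, t in df.dtypes.items() if t == np.float64})
bytes_after = df32.memory_usage(index=False).sum()

print(bytes_before, bytes_after)
```

The column data takes half the space after the downcast. The trade-off is precision: float32 keeps only about 7 significant decimal digits, so this is worth checking against the needs of the downstream model.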

In the matrix plot, white lines mark missing values: the more white a column shows, the more values it is missing.
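
To put numbers behind the picture, the share of missing values per column can be computed with plain pandas. This is a small sketch on a made-up DataFrame (the feature names are hypothetical), not part of missingno itself:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for dtrain
df = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, np.nan],
    "feature_b": [np.nan, np.nan, 1.0, 2.0],
    "feature_c": [1.0, 2.0, 3.0, 4.0],
})

# isnull() gives a True/False table; the column mean is the fraction missing
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

Columns with a high ratio here are the ones that would show the most white in the matrix plot.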

Generate a heatmap to show how missing values are related between features

mg.heatmap(dtrain, figsize=(16, 16))

A value of 1 in the heatmap means that whenever one feature is missing, the other feature is always missing as well. A value of -1 means the opposite: whenever one feature is missing, the other is always present.
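
The numbers behind such a heatmap can be approximated with plain pandas by correlating the missing-value indicators of each pair of columns. The sketch below uses a made-up DataFrame with hypothetical feature names; it illustrates the idea and is not a copy of missingno's internals:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for dtrain
df = pd.DataFrame({
    # feature_a and feature_b are missing in exactly the same rows
    "feature_a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "feature_b": [1.0, np.nan, 2.0, np.nan, 4.0],
    # feature_c is missing exactly where feature_a is present
    "feature_c": [np.nan, 1.0, np.nan, 2.0, np.nan],
})

# 1 where a value is missing, 0 where it is present
nullity = df.isnull().astype(int)

# Pearson correlation between the missing-value indicators
nullity_corr = nullity.corr()
print(nullity_corr)
```

Here feature_a and feature_b correlate at 1 because they are always missing together, while feature_a and feature_c correlate at -1 because one is missing exactly when the other is present.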

# Draw the dendrogram (missingno was imported as mg above)
mg.dendrogram(dtrain.iloc[:5000, :])

Origin blog.csdn.net/weixin_46649052/article/details/112920759