Simple and Easy to Use: 4 Tools for Automated Data Analysis with Pandas

When we first get a dataset, we usually use statistical or visual methods to understand the raw data: the number of rows and columns, the distribution of values, missing values, correlations between columns, and so on. This process is called EDA (Exploratory Data Analysis).

Many EDA tools can already generate basic statistics and charts automatically, and the ones covered here are widely regarded as good tools.

This article introduces and compares four commonly used EDA tools that let you explore data while writing almost no code.

Before formally introducing these tools, let's load the dataset:

import pandas as pd

# Load the iris dataset from a local CSV file
iris = pd.read_csv('iris.csv')
iris  # in a notebook, this displays the DataFrame

iris is the dataset used below; it is a DataFrame with 150 rows and 4 columns.
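
As a quick sanity check before opening any tool, the size and column types can be confirmed directly in pandas. The sketch below uses a tiny hand-made stand-in for iris.csv so that it runs without the file; the sample values and the `overview` helper are illustrative assumptions, not part of the original article.

```python
import pandas as pd

# Tiny stand-in for iris.csv; the values are illustrative only.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 6.0, 5.1],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

def overview(df: pd.DataFrame) -> dict:
    """Return the basic facts EDA starts from: size and column types."""
    n_rows, n_cols = df.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

info = overview(sample)
print(info["rows"], info["columns"])
```

Calling `overview(iris)` on the real DataFrame would report its actual row and column counts in the same way.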

1. PandasGUI

PandasGUI provides data preview, filtering, statistics, a variety of charts, and data transformation.

# Install
# pip install pandasgui
from pandasgui import show

show(iris)

[Figure: PandasGUI interface]

PandasGUI focuses on data display: it offers more than 10 chart types, all configurable through the visual interface.

However, its statistics are fairly basic. It does not report missing values or correlation coefficients, and the data transformation section exposes only a few operations.

2. Pandas Profiling

Pandas Profiling provides an overview of the whole dataset, details for each column, plots of the relationships between columns, and correlation coefficients between columns.

# Install:
# pip install -U pandas-profiling
# jupyter nbextension enable --py widgetsnbextension

from pandas_profiling import ProfileReport

profile = ProfileReport(iris, title='iris Pandas Profiling Report', explorative=True)
profile

[Figure: Pandas Profiling interface]

The details for each column include missing-value statistics, distinct-value counts, minimum and maximum values, the mean and other statistics, plus a histogram of the value distribution.
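
To make these per-column indicators concrete, here is a minimal pandas/NumPy sketch that computes the same kinds of figures for one numeric column. It is not Pandas Profiling's own implementation; the `profile_column` helper and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

def profile_column(col: pd.Series, bins: int = 5) -> dict:
    """Summarise one numeric column: missing, distinct, extremes, mean, histogram."""
    values = col.dropna()
    counts, _edges = np.histogram(values, bins=bins)
    return {
        "missing": int(col.isna().sum()),
        "distinct": int(values.nunique()),
        "min": float(values.min()),
        "max": float(values.max()),
        "mean": float(values.mean()),
        "histogram": counts.tolist(),  # number of values per equal-width bin
    }

# Illustrative values with one missing entry; not taken from iris.csv.
sepal_width = pd.Series([3.5, 3.0, None, 3.0, 2.5], name="sepal_width")
stats = profile_column(sepal_width, bins=2)
print(stats)
```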

Four correlation coefficients between columns are supported: Spearman, Pearson, Kendall, and Phik.
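
For a feel of how these coefficients differ, pandas itself can compute some of them. The sketch below is plain pandas, not Pandas Profiling's code, and uses made-up data in which y rises monotonically but not linearly with x. pandas also accepts method="kendall" (which may need SciPy installed); Phik is not built into pandas.

```python
import pandas as pd

# y grows monotonically but not linearly with x (illustrative data).
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

numeric = df.select_dtypes(include="number")
pearson = numeric.corr(method="pearson")    # measures linear association
spearman = numeric.corr(method="spearman")  # rank-based, measures monotonic association

print(pearson.loc["x", "y"], spearman.loc["x", "y"])
```

Because the ranks of x and y agree perfectly, the Spearman coefficient reaches 1, while the Pearson coefficient stays below 1 since the relationship is not a straight line.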

Compared with PandasGUI, Pandas Profiling has fewer chart types but provides far more statistical indicators and correlation coefficients.

3. Sweetviz

Sweetviz is similar to Pandas Profiling: for each column it provides detailed statistics, the value distribution, missing-value statistics, and correlation coefficients with other columns.

# Install
# pip install sweetviz

import sweetviz as sv

sv_report = sv.analyze(iris)
sv_report.show_html()

[Figure: Sweetviz interface]

Sweetviz also has a very useful feature: it can compare different datasets, for example a training set against a test set.

[Figure: Sweetviz dataset comparison]

Blue and orange represent the two datasets, so the differences between them are easy to spot.
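
Sweetviz builds this comparison report itself (its `compare` function takes two DataFrames). To show the underlying idea only, the sketch below puts the column means of two datasets side by side in plain pandas. The `compare_means` helper and the train/test values are hypothetical and far simpler than what Sweetviz reports.

```python
import pandas as pd

def compare_means(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Put the per-column means of two datasets side by side."""
    # Only numeric columns present in both datasets are compared.
    cols = train.select_dtypes(include="number").columns.intersection(test.columns)
    result = pd.DataFrame({
        "train_mean": train[cols].mean(),
        "test_mean": test[cols].mean(),
    })
    result["difference"] = result["test_mean"] - result["train_mean"]
    return result

# Illustrative split; the numbers are made up for the example.
train_df = pd.DataFrame({"sepal_width": [3.0, 3.5, 2.5], "petal_length": [1.0, 2.0, 3.0]})
test_df = pd.DataFrame({"sepal_width": [3.0, 3.0], "petal_length": [4.0, 6.0]})

comparison = compare_means(train_df, test_df)
print(comparison)
```

A large value in the difference column hints that the two datasets are distributed differently for that feature.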

4. dtale

Finally, the highlight: dtale. It not only provides rich charts for displaying data but also many interactive controls for manipulating and transforming it.

[Figure: dtale interface]

dtale's features fall into three main groups: data operations, data visualization, and highlighting.

4.1 Data Operations (Actions)

dtale wraps pandas functions in a visual interface, so we can manipulate data through a GUI.

# pip install dtale

import dtale

d = dtale.show(iris)
d.open_browser()

[Figure: Actions menu]

The right half of the figure is Chrome's automatic Chinese translation of the left half; some of the translations are not very accurate.

Here is an example of a data operation.

[Figure: Summarize Data]

The figure above shows the Summarize Data function in the Actions menu, which provides an interface for aggregating a dataset.

In the figure above, we group by the species column and compute the mean of the sepal_width column. In the lower left corner, dtale shows the pandas code it generated automatically for this operation.
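
Written by hand, the same aggregation looks roughly like the sketch below. This is an equivalent pandas expression, not necessarily the exact code dtale exports, and the small `iris_sample` frame is a made-up stand-in for the real data.

```python
import pandas as pd

# Small stand-in for the iris data; values are illustrative.
iris_sample = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
})

# Group by species and average sepal_width, as configured in the dialog.
mean_width = iris_sample.groupby("species")["sepal_width"].mean()
print(mean_width)
```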

4.2 Data Visualization (Visualize)

dtale provides a rich set of charts, along with an overview of each column, duplicate rows, missing values, and correlation coefficients.

[Figure: Visualize menu]

Here is an example of data visualization.

[Figure: Describe]

The figure above shows the Describe function in the Visualize menu, which computes the minimum, maximum, mean, standard deviation and other statistics for each column and displays them in charts.

On the right, under Code Export, you can see the code that generates these statistics.

4.3 Highlight

dtale can highlight missing values and outliers so that abnormal data can be located quickly.

[Figure: Highlight menu]

[Figure: sepal_width column with outliers highlighted]

The figure above shows the outliers in the sepal_width column.
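
A common way to flag such outliers is the 1.5 × IQR rule. The sketch below assumes that rule purely for illustration; the thresholds dtale actually applies may differ, and the `iqr_outliers` helper and sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    mask = (col < q1 - k * iqr) | (col > q3 + k * iqr)
    return col[mask]

# Illustrative values with one obviously extreme entry.
sepal_width = pd.Series([3.0, 3.1, 2.9, 3.2, 3.0, 4.4], name="sepal_width")
outliers = iqr_outliers(sepal_width)
print(outliers)
```

The returned Series keeps the original index, so the flagged rows can be looked up in the full DataFrame.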

dtale is very powerful and has many more features worth exploring.

Finally, a brief summary. If your exploration focuses on data display, choose PandasGUI; if you only need basic statistical indicators, choose Pandas Profiling or Sweetviz; if you need in-depth data exploration, choose dtale.

Origin blog.csdn.net/weixin_38037405/article/details/123702729