When we first get a dataset, we usually use statistical or visual methods to understand the raw data: the number of rows and columns, the distribution of values, missing values, the correlation between columns, and so on. This process is called EDA (Exploratory Data Analysis).
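For reference, here is roughly what these checks look like when done by hand with pandas. This is a minimal sketch on a small made-up table (the values are illustrative, not the real iris data), and the variable names are chosen only for this example.

```python
import pandas as pd

# Small made-up table standing in for a real dataset (illustrative values only).
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8, None],
    "sepal_width": [3.5, 3.0, 3.3, 2.7, 3.1],
    "species": ["setosa", "setosa", "virginica", "virginica", "setosa"],
})

n_rows, n_cols = df.shape                      # number of rows and columns
missing = df.isna().sum()                      # missing values per column
summary = df.describe()                        # count, mean, std, min, quartiles, max
species_counts = df["species"].value_counts()  # value distribution of a categorical column
corr = df.select_dtypes("number").corr()       # Pearson correlation between numeric columns

print(n_rows, n_cols)
print(missing)
print(summary)
print(species_counts)
print(corr)
```

Each of these calls answers one question about the data; the tools below bundle the same kind of checks into a single report or interface.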
Many EDA tools can already generate basic statistics and charts automatically, and the ones covered here are well regarded in the technical exchange group.
This article introduces and compares four commonly used EDA tools that let you explore data almost without writing code.
Before introducing these tools, let's load the dataset:
import numpy as np
import pandas as pd
iris = pd.read_csv('iris.csv')
iris
iris is the dataset used below, a DataFrame with 150 rows and 4 columns.
1. PandasGUI
PandasGUI provides data preview, filtering, statistics, a variety of charts, and data transformation.
# Install
# pip install pandasgui
from pandasgui import show
show(iris)
PandasGUI operation interface
PandasGUI focuses on data display. It offers more than 10 kinds of charts, all of which can be configured visually.
However, its statistics are relatively simple: it does not report indicators such as missing values or correlation coefficients, and the data transformation part exposes only a small number of operations.
2. Pandas Profiling
Pandas Profiling provides an overview of the whole dataset, details for each column, correlation plots between columns, and correlation coefficients between columns.
# Install:
# pip install -U pandas-profiling
# jupyter nbextension enable --py widgetsnbextension
from pandas_profiling import ProfileReport
profile = ProfileReport(iris, title='iris Pandas Profiling Report', explorative=True)
profile
Pandas Profiling interface
The details for each column include missing-value statistics, distinct counts, minimum and maximum values, the mean and other statistical indicators, and a histogram of the value distribution.
The correlation between columns can be computed with four measures: Spearman, Pearson, Kendall, and Phik.
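To see why a report offers more than one correlation measure, the sketch below computes Pearson and Spearman with plain pandas on made-up data where `y = x**2`. The relationship is perfectly monotonic but not linear, so the two measures disagree. This is an illustration with pandas itself, not the code Pandas Profiling runs internally.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],
})

# Pearson measures linear association; Spearman measures rank (monotonic) association.
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.4f} spearman={spearman:.4f}")
```

Here Spearman is 1 because the ranks match exactly, while Pearson is about 0.98. Kendall can be requested the same way with `method="kendall"` (to my knowledge pandas delegates it to SciPy, so SciPy must be installed); Phik is not built into pandas.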
Compared with PandasGUI, Pandas Profiling has fewer chart types, but it provides a much larger set of statistical indicators and correlation coefficients.
3. Sweetviz
Like Pandas Profiling, Sweetviz provides detailed statistical indicators, value distributions, missing-value statistics, and correlation coefficients between columns for each column.
# Install
# pip install sweetviz
import sweetviz as sv
sv_report = sv.analyze(iris)
sv_report.show_html()
Sweetviz operation interface
Sweetviz also has a very useful feature: it can compare two datasets, for example a training set and a test set.
Sweetviz dataset comparison
Blue and orange represent the two datasets, and the differences between them are easy to spot in the comparison.
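A comparison report like this can be produced with `sv.compare`. The sketch below is one possible way to do it: `split_train_test` and `compare_report` are hypothetical helpers written for this example, the 80/20 split is an assumption, and the output file name is arbitrary.

```python
import pandas as pd


def split_train_test(df, test_fraction=0.2, seed=0):
    """Randomly split a DataFrame into train and test parts (hypothetical helper)."""
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return train, test


def compare_report(train, test, path="compare_report.html"):
    """Build a Sweetviz report comparing two DataFrames (hypothetical helper)."""
    # Imported here so that the split helper can be used without sweetviz installed.
    import sweetviz as sv

    report = sv.compare([train, "Train"], [test, "Test"])
    report.show_html(path)  # writes the HTML report and opens it in the browser
    return report


# Usage with the iris DataFrame loaded earlier:
# train, test = split_train_test(iris)
# compare_report(train, test)
```

The labels passed next to each DataFrame ("Train" and "Test") are the names shown in the report for the two datasets.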
4. dtale
Finally, the highlight of this article: dtale. It not only provides rich charts for displaying data, but also many interactive interfaces for manipulating and transforming data.
dtale operation interface
dtale's functions fall into three main parts: data operations, data visualization, and highlighting.
4.1 Data Operations (Actions)
dtale wraps pandas functions in a visual interface, so that we can manipulate data through a graphical interface.
# pip install dtale
import dtale
d = dtale.show(iris)
d.open_browser()
Actions
In the screenshot, the right half is a Chinese translation of the left half. It was produced automatically by Chrome, so some of the translations are not very accurate.
Here is an example of a data operation.
Summarize Data
The figure above shows the Summarize Data function in the Actions menu, which provides an interface for summarizing a dataset.
In the figure, we group by the species column and calculate the mean of the sepal_width column. In the lower left corner, we can see that dtale has automatically generated the pandas code for this operation.
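The same summary can be written directly in pandas as a group-by followed by a mean. The sketch below is hand-written for illustration, not the code dtale exports, and it uses a small made-up table in place of iris.

```python
import pandas as pd

# Made-up rows standing in for the iris data (illustrative values only).
df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
})

# Group by species and compute the mean sepal_width of each group.
mean_width = df.groupby("species")["sepal_width"].mean()
print(mean_width)
```

With the real dataset, replacing `df` with `iris` gives one mean per species, provided the column names match those used here.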
4.2 Data Visualization (Visualize)
This menu provides richer charts, along with an overview of each column, duplicate rows, missing values, and correlation coefficients.
Visualize
Here is an example of data visualization.
Describe
The picture above shows the Describe function in the Visualize menu. It computes the minimum, maximum, mean, standard deviation, and other indicators for each column, and displays them in a chart.
The Code Export on the right shows the code that generates these results.
4.3 Highlight
This menu highlights missing values and outliers, so that we can quickly locate abnormal data.
Highlight
The image above highlights the outliers in the sepal_width field.
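This article does not state which criterion dtale uses to flag outliers. A common choice is the 1.5 × IQR rule, which the sketch below assumes; `iqr_outliers` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Made-up measurements with one unusually large value at the end.
sepal_width = pd.Series([3.0, 3.1, 2.9, 3.2, 3.0, 2.8, 3.1, 4.4], name="sepal_width")

mask = iqr_outliers(sepal_width)
print(sepal_width[mask])         # rows flagged as outliers
print(sepal_width.isna().sum())  # number of missing values in the column
```

Only the last value (4.4) falls outside the fences for this sample. A larger `k` makes the rule more tolerant.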
dtale is very powerful and has many more functions that are worth exploring.
Finally, a brief summary. If your exploration focuses on displaying data, choose PandasGUI; if you only need basic statistical indicators, choose Pandas Profiling or Sweetviz; if you need in-depth data exploration, choose dtale.