Fully automated exploratory data analysis with just a few lines of Python code

Exploratory data analysis (EDA) is an important part of data science model development and dataset research. When a new dataset arrives, a lot of time first goes into EDA to understand the information it contains. Automated EDA packages for Python can do this work in a few lines of code. This article collects 10 Python packages that run EDA automatically and generate insights about the data, looking at what each one offers and how far it goes toward automating EDA.

  1. D-Tale
  2. Pandas-Profiling
  3. Sweetviz
  4. AutoViz
  5. Dataprep
  6. Klib
  7. Dabl
  8. SpeedML
  9. DataTile
  10. edaviz

1、D-Tale

D-Tale uses a Flask backend and a React frontend, and integrates seamlessly with IPython notebooks and the terminal. It supports pandas DataFrame, Series, MultiIndex, DatetimeIndex and RangeIndex objects.

import dtale
import pandas as pd
dtale.show(pd.read_csv("titanic.csv"))

picture

D-Tale generates a report with a single line of code. The report includes an overall summary of the dataset, correlations, charts and heatmaps, and it highlights missing values. Each chart in the report can be analyzed further; as the screenshot above shows, the charts are interactive.
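
To make the "correlations" part of such a report concrete, here is a minimal pandas-only sketch of a correlation matrix, the table that a correlation heatmap is drawn from. It is not D-Tale's own code; the toy DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, loosely shaped like the Titanic columns)
df = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Fare": [80.0, 30.0, 8.0, 7.0],
    "Name": ["A", "B", "C", "D"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

Text columns such as Name are dropped before the calculation, which is why a correlation view normally shows only the numeric features.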

2、Pandas-Profiling

Pandas-Profiling generates summary reports for a pandas DataFrame. It extends the pandas DataFrame with df.profile_report() and works well on large datasets, creating reports in seconds.

# Install pandas and pandas-profiling before importing
# (newer releases of the package are published as ydata-profiling)
import pandas as pd
from pandas_profiling import ProfileReport

# EDA using pandas-profiling
profile = ProfileReport(pd.read_csv('titanic.csv'), explorative=True)

# Save the results to an HTML file
profile.to_file("output.html")

3、Sweetviz

Sweetviz is an open-source Python library that generates attractive visualizations with just two lines of Python code and delivers the EDA (Exploratory Data Analysis) as a standalone HTML application. The package is built around quickly visualizing target values and comparing datasets.

import pandas as pd
import sweetviz as sv

# EDA using Sweetviz
sweet_report = sv.analyze(pd.read_csv("titanic.csv"))

#Saving results to HTML file
sweet_report.show_html('sweet_report.html')

Sweetviz reports contain an overall summary of the dataset, correlations, and associations between categorical and numerical features, among other things.
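
The idea of comparing datasets can be sketched with pandas alone. The snippet below is only an illustration of a side-by-side summary, not how Sweetviz builds its report; the `compare_numeric` helper and the two toy frames are hypothetical.

```python
import pandas as pd

def compare_numeric(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Side-by-side mean and missing-value share for numeric columns found in both frames."""
    shared = [c for c in train.select_dtypes("number").columns if c in test.columns]
    return pd.DataFrame({
        "train_mean": train[shared].mean(),
        "test_mean": test[shared].mean(),
        "train_missing": train[shared].isna().mean(),
        "test_missing": test[shared].isna().mean(),
    })

# Toy train/test splits with made-up values
train = pd.DataFrame({"Age": [22.0, 38.0, None, 35.0], "Fare": [7.25, 71.28, 7.92, 53.10]})
test = pd.DataFrame({"Age": [34.0, 47.0, 62.0], "Fare": [7.83, 7.0, None]})

report = compare_numeric(train, test)
print(report)
```

A large gap between the train and test columns of such a table is the kind of drift a comparison report is meant to surface.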

4、AutoViz

The AutoViz package automatically visualizes datasets of any size with one line of code and can generate reports in HTML, Bokeh and other formats. Users can interact with the HTML reports it generates.

import pandas as pd
from autoviz.AutoViz_Class import AutoViz_Class

#EDA using Autoviz
autoviz = AutoViz_Class().AutoViz('train.csv')

5、Dataprep

Dataprep is an open-source Python package for analyzing, preparing and processing data. It is built on top of pandas and Dask DataFrames and integrates easily with other Python libraries.

Dataprep is the fastest of the 10 packages, generating reports for pandas/Dask DataFrames in seconds.

from dataprep.datasets import load_dataset
from dataprep.eda import create_report

df = load_dataset("titanic")  # name of a bundled dataset, not a file path
create_report(df).show_browser()

6、Klib

klib is a Python library for importing, cleaning, analyzing and preprocessing data.

import klib
import pandas as pd

df = pd.read_csv('DATASET.csv')
klib.missingval_plot(df)

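# df_cleaned is not defined in the original; assumed here to come from klib.data_cleaning
df_cleaned = klib.data_cleaning(df, clean_col_names=False)  # keep the original column names
klib.corr_plot(df_cleaned, annot=False)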

klib.dist_plot(df_cleaned['Win_Prob'])

klib.cat_plot(df, figsize=(50,15))

Although klib provides many analysis functions, each analysis needs its own line of code, so it is only semi-automatic. That makes it very convenient, however, when a more customized analysis is needed.
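
As a rough idea of the numbers behind a missing-value plot, the share of missing entries per column can be computed directly in pandas. This is a sketch under the assumption of a small made-up DataFrame; it does not reproduce klib's plotting or its internals.

```python
import pandas as pd

# Toy data with gaps (hypothetical values)
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Fraction of missing entries per column, worst column first
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)
```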

7、Dabl

Dabl focuses less on statistical measures of individual columns and more on providing a quick overview through visualizations, and convenient machine learning preprocessing and model search.

The plot() function in dabl visualizes the data by drawing several kinds of charts, including:

  • Target distribution plot
  • Scatter plots
  • Linear discriminant analysis

import pandas as pd
import dabl

df = pd.read_csv("titanic.csv")
dabl.plot(df, target_col="Survived")
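
For orientation, the quantity behind a target distribution plot can be reproduced with plain pandas. The snippet below is a sketch on a made-up five-row frame and is not dabl's implementation; only the column names Sex and Survived are borrowed from the Titanic data.

```python
import pandas as pd

# Toy data (hypothetical rows)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 0, 1],
})

# Share of each target class
target_dist = df["Survived"].value_counts(normalize=True)

# Mean of the target per category, i.e. the survival rate by group
rate_by_sex = df.groupby("Sex")["Survived"].mean()

print(target_dist)
print(rate_by_sex)
```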

8、SpeedML

SpeedML is a Python package for jump-starting machine learning pipelines. It integrates several commonly used ML packages, including Pandas, NumPy, Scikit-learn, XGBoost and Matplotlib, so it covers much more than automated EDA.

According to the SpeedML project, it supports iterative development and cuts coding time by 70%.

from speedml import Speedml

sml = Speedml('../input/train.csv', '../input/test.csv',
            target = 'Survived', uid = 'PassengerId')
sml.train.head()

sml.plot.correlate()

sml.plot.distribute()

sml.plot.ordinal('Parch')

sml.plot.ordinal('SibSp')

sml.plot.continuous('Age')

9、DataTile

DataTile (formerly known as pandas-summary) is an open-source Python package for managing, summarizing and visualizing data. It is essentially an extension of the pandas DataFrame describe() function.

import pandas as pd
from datatile.summary.df import DataFrameSummary

df = pd.read_csv('titanic.csv')
dfs = DataFrameSummary(df)
dfs.summary()
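
What "extending describe()" can look like is sketched below with pandas only. The extra columns (missing, uniques, dtype) are chosen for illustration and are not claimed to match DataTile's exact output; the toy DataFrame is made up.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

# Start from describe() for all columns, one row per column
summary = df.describe(include="all").T

# Add per-column facts that describe() does not report directly
summary["missing"] = df.isna().sum()
summary["uniques"] = df.nunique()
summary["dtype"] = df.dtypes.astype(str)

print(summary[["count", "missing", "uniques", "dtype"]])
```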

10、edaviz

edaviz is a Python library for data exploration and visualization in Jupyter Notebook and JupyterLab. It was very easy to use, but it was later acquired by Databricks and integrated into bamboolib, so only a brief demo is shown here.

picture

Summary

Dataprep is the EDA package I use most often. AutoViz and D-Tale are also good choices. If you need customized analysis, you can use Klib. SpeedML bundles many things, so it is not especially well suited to EDA alone. The other packages can be chosen according to personal preference and are still very useful. Finally, edaviz can be left out of consideration because it is no longer open source.

Origin blog.csdn.net/weixin_52051554/article/details/130301548