A complete collection of visualization tools for Pandas DataFrame


Introduction

One of the benefits of Excel is that it provides an intuitive and powerful graphical interface to view your data. In contrast, pandas + Jupyter notebook provides a lot of programming power, but limited ability to graphically display and manipulate DataFrame views.

In the Python ecosystem, there are several tools designed to fill this gap. They range in complexity from simple JavaScript libraries to complex, full-featured data analysis engines. One thing in common is that they all provide a way to view and optionally filter data in a graphical format. Starting from this common ground, they differ greatly in design and function.

This article will review a few of these DataFrame visualization options in order to give you an overview and evaluate which ones might be useful to your analysis process.

Background

For this article, we will use a sample sales dataset. Below is a view of the data in a Jupyter notebook.

import pandas as pd
url = 'https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True'
df = pd.read_excel(url)
df

ac648f1d9368fef8d2434c6b67bb5063.png

Below is a similar view in Excel with filters applied to all columns.

77e423ba3fb231ed7dd01bbe374a8e2d.png

This familiar view in Excel makes it easy to see all your data. You can inspect data by filtering and sorting, and drill down to details when needed. This type of functionality is most useful when you are exploring a new dataset or solving a new problem with an existing dataset.

Obviously, this is not feasible for data with millions of rows. However, even if you have large datasets and are a pandas expert, you probably still dump DataFrames into Excel to look at subsets of the data.

Part of the reason I use Excel+Python is that the ad hoc ability to inspect data in Excel is so much better than the normal DataFrame view.

With this background, let's look at some options for replicating Excel's simple viewing ability.

JavaScript tools

The easiest way is to use a JavaScript library to add some interactivity to the DataFrame view in Jupyter notebook.

Qgrid

The first tool we'll look at is Qgrid from Quantopian. This Jupyter notebook widget uses the SlickGrid component to add interactivity to your DataFrame.

Once it is installed, you can display a version of the DataFrame that supports sorting and filtering data.

import qgrid
import pandas as pd
url = 'https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True'
df = pd.read_excel(url)
widget = qgrid.show_grid(df)
widget

4c54587d4f2be98c0e12495dd12393bd.png

Qgrid supports intuitive filtering using widgets that depend on the underlying data type. Additionally, you can configure some rendering options and then read the selected data back into a DataFrame, which is a pretty useful feature.

Qgrid does not do any visualization, nor does it allow you to use pandas expressions to filter and select data.

Overall, Qgrid works well for simple data manipulation and inspection.
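For comparison, the sorting and filtering that a grid widget exposes through the GUI can be sketched in plain pandas. This is only a minimal sketch and not Qgrid's API: the small DataFrame and its column names (name, quantity, unit price) are assumptions standing in for the sales data.

```python
import pandas as pd

# Hypothetical stand-in for the sales data; the column names are assumptions.
sales = pd.DataFrame(
    {
        "name": ["Acme", "Birch", "Acme", "Cedar"],
        "quantity": [10, 25, 40, 5],
        "unit price": [9.99, 4.50, 12.00, 30.00],
    }
)

# Roughly what clicking a column header does: sort by one column, descending.
sorted_sales = sales.sort_values("quantity", ascending=False)

# Roughly what a numeric range filter does: keep rows inside an inclusive range.
filtered_sales = sales[sales["quantity"].between(10, 30)]

print(sorted_sales)
print(filtered_sales)
```

The trade-off is the one described above: the widget is faster for ad hoc inspection, while the pandas version is reproducible and can be kept in the notebook.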

PivottableJs

The next option isn't meant for viewing DataFrames, but I think it's a very useful tool for summarizing data, so I'm including it.

The pivottablejs module uses a pivot table JavaScript library for interactive pivoting and summarizing.

Once it's installed, it's simple to use.

from pivottablejs import pivot_ui
pivot_ui(df)

In this example, I've summed up the number of purchases for each customer by clicking and dragging.

07b55971fc65d448caccc8064ece4194.png

Besides the basic sum function, you can also do some visualization and statistical analysis.

8eeef3bcae2a8d04f249b4eb4b5565bf.png

This widget is not useful for filtering raw DataFrames, but is really powerful for pivoting and summarizing data. One of the nice features is that once you build the pivot table, you can filter the data.

Another downside of this widget is that it doesn't take advantage of any of pandas' pivoting or selection features. Still, pivottablejs is a very useful tool for quick pivoting and summarizing.
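If you want to keep a summary you built by dragging and dropping, the same kind of result can usually be reproduced with pandas' own pivot_table. The sketch below is an illustration under assumptions: the sample rows and the column names (name, sku, quantity) are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumptions.
sales = pd.DataFrame(
    {
        "name": ["Acme", "Birch", "Acme", "Cedar", "Birch"],
        "sku": ["A-1", "A-1", "B-2", "B-2", "B-2"],
        "quantity": [10, 25, 40, 5, 15],
    }
)

# Total quantity purchased per customer (one row per customer).
per_customer = sales.pivot_table(index="name", values="quantity", aggfunc="sum")

# Customers as rows, SKUs as columns; missing combinations become 0.
by_sku = sales.pivot_table(
    index="name", columns="sku", values="quantity", aggfunc="sum", fill_value=0
)

print(per_customer)
print(by_sku)
```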

Data analysis applications

The second category of tools is full-fledged applications, usually built on a web backend such as Flask or as a standalone application based on Qt. These applications vary in complexity and capabilities, from simple tabular views and graphing to powerful statistical analysis. A unique aspect of these tools is that they are tightly integrated with pandas, so you can use pandas code to filter data and interact with the applications.

PandasGUI

The first application I will discuss is PandasGUI. What makes this application unique is that it is a standalone application built with Qt that can be called from a Jupyter notebook.

Using the same data as in the previous example, import and call the show command.

from pandasgui import show
show(df)

If all goes well, you'll end up with a standalone GUI. Because it's a standalone application, you can configure the view quite a bit. For example, I moved several tabs to show more capabilities on one page.

In this example, I'm using pandas query syntax to filter the data to show a customer and purchases > 15.

b496a8529f946890ff5c748e398a82f8.png
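The same kind of filter can be tried directly in a notebook with DataFrame.query. This is a minimal sketch with made-up rows; the column names (name, quantity, unit price) and the customer name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample rows; column names and values are assumptions.
sales = pd.DataFrame(
    {
        "name": ["Acme", "Birch", "Acme", "Acme"],
        "quantity": [10, 25, 40, 16],
        "unit price": [9.99, 4.50, 12.00, 30.00],
    }
)

# One customer, and only purchases of more than 15 units.
result = sales.query("name == 'Acme' and quantity > 15")

# Column names containing spaces must be wrapped in backticks inside query().
expensive = sales.query("`unit price` > 10")

print(result)
print(expensive)
```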

PandasGUI is integrated with Plotly and can also create visualizations. Below is an example of a unit price histogram.

11f2b6b61c378cf03b9275db65366668.png

A nice feature of PandasGUI is that filters work on DataFrames in all tabs. You can use this feature to experiment with different views of the data when plotting or transforming the data.

Another feature of PandasGUI is that you can reshape the data by pivoting or melting. Below is a summary of unit sales by SKU.

c901b7fed8da8c92602d1afe5c323626.png

Here's what the resulting view looks like.

d90fca576ea9eb3cf5695bd457ffe1e7.png
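For readers unfamiliar with melting, here is a small sketch of the underlying pandas operation. It is plain pandas rather than anything PandasGUI-specific, and the wide table with its jan and feb columns is a hypothetical example.

```python
import pandas as pd

# Hypothetical wide table: one row per SKU, one column per month.
wide = pd.DataFrame(
    {
        "sku": ["A-1", "B-2"],
        "jan": [100, 80],
        "feb": [120, 60],
    }
)

# Melt turns the month columns into rows: one row per (sku, month) pair.
long = wide.melt(id_vars="sku", var_name="month", value_name="units")

# With the data in long form, a summary by SKU is a simple groupby.
units_by_sku = long.groupby("sku")["units"].sum()

print(long)
print(units_by_sku)
```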

PandasGUI is an impressive application. I like how it keeps track of all changes and is just a small wrapper around standard pandas functionality. This program is under active development, so I'll be keeping an eye on it to see how it improves and develops over time.

Tabloo

The name Tabloo makes me smile every time I see it. Hopefully a certain business intelligence visualization vendor isn't upset by the similarity of the name!

Regardless, Tabloo provides a simple visualization tool for DataFrames using a Flask backend, as well as plotting capabilities similar to PandasGUI.

Using Tabloo is very similar to PandasGUI.

import tabloo
tabloo.show(df)

e9484f99ceb5ea7c9520312d496523d2.png

Tabloo uses a query syntax like PandasGUI, but I can't figure out how to add multiple filters like PandasGUI does.

Finally, Tabloo does have some basic plotting functionality too, but it's not as rich as PandasGUI.

6cb8c8380f1128ea2dbd84847fbe6e1b.png

Tabloo has some interesting concepts, but doesn't have the same capabilities as PandasGUI. It hasn't been updated in a while, so it might be dormant, but I wanted to include it to get as complete a survey as possible.

Dtale

The last application is Dtale, which is the most complex option. Dtale has a similar architecture to Tabloo, using a Flask backend, but also includes a powerful React frontend. Dtale is a mature project with a lot of documentation and a lot of features. I'll cover only a small subset of the features in this post.

Getting started with Dtale is similar to other apps in this category.

import dtale
dtale.show(df)

47eefb7f7292078ca153d387c50ab4d3.png

This view gives you a hint that Dtale is much more than a DataFrame viewer: it is a very powerful set of statistical tools. I can't cover all of its features here, but here's a quick example showing a histogram of the unit price column.

efccfd9a4a42b8546733a98f4401a07f.png

One feature of Dtale that I really like is that you can export the code and see what it's doing. This is a very powerful feature that differentiates an Excel+Python solution from plain Excel.

Below is an example of exporting code from the visualization above.

# DISCLAIMER: 'df' refers to the data you passed in when calling 'dtale.show'
import numpy as np
import pandas as pd
if isinstance(df, (pd.DatetimeIndex, pd.MultiIndex)):
    df = df.to_frame(index=False)
# remove any pre-existing indices for ease of use in the D-Tale code, but this is not required
df = df.reset_index().drop('index', axis=1, errors='ignore')
df.columns = [str(c) for c in df.columns]  # update columns to strings in case they are numbers
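# the export left the template placeholder '{col}' here; 'unit price' is the column charted above
s = df[~pd.isnull(df['unit price'])][['unit price']]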
chart, labels = np.histogram(s, bins=20)
import scipy.stats as sts
kde = sts.gaussian_kde(s['unit price'])
kde_data = kde.pdf(np.linspace(labels.min(), labels.max()))
# main statistics
stats = df['unit price'].describe().to_frame().T

In addition to filtering data, Dtale also allows you to format the data. In the example below, I've formatted the currency and date columns to make them easier to read.

280f2fefc93c9ffe3fff6c688cf69012.png
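A comparable display-only formatting step can be sketched in pandas itself. This is not what Dtale does internally; it is a hedged illustration in which the sample values and the date column name are assumptions.

```python
import pandas as pd

# Hypothetical sample data; the "date" column name is an assumption.
sales = pd.DataFrame(
    {
        "unit price": [1234.5, 9.99],
        "date": pd.to_datetime(["2018-01-15", "2018-12-03"]),
    }
)

# Work on a copy so the numeric and datetime columns stay usable for analysis.
display_df = sales.copy()

# Currency with a thousands separator and two decimals.
display_df["unit price"] = sales["unit price"].map("${:,.2f}".format)

# Dates as plain ISO-style strings without the time component.
display_df["date"] = sales["date"].dt.strftime("%Y-%m-%d")

print(display_df)
```

Formatting converts the values to strings, which is why the sketch keeps the original DataFrame untouched and formats only the copy used for display.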

As I said earlier, Dtale is a powerful tool with many capabilities. If you're interested, I encourage you to check it out and see if it works for you.

One aspect to be aware of is that you may run into Windows Firewall issues when trying to run Dtale. On a closed corporate machine, this can be a problem. For more details on the various installation options, please refer to the documentation.

Regardless of this issue, I think it's definitely worth checking out Dtale, if only to see all the features it offers.

IDE variable viewer

If you're developing in a tool like VS Code or Spyder, you can use a simple DataFrame variable viewer.

For example, here's a look at our DataFrame using Spyder's Variable Explorer.

d5eb9f768905f049cb94c5ff672fbdbe.png

This viewer is very handy if you use Spyder. You don't have any ability to filter the data in the GUI, but you can change the sort order.

VS Code also has a similar feature. Below is a simple view showing how you can filter the data.

2b834c8c791314fe1495c67f8d99e079.png

These features are useful if you are already working in Spyder or VS Code. However, when it comes to complex filtering or sophisticated data analysis, they come nowhere near Dtale's capabilities.

But I hope VS Code will continue to improve their DataFrame viewer. It seems like VS Code can do pretty much anything these days, so I'm interested to see how this feature evolves.

PyXLL

Running a Jupyter notebook alongside Excel requires the PyXLL package, which is a commercial application. I have no problem with a company developing a commercial product; I think this is critical to the success of the Python ecosystem. However, a paid option means you may need more buy-in to introduce it into your organization. Luckily, you can try it for free for 30 days to see if it fits your needs.

That caveat aside, let's try it with our example dataset.

97e172480fce279a8725857ff215dcf2.png

What's really powerful is that you can have the notebook side by side with Excel and use Jupyter magic commands to exchange data between the notebook and Excel. In this example, using %xl_set df will put the DataFrame directly into the Excel file. You can then work with Excel and Python together.

PyXLL has a lot of different features for integrating Python and Excel, so it is difficult to compare it with the previously discussed frameworks. Overall, I like the idea of using Excel's visual components coupled with the power of Python programming. If you're interested in this combination of Python and Excel, you should definitely take a look at PyXLL.

xlwings

xlwings has been around for a while, and like PyXLL, xlwings is also backed by a commercial company. However, it has an open-source Community Edition, as well as a paid Professional Edition. The examples here use the Community Edition. The full professional xlwings package has several different features to integrate Excel and Python.

While xlwings doesn't integrate directly with Jupyter notebooks, you can populate Excel spreadsheets with DataFrames in real-time and use Excel for analysis.

Below is a short code snippet.

import pandas as pd
import xlwings as xw
url = 'https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True'
df = pd.read_excel(url)
# Create a new workbook and add the DataFrame to Sheet1
xw.view(df)

This code will open a new instance of Excel and put df into cell A1. Here's what it looks like.

438cb69d577d8f63fd7d0dc1d08a7f92.png

This can be a quick shortcut instead of saving and reopening Excel to view your data. It's actually pretty simple to do, so I'll probably try it a little more in my own data analysis.

Summary

This article has covered a lot. Here's a picture summarizing all the options we've discussed.

aec6c7ebd9a95aac5d1827556262e6e6.png

Is there a solution that works for everyone? I do not think so. Part of the reason I wanted to write this post was that I wanted to spark a discussion about the "best" solution. I hope you will take this opportunity to look at some of these solutions and see if they fit your analysis process. Each of these solutions addresses different aspects of the problem in different ways. I would guess that users are likely to combine several of these - depending on the problem they are trying to solve.

I predict that we will continue to see the evolution of this field. I hope we can find a solution that leverages some of the interactive and intuitive aspects of Excel, as well as the power and transparency associated with data manipulation with Python and pandas. With Guido van Rossum joining Microsoft, maybe we'll see more progress in this area?



Origin blog.csdn.net/BF02jgtRS00XKtCx/article/details/125611149