Data Visualization | That’s right! Plotting like ggplot2 can also be done in Python

Table of contents

Part 1 Introduction

Part 2 Powerful plotnine library

Part 3 plotnine data visualization basics

1. Graphic grammar

2. Aesthetic mapping parameters

Part 4 Use plotnine to realize data visualization

1. Dumbbell chart

2. Slope chart

Part 5 Summary


Part 1 Introduction

As the saying goes, “a picture is worth a thousand words.” Good data visualization helps you see the phenomena and patterns hidden in data. Any discussion of data visualization tends to mention R, a leader in the field, and in particular its ggplot2 package. The "gg" in ggplot2 stands for "grammar of graphics", and the idea behind the package is to draw with a grammar. Having seen how powerful it is, the author wondered whether Python has something similar.

The answer is yes. If you are more comfortable with Python and have real visualization needs, the plotnine library is a good choice. It is close to being a Python version of ggplot2, and its syntax is largely the same. plotnine was developed by Hassan Kibirige on top of matplotlib, pandas, and other libraries, with the aim of giving Python users a simple, flexible, and powerful plotting library. With plotnine we can use core ggplot2 concepts such as layers and mappings to create many kinds of charts. Below we introduce the basic usage of plotnine and walk through two complete examples built with it.

All Python code in this article was written in Jupyter Notebook, running inside the Visual Studio Code (VS Code) development environment.

Part 2 Powerful plotnine library

Unlike the underlying matplotlib library, with its complicated drawing rules, plotnine adopts a grammar of graphics, which makes plotting code more elegant and regular. A figure is built in a logical order, and because the library builds figures from layers, we can create and customize many kinds of charts more flexibly.

The figures below show several kinds of visualizations we drew with plotnine.

picture

Figure 1 Slope chart

picture

Figure 2 Violin plot

picture

Figure 3 Kernel density estimate plot

picture

Figure 4 Percentage stacked bar chart

picture

Figure 5 Scatter plots faceted by column, with fitted curves

picture

Figure 6 Dumbbell chart

The figures above show six kinds of charts. Although they belong to different categories, all of them are drawn with the same grammar of graphics, so the logic of the code is consistent. In plotnine, every chart is a stack of layers, and the layers are joined with a plus sign (+). This is why its plotting syntax reads so smoothly.

For reasons of space, we focus on how to draw the dumbbell chart and slope chart shown below, and explain along the way how to use plotnine for data visualization.

picture

Figure 7 Example of dumbbell chart

picture

Figure 8 Example of slope chart

First, what can a dumbbell chart and a slope chart each tell us about the data?

A dumbbell chart displays differences in data. It is suited to comparing the difference or change between two groups, for example two points in time or two categories. It looks like a dumbbell: a circle at each end joined by a line segment. The circles mark the value of each group and the segment marks the gap between them, so the difference between the two groups within each category can be seen at a glance.

A slope chart can be thought of as a line chart with multiple groups. With many groups, a traditional line chart becomes hard to read because overlapping lines are difficult to tell apart. A slope chart emphasizes the start and end point of each line, which greatly reduces the clutter that comes with more groups. At the same time it keeps the function of a line chart: the line between the two endpoints of each group still shows how the data changed and in which direction.

Before drawing, let's cover the basics of plotting with plotnine (to skip the basics, go to Part 4).

Part 3 plotnine data visualization basics

1. Graphic grammar

First, let's look at the logic underlying plotnine. The grammar of graphics has two main characteristics. First, it uses a layered design: a complex chart is split into multiple layers. Starting from the ggplot() function, each layer is attached with a plus sign (+), and a layer added later is drawn on top of the earlier ones. Second, the core of data visualization is the data. The grammar separates the data from the graphical details, which are defined on their own. This separation makes plotting more flexible: once the data is in the right format, we only need to attend to the look and presentation of the figure, without going back to process and transform the data again.

The information that plotnine requires to draw a chart is as follows:

  1. ggplot(data, mapping=aes()): the lowest-level function, marking the start of a plot. The data parameter is the data to use, usually a DataFrame; the mapping parameter is the default aesthetic mapping for the plot, specified with aes(), which assigns variables to the axes and controls color, size, and so on.

  2. geom_*() or stat_*(): the former represents geometric objects and the latter statistical transformations; either one creates a layer. Most charts can be drawn with geom_*(), for example geom_point() for scatter plots and geom_line() for line charts. A geom_*() function can also apply a statistical transformation through its stat parameter; for geoms such as geom_point() and geom_line() the default is no transformation. A stat_*() function applies the transformation it is named after, and the kind of chart to draw is then given through its geom parameter.

Note that a geom_*() function and a stat_*() function used as a pair usually produce the same result, for example:

# 1. Use geom_*() and set the stat parameter
(ggplot(df, aes(x='class', y='value')) + 
 geom_point(stat='summary', ……))

# 2. Use stat_*() and set the geom parameter
(ggplot(df, aes(x='class', y='value')) + 
 stat_summary(geom='point', ……))

The two snippets above do the same thing: both draw points, and both apply the summary statistical transformation.

If the result is the same, why distinguish the two? It may seem unnecessary, but the two have different emphases. A geom_*() function fixes the type of chart (geom_point(), for example, draws points), and within that type different plots are obtained by choosing different statistical transformations; the focus is the geometric object. A stat_*() function fixes the statistical transformation (stat_summary(), for example), and within that transformation different charts are obtained by choosing different geometric objects; the focus is the statistical transformation.

plotnine also accepts some optional information when drawing, shown in the following table:

Optional information | Effect
scale_*() | Sets the scale, which controls how data values are mapped to aesthetics. Functions of this kind take the data and adjust it to suit different visual properties such as length, color, size, and shape.
theme() | Sets the theme; mainly used to adjust details of the chart such as the background, grid lines, and colors.
coord_*() | Sets the coordinate system. Note that, unlike ggplot2, plotnine currently supports only Cartesian coordinate systems; polar and geospatial coordinate systems are not supported.

2. Aesthetic mapping parameters

When introducing the grammar of graphics in the previous section, we mentioned an important parameter of ggplot() and similar functions: mapping=aes(). It describes the mapping from data to aesthetics, which is the core of visualization, and the mapping is written inside aes() as a set of parameters. plotnine provides a large number of aesthetic mapping parameters, enough to cover a wide range of plotting needs. Commonly used ones are as follows:

Aesthetic mapping parameter | Effect
x | Data for the x-axis
y | Data for the y-axis
color / colour | Color of points, lines, and the outlines of filled areas
fill | Color of filled areas
alpha | Transparency, between 0 and 1; 0 is fully transparent and 1 is fully opaque
shape | Shape of points
linetype | Type of line
size | Size of the object

The shape parameter has many options. The most common is shape='o', which draws circular points. Other common choices are as follows:

Symbol | Point shape
. | Point marker
, | Pixel marker
s | Square marker
p | Pentagon marker
* | Star marker

Some common choices for the linetype parameter are as follows:

Symbol | Line type
blank | Blank line
solid | Solid line
dashed | Dashed line
dotted | Dotted line

Note that not every aesthetic mapping parameter applies to every function. For example, setting linetype in geom_point() when drawing a scatter plot is naturally not valid.

Visualization is a broad topic and this article cannot cover everything; its aim is to help readers get started with the grammar of graphics. For further plotting needs, see the official plotnine documentation.

Part 4 Use plotnine to realize data visualization

We have covered the basics; now let's see how to use plotnine to draw a dumbbell chart and a slope chart.

1. Dumbbell chart

The data used in this section is "2022 年各省绿色农业企业数量统计.xlsx" (number of green agricultural enterprises by province, 2022), which contains the numbers of green agricultural enterprises that entered and exited in 31 provinces (including municipalities). Opened in WPS, it looks like this:

picture

This data comes from the Zhejiang University Carter-Qiyan China Agricultural Research Database (CCAD). For more details, please go to

https://r.qiyandata.com/data_centre/CCAD

"2022 年各省绿色农业企业数量统计.xlsx" comes from the "Agricultural-related Subjects-Featured Statistical Database" in CCAD. CCAD is a large database for agriculture-related research, jointly launched by QiYan Data and the Zhejiang University China Rural Development Research Institute ("Carter") to support the national rural revitalization strategy and to serve academic research and think-tank work on "agriculture, rural areas and farmers".

First, we import the data, sort it in ascending order by the "企业进入数量" (number of enterprise entries) field, and use pd.Categorical() to convert the "省份" (province) field to a Categorical type. The code is as follows:

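import pandas as pd

# df is assumed to have already been loaded from the Excel file with pd.read_excel()
df = df.sort_values(by='企业进入数量', ascending=True)
# pd.Categorical(values, categories=None, ordered=None)
df['省份'] = pd.Categorical(df['省份'], categories=df['省份'], ordered=True)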

In pd.Categorical(), the values parameter is the data to convert to a categorical type, which can be a list, array, or Series. The categories parameter specifies the categories; if omitted, they are inferred from the unique values in the data. The ordered parameter specifies whether the categories are ordered. Here the categories are passed in sorted order, so the provinces will appear on the y-axis ordered by number of entries.

Because the entry and exit counts need to be mapped to fill color when drawing, the original wide table must be converted to a long table. The code is as follows:
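
The effect of the sort-then-convert step can be checked on a tiny example. The sketch below uses invented province codes and counts (not the CCAD data) with English column names standing in for the original fields.

```python
import pandas as pd

# Hypothetical toy data: three provinces and their entry counts
toy = pd.DataFrame({"province": ["A", "B", "C"],
                    "entries": [30, 10, 20]})

# Sort by the count, then freeze that order into the categorical
toy = toy.sort_values(by="entries", ascending=True)
toy["province"] = pd.Categorical(toy["province"],
                                 categories=toy["province"],
                                 ordered=True)

# The category order now follows the sorted counts, not the alphabet
print(list(toy["province"].cat.categories))  # ['B', 'C', 'A']
```

Plotting libraries that respect categorical order will then lay out the axis in this order rather than alphabetically.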

df = pd.melt(df, id_vars=['省份', '年份'], var_name='Enter_or_Quit', value_name='Number')
df['Enter_or_Quit'] = df['Enter_or_Quit'].map(lambda x: 'Enter' if x == '企业进入数量' else 'Quit')
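
What the wide-to-long conversion does is easiest to see on a small table. The following sketch applies the same pd.melt() call to invented data; the English column names and the numbers are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical wide table: one row per province, one column per measure
wide = pd.DataFrame({"province": ["A", "B"],
                     "year": [2022, 2022],
                     "entries": [30, 10],
                     "exits": [12, 4]})

# Melt: the two measure columns become rows, labelled by Enter_or_Quit
long_df = pd.melt(wide, id_vars=["province", "year"],
                  var_name="Enter_or_Quit", value_name="Number")
long_df["Enter_or_Quit"] = long_df["Enter_or_Quit"].map(
    {"entries": "Enter", "exits": "Quit"})

print(long_df)
```

Each province now contributes two rows, one per category, which is the shape needed to map the category to fill color.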

At this point the data is ready. Now let's draw the chart; the code is as follows:

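from plotnine import (ggplot, aes, geom_line, geom_point, scale_fill_manual,
                      theme_classic, theme, element_text)

plot = (# Base layer: specify the data and the default aesthetic mappings
        ggplot(df, aes(x='Number', y='省份', fill='Enter_or_Quit')) +
        # Draw the geometric objects
        geom_line(aes(group='省份')) +
        geom_point(shape='o', size=3, color='black') +
        scale_fill_manual(values=('#00AFBB', '#FC4E07')) +
        # Adjust the theme and details
        theme_classic() +
        theme(text=element_text(family='SimHei')))
print(plot)
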
  • ggplot() defines the base layer and marks the start of the plot. It maps the "Number" field to the x-axis, the "省份" (province) field to the y-axis, and the "Enter_or_Quit" field to fill color;

  • geom_line() draws a line layer. In this layer the group aesthetic is set to the "省份" field, so one line is drawn per province;

  • geom_point() draws a point layer with circular markers, a size of 3, and a black outline;

  • scale_fill_manual() sets the fill scale. Its values parameter gives the specific colors that the "Enter_or_Quit" field maps to; '#00AFBB' and '#FC4E07' are hexadecimal color codes. This completes the geometric objects.

Color choice is an important part of data visualization and a large topic in itself. For reasons of space we will not go into detail here, but we recommend two websites where interested readers can learn more:

  1. Introduction to hexadecimal color codes: https://www.mathsisfun.com/hexadecimal-decimal-colors.html

  2. Online color simulation: http://www.ku51.net/color/rgb.html

In the last two lines of code, theme_classic() applies the classic theme and theme(text=element_text(family='SimHei')) sets further theme details. Because the chart contains Chinese characters, the text must use a font that can display them, here SimHei; otherwise the characters will not render properly. The resulting chart is shown below:

picture

In the chart above, the red points show the number of green agricultural enterprises that exited, the blue points show the number that entered, and the line segment joining them shows the difference between the two, that is, the net increase in green agricultural enterprises in each province in 2022. From this dumbbell chart we can clearly see that Anhui, Henan, Yunnan, and Hunan lead the country in net increase of green agricultural enterprises, while the Tibet Autonomous Region, Shanghai, Tianjin, Qinghai, and some other provinces (municipalities) have low numbers of both entries and exits.

2. Slope chart

The data used in this section is "历年各省农村金融机构网点数量统计.xlsx" (number of rural financial institution outlets by province and year), which contains the number of rural financial institution outlets in 9 provinces from 2015 to 2022. Opened in WPS, it looks like this:

picture

This data comes from the "rural finance" sub-library of the Zhejiang University Carter-Qiyan China Agricultural Research Database (CCAD), the same database used in the previous section:

https://r.qiyandata.com/data_centre/CCAD

The data is already in long format. In the chart we want to use different colors to distinguish the provinces whose number of outlets increased from those where it decreased, so we first convert the data to a wide table, build an "up_or_down" field from the 2015 and 2022 outlet counts, and then convert the result back to a long table:

import pandas as pd

df = pd.read_excel('.\\Data\\历年各省农村金融机构网点数量统计.xlsx')
# Convert the original long table to a wide table (one column per year)
df = pd.pivot(df, index='省份', values='网点数量', columns='年份')
df.reset_index(inplace=True)
# Rename the columns so that the year labels are strings
df.columns = ['省份', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022']

# Add a field used to color the lines
df['up_or_down'] = df.apply(lambda x: 'Increase' if x['2015'] < x['2022'] else 'Decrease', axis=1)
# Convert the data back to a long table
df = pd.melt(df, id_vars=['省份', 'up_or_down'], var_name='年份', value_name='网点数量')
df['年份'] = df['年份'].astype('int32')
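
The same long-to-wide-to-long round trip can be traced on a toy table. The sketch below uses invented provinces and counts with English column names, and it swaps the row-wise apply() for numpy's vectorized np.where(), which is an alternative to the article's code rather than a copy of it.

```python
import numpy as np
import pandas as pd

# Hypothetical long table: two provinces, first and last year only
long_df = pd.DataFrame({"province": ["A", "A", "B", "B"],
                        "year": [2015, 2022, 2015, 2022],
                        "outlets": [100, 120, 90, 80]})

# Long -> wide: one row per province, one column per year
wide = long_df.pivot(index="province", columns="year",
                     values="outlets").reset_index()
wide.columns = ["province", "2015", "2022"]

# Flag each province by comparing its first and last year
wide["up_or_down"] = np.where(wide["2015"] < wide["2022"],
                              "Increase", "Decrease")

# Wide -> long again, carrying the new flag along on every row
back = pd.melt(wide, id_vars=["province", "up_or_down"],
               var_name="year", value_name="outlets")
back["year"] = back["year"].astype("int32")

print(back)
```

The flag is computed once per province in the wide table and then repeated on each of that province's rows after melting, so it can be mapped to line color.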

Because we want to show the values for the first and last year at the two ends of the chart, we first extract the label content and store it in variables. The code is as follows:

import numpy as np

# Text labels: non-empty only on the rows for the first (2015) and last (2022) year
left_label = df.apply(lambda x: x['省份'] + ',' + str(x['网点数量']) if x['年份'] == 2015 else '', axis=1)
right_label = df.apply(lambda x: str(x['网点数量']) + ',' + x['省份'] if x['年份'] == 2022 else '', axis=1)

# Endpoint values: NaN everywhere except on the first-year and last-year rows
left_point = df.apply(lambda x: x['网点数量'] if x['年份'] == 2015 else np.nan, axis=1)
right_point = df.apply(lambda x: x['网点数量'] if x['年份'] == 2022 else np.nan, axis=1)
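
An alternative way to prepare the endpoint labels, shown here only as a sketch, is to filter the rows for each end year into their own small tables instead of building full-length series padded with empty strings. The data, the English column names, and the `label` column are all invented for illustration.

```python
import pandas as pd

# Hypothetical long table with the first and last year of two provinces
toy = pd.DataFrame({"province": ["A", "A", "B", "B"],
                    "year": [2015, 2022, 2015, 2022],
                    "outlets": [100, 120, 90, 80]})

# Left end: "province,value" for the first year
left = toy[toy["year"] == 2015].copy()
left["label"] = left["province"] + "," + left["outlets"].astype(str)

# Right end: "value,province" for the last year
right = toy[toy["year"] == 2022].copy()
right["label"] = right["outlets"].astype(str) + "," + right["province"]

print(left["label"].tolist())   # ['A,100', 'B,90']
print(right["label"].tolist())  # ['120,A', '80,B']
```

Both approaches yield the same label text; the filtered tables simply keep each label next to the coordinates it belongs to.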

At this point, our original data has been processed, and we can start the drawing operation.

A slope chart has more details than a dumbbell chart, so the code is more involved than in the previous section, but the principle is still a stack of layers and is not hard to follow. The code falls into three parts: drawing the geometric objects, adding the text labels, and adjusting the theme and details (they are shown separately here for reasons of space and can be combined into a single block in practice). This also illustrates a small advantage of plotnine: because every layer in the grammar of graphics is joined with +, long code can be written in pieces, and new details can still be added with + after the figure is complete. The code is as follows:

from plotnine import (ggplot, aes, geom_line, geom_point, geom_vline, geom_text,
                      scale_color_manual, xlim, theme_void, theme,
                      element_rect, element_blank, element_text)

# Part 1: draw the geometric objects
plot = (# Base layer: specify the data and limit the length of the y-axis
        ggplot(df, aes(yend=max(df['网点数量'])*1.02)) +
        # Draw the geometric objects
        geom_line(aes(x='年份', y='网点数量', group='省份', color='up_or_down'), size=0.75) +
        scale_color_manual(values=('#FF4040', '#43CD80')) +
        geom_vline(xintercept=2015, linetype='solid', size=0.1) +
        geom_vline(xintercept=2022, linetype='solid', size=0.1) +
        geom_point(aes(x='年份', y=left_point), size=3, shape='o', fill='grey', color='black') +
        geom_point(aes(x='年份', y=right_point), size=3, shape='o', fill='grey', color='black') +
        xlim(2012, 2025))

The first step draws the geometric objects. We again use ggplot() for the base layer, specify the data in it, and use the yend aesthetic to set the upper limit of the y-axis to 1.02 times the maximum of "网点数量" (number of outlets);

geom_line() draws the line layer. We group by the "省份" (province) field and map "up_or_down" to line color, then give the specific colors with scale_color_manual();

Next, geom_vline() is used twice to draw vertical lines at x=2015 and x=2022, both solid; geom_point() is then used twice to draw the left and right endpoints. The last line, xlim(), sets the x-axis range to 2012 to 2025, slightly wider than the range of the data, which leaves room for the labels on both sides. This completes the geometric objects.

The second step adds the text labels. The code is as follows:

# Part 2: add the text labels
plot = (plot +
        geom_text(label=left_label, x=2014, y=df['网点数量'], size=8, ha='right') +
        geom_text(label=right_label, x=2023, y=df['网点数量'], size=8, ha='left') +
        geom_text(label='2015', x=2015, y=max(df['网点数量'])*1.02, size=12) +
        geom_text(label='2022', x=2022, y=max(df['网点数量'])*1.02, size=12))

In the code above, the first two geom_text() calls add the text labels on both sides. The ha parameter sets the horizontal alignment: the left labels are right-aligned and the right labels are left-aligned. The labels' x coordinates are 2014 and 2023, one year outside 2015 and 2022 respectively, and their y coordinates equal those of the endpoints. The last two geom_text() calls add the years shown above the left and right vertical lines, with the y coordinate at the upper limit of the y-axis.

In the third step, we apply the minimal theme theme_void(). The code is as follows:

# Part 3: adjust the theme and details
plot = (plot +
        theme_void() +
        theme(panel_background=element_rect(fill="white"),
              legend_position='none',
              axis_text=element_blank(),
              axis_title=element_blank(),
              axis_ticks=element_blank(),
              text=element_text(family='SimHei')))

In the code above, the panel_background parameter of theme() sets the background to white, and legend_position hides the legend. The axis_text, axis_title, and axis_ticks parameters control the axis labels, titles, and tick marks respectively; here all three are hidden. Finally, we get the chart shown below:

picture

In this slope chart, green lines mark the provinces where the number of rural financial institution outlets increased from 2015 to 2022, red lines mark the provinces where it decreased, and the line between the endpoints shows how the number changed over time.

Part 5 Summary

After working through this article, you have presumably seen how powerful plotnine is. It greatly strengthens Python's data visualization capabilities, and mastering its grammar of graphics and layer-based way of thinking helps us visualize data elegantly and accurately. That said, the author believes data visualization is easy to learn but hard to master, and for reasons of space some details could not be covered here. We hope this article sparks readers' interest and helps with your data visualization work. See you next time.

Origin blog.csdn.net/weixin_55633225/article/details/132556795