Seaborn Data Visualization Fundamentals

Table of contents

Introduction

Knowledge points

Introduction to Seaborn

Quickly optimize graphics

Seaborn drawing API

1. Scatter plot:

Parameter hue

hue + hue_order

Parameter style

2. Line graph

3. Category diagram

Draw a boxplot

Draw a violin plot

Draw Augmented Boxplots

Draw a point-and-line graph

Draw a bar graph

Draw a bar chart of counts

4. Distribution map

5. Regression diagram

6. Matrix diagram

Homework

Experiment summary


Introduction

        Matplotlib is an open-source plotting library for Python. It supports a rich set of plot types, is straightforward to use, and has comprehensive API documentation, which makes it popular with Python engineers, researchers, and data engineers. Seaborn is a high-level plotting library built on top of Matplotlib. It produces better-looking graphics without complicated customization and is well suited to exploratory data visualization.

Knowledge points

  • Association diagram
  • Category diagram
  • Distribution diagram
  • Regression diagram
  • Matrix diagram
  • Combination diagram

Introduction to Seaborn

        Matplotlib is arguably the best plotting library for Python, but it has one very troublesome problem: it is too complicated. With more than 3,000 pages of official documentation, thousands of methods, and tens of thousands of parameters, it lets you do anything, yet it is hard to know where to start. Tuning a Matplotlib figure until it looks really good is often tedious.

        Seaborn wraps the Matplotlib core library in a higher-level API, so you can draw better-looking graphics with little effort. Its visual appeal comes mainly from more comfortable color schemes and more refined styling of graphic elements. The example gallery in the official Seaborn documentation shows the kind of figures it can produce.

Seaborn has the following features:

  • Several built-in, polished style themes.
  • Color palette tools that make it easy to choose colors for your data.
  • Simpler plotting of univariate and bivariate distributions, which can also be used to compare subsets of the data with each other.
  • Easier fitting and visualization of regressions between independent and dependent variables.
  • Visualization of data matrices, with clustering algorithms to reveal their structure.
  • Plotting and statistical functions for time series, with flexible estimation of uncertainty.
  • Grid-based construction of more complex collections of plots.

        In addition, Seaborn works well with Matplotlib and Pandas data structures, which makes it a convenient visualization tool for data mining.

Quickly optimize graphics

        Matplotlib's default plot style is not especially attractive, and Seaborn offers a quick way to improve it. First, let's draw a simple figure with Matplotlib alone.

import matplotlib.pyplot as plt
%matplotlib inline

x = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
y_bar = [3, 4, 6, 8, 9, 10, 9, 11, 7, 8]
y_line = [2, 3, 5, 7, 8, 9, 8, 10, 6, 7]

plt.bar(x, y_bar)
plt.plot(x, y_line, '-o', color='y')

        Improving the figure with Seaborn is very simple: just call the style declaration sns.set() that Seaborn provides before drawing.

import seaborn as sns

sns.set()  # Declare that the Seaborn style should be used

plt.bar(x, y_bar)
plt.plot(x, y_line, '-o', color='y')

 

        Compared with Matplotlib's default plain white background, Seaborn's default light gray grid background looks more refined and comfortable. The color of the bars and the font size of the axis labels have also changed slightly.

The default parameters of sns.set() are:

sns.set(context='notebook', style='darkgrid', palette='deep', font='sans-serif', font_scale=1, color_codes=True, rc=None)

where:

  • context= controls the default scale of plot elements (labels, lines, and so on) and takes one of four values, {paper, notebook, talk, poster}, with poster > talk > notebook > paper.
  • style= controls the default style and takes one of {darkgrid, whitegrid, dark, white, ticks}. Try each of them to see how they differ.
  • palette= selects the preset color palette, for example {deep, muted, bright, pastel, dark, colorblind}. Again, try each of them to see the difference.
  • font= sets the font, font_scale= scales the font size, and color_codes= controls whether shorthand color codes such as 'r' are remapped to the colors of the palette instead of keeping their original Matplotlib meaning.
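As a rough illustration of the parameters above, the following sketch compares the base font size of each context preset and then applies a non-default theme. sns.plotting_context() and sns.axes_style() are Seaborn's helpers for inspecting these settings. The particular values chosen here (talk, whitegrid, muted, 1.2) are arbitrary and only serve as an example.

```python
import seaborn as sns

# Look up the base font size that each context preset would apply.
# Passing a name only returns the settings; nothing is changed globally.
sizes = {name: sns.plotting_context(name)["font.size"]
         for name in ["paper", "notebook", "talk", "poster"]}
print(sizes)  # sizes grow in the order paper < notebook < talk < poster

# Apply an arbitrary non-default theme to every plot drawn afterwards.
sns.set(context="talk", style="whitegrid", palette="muted", font_scale=1.2)

# With no argument, axes_style() reports the style currently in effect.
current_style = sns.axes_style()
print(current_style["axes.grid"])  # the *grid styles switch the grid on
```

Calling sns.set() again with no arguments returns to the default theme.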

Seaborn drawing API

        Seaborn has a little over 50 API functions and classes in total. Compared with the thousands in Matplotlib, it is small and compact. Grouped by the scenario they suit, Seaborn's plotting methods fall into roughly 6 categories: association diagrams, category diagrams, distribution diagrams, regression diagrams, matrix diagrams, and combination diagrams. Each category contains a different number of plotting functions.

        Next, we use real data to draw graphics with Seaborn for each of these scenarios.

Association diagram

        When we need to analyze relationships between variables, we can use the following APIs provided by Seaborn.

Association analysis API   Description
relplot                    Draws a relational plot
scatterplot                Scatter plot for multi-dimensional analysis
lineplot                   Line plot for multi-dimensional analysis

        relplot is short for relational plots. It presents relationships in the data in two main styles: scatter plots and line plots. In this experiment, we use the iris dataset to explore it.

        Before plotting, get familiar with the iris dataset. It has 150 rows and 5 columns: sepal length, sepal width, petal length, petal width, and flower species. The first four columns are numeric, and the last column labels each flower as one of three species: Iris Setosa, Iris Versicolour, and Iris Virginica.

# Download the seaborn datasets from a mirror in China so that loading the dataset in the next step does not fail
!wget -nc "https://labfile.oss.aliyuncs.com/courses/2616/seaborn-data.zip"
!unzip seaborn-data.zip -d ~/

iris = sns.load_dataset("iris")
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

        Now, by specifying the features for x and y, we can draw a scatter plot, which is the default.

1. Scatter plot:

The plot produced by relplot is a scatter plot by default, so kind="scatter" can be omitted from relplot(kind="scatter") when drawing one.

sns.relplot(x="sepal_length", y="sepal_width", data=iris)

        However, the figure above does not show how the data relates to the flower categories. It is better to color the points by category with hue="species".

Parameter hue

hue: distinguishes categories by color

sns.relplot(x="sepal_length", y="sepal_width", hue="species", data=iris)

hue + hue_order

        The order of the hue levels in the legend can be controlled with hue_order (a list). If it is not set, the order is determined automatically from the data.
        If hue is a continuous numeric variable, hue_order has no effect.
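A minimal sketch of hue_order follows. To keep it runnable without downloading anything, it uses a few hand-typed rows in the style of iris (the frame name toy and the chosen order are assumptions made for this example).

```python
import pandas as pd
import seaborn as sns

# A few hand-typed sample rows with iris-like column names.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Without hue_order the legend follows the order found in the data;
# here we force a different order explicitly.
order = ["virginica", "versicolor", "setosa"]
g = sns.relplot(x="sepal_length", y="sepal_width",
                hue="species", hue_order=order, data=toy)
```

The same hue_order list also fixes which color each category receives, so the colors stay stable across several figures.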

Parameter style

style: distinguishes categories by marker shape

sns.relplot(x="sepal_length", y="sepal_width",
            hue="species", style="species", data=iris)

 

        The style parameter gives each category of points a different marker shape (here style="species"). For the remaining parameters, please read the official documentation.

2. Line graph

        This method supports line plots as well as scatter plots: just specify kind="line". Line plots and scatter plots suit different kinds of data. When a line plot is drawn, a 95% confidence interval is added automatically wherever several observations share an x value.

sns.relplot(x="sepal_length", y="petal_length",
            hue="species", style="species", kind="line", data=iris)

        So far we have mentioned 3 APIs: relplot, scatterplot, and lineplot. In fact, you can think of relplot, which we have been using, as a combined version of scatterplot and lineplot.

        This brings us to the concept of API levels in Seaborn. Seaborn's APIs come in two kinds: Figure-level and Axes-level. relplot is a Figure-level interface, while scatterplot and lineplot are Axes-level interfaces.

        The difference is that Axes-level functions integrate more flexibly and tightly with Matplotlib, while Figure-level functions are more like convenience functions, suited to quick use.

        For example, the figure above can also be drawn with the lineplot function. Simply drop the kind parameter that was passed to relplot.

sns.lineplot(x="sepal_length", y="petal_length",
             hue="species", style="species", data=iris)

 

3. Category diagram

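        As with the association diagram, the category diagram has a Figure-level interface, catplot, which is short for categorical plots. catplot is in fact a collection of the following Axes-level plotting APIs:

  • stripplot (kind="strip") and swarmplot (kind="swarm"): categorical scatter plots
  • boxplot (kind="box"), violinplot (kind="violin"), and boxenplot (kind="boxen"): categorical distribution plots
  • pointplot (kind="point"), barplot (kind="bar"), and countplot (kind="count"): categorical estimate plots

Next, let's look at what catplot draws. By default, this method draws a kind="strip" scatter plot.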

sns.catplot(x="sepal_length", y="species", data=iris)

        kind="swarm" uses the beeswarm method to keep the points from overlapping, which shows the distribution of the data more clearly.

sns.catplot(x="sepal_length", y="species", kind="swarm", data=iris)

        As before, the hue= parameter can add another dimension to the figure. Because the iris dataset has only one categorical column, we do not add hue= here. If a dataset has several categorical columns, hue= helps to tell the data points apart.
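As a sketch of the multi-category case, the example below uses an invented data frame with two categorical columns (the names toy, height, site, and season are made up for illustration). One categorical column is placed on an axis and the other is encoded as color through hue=.

```python
import pandas as pd
import seaborn as sns

# Invented measurements with two categorical columns, "site" and "season".
toy = pd.DataFrame({
    "height": [4.8, 5.1, 5.6, 6.0, 5.0, 5.4, 6.1, 6.5],
    "site":   ["north", "north", "north", "north",
               "south", "south", "south", "south"],
    "season": ["spring", "spring", "autumn", "autumn"] * 2,
})

# One categorical column goes on an axis, the other is encoded as color.
g = sns.catplot(x="height", y="site", hue="season", kind="strip", data=toy)
```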

        Next, let's try the drawing effects of several other graphics in turn.

Draw a boxplot

sns.catplot(x="sepal_length", y="species", kind="box", data=iris)

Draw a violin plot

sns.catplot(x="sepal_length", y="species", kind="violin", data=iris)

Draw Augmented Boxplots

sns.catplot(x="species", y="sepal_length", kind="boxen", data=iris)

Draw a point-and-line graph

sns.catplot(x="sepal_length", y="species", kind="point", data=iris)

Draw a bar graph

sns.catplot(x="sepal_length", y="species", kind="bar", data=iris)

Draw a bar chart of counts

sns.catplot(x="species", kind="count", data=iris)

4. Distribution map

        Distribution diagrams are used to visualize how variables are distributed, and are generally divided into univariate and multivariate distributions. Multivariate here mostly means bivariate, because more variables cannot be drawn as an intuitive figure.

        The distribution plotting methods provided by Seaborn include jointplot, pairplot, distplot, and kdeplot. Let's look at each of them in turn.

        distplot is Seaborn's quick way to look at a univariate distribution. By default, it draws a histogram and overlays a fitted kernel density estimate. Note that distplot has been deprecated since Seaborn 0.11 in favor of histplot and displot, so recent versions emit a warning when it is called.

sns.distplot(iris["sepal_length"])

        distplot provides parameters for adjusting the histogram and the kernel density estimate. For example, kde=False draws only the histogram and hist=False draws only the kernel density estimate. There is also kdeplot, which is dedicated to kernel density estimates. Its result is the same as distplot(hist=False), but kdeplot offers more custom settings.

sns.kdeplot(iris["sepal_length"])

        jointplot is mainly used to draw bivariate distributions. For example, let's look for the relationship between the two feature variables sepal_length and sepal_width.

sns.jointplot(x="sepal_length", y="sepal_width", data=iris)

        Although jointplot does not belong to the relplot and catplot family, it also behaves like a Figure-level interface: it creates its own figure and returns a JointGrid. It supports the kind= parameter for drawing different styles of distribution plot. For example, a kernel density estimate plot:

sns.jointplot(x="sepal_length", y="sepal_width", data=iris, kind="kde")

Hexagonal bin (hexbin) count plot:

sns.jointplot(x="sepal_length", y="sepal_width", data=iris, kind="hex")

 Regression fit plot:

sns.jointplot(x="sepal_length", y="sepal_width", data=iris, kind="reg")

        The last method, pairplot, is more powerful: it draws pairwise comparisons of all the feature variables in the dataset at once. By default, the diagonal shows univariate distributions and the other panels show bivariate distributions.

sns.pairplot(iris)

        At this point, introducing a third dimension with hue="species" makes the figure more intuitive.

sns.pairplot(iris, hue="species")

5. Regression diagram

        Next, we turn to regression diagrams. The main plotting functions for regression are lmplot and regplot.

        When drawing a regression diagram with regplot, you only need to specify the independent and dependent variables, and regplot fits the linear regression automatically.

sns.regplot(x="sepal_length", y="sepal_width", data=iris)
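As a rough check that the line drawn by regplot is an ordinary least-squares fit, the sketch below compares its slope with an independent fit from np.polyfit. The data is invented, and reading the line back through ax.lines[0] relies on how current Seaborn versions happen to draw the plot, so treat it as an illustration and not as a supported API.

```python
import numpy as np
import seaborn as sns

# Invented data: a noisy straight line.
rng = np.random.default_rng(0)
x = rng.uniform(4, 8, size=100)
y = 0.5 * x + 1.0 + rng.normal(scale=0.3, size=100)

# ci=None skips the bootstrapped confidence band to keep the sketch fast.
ax = sns.regplot(x=x, y=y, ci=None)

# Assumption: the fitted line is the first Line2D on the Axes (the points
# are drawn as a scatter collection), so its slope can be read back
# from its endpoints.
line_x, line_y = ax.lines[0].get_data()
plotted_slope = (line_y[-1] - line_y[0]) / (line_x[-1] - line_x[0])

# Ordinary least-squares fit computed independently with NumPy.
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)
print(plotted_slope, ols_slope)  # the two slopes should agree closely
```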

        lmplot also draws regression diagrams, but it supports a third dimension for comparison, for example by setting hue="species".

sns.lmplot(x="sepal_length", y="sepal_width", hue="species", data=iris)

6. Matrix diagram

        The two most commonly used matrix diagram functions are heatmap and clustermap.

        As the name suggests, heatmap is mainly used to draw heat maps.

import numpy as np

sns.heatmap(np.random.rand(10, 10))

         Heatmaps are very useful in some scenarios, such as drawing a heatmap of variable correlation coefficients.
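A minimal sketch of a correlation-coefficient heat map, assuming an invented data frame named features (with real data you would pass the numeric columns of your own data set instead):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Invented numeric features; "b" is built to correlate with "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
features = pd.DataFrame({
    "a": a,
    "b": 0.8 * a + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

# Pairwise Pearson correlation coefficients (a square, symmetric table).
corr = features.corr()

# Fix the color scale to [-1, 1] and print each coefficient in its cell.
ax = sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True, fmt=".2f")
```

Fixing vmin and vmax keeps the colors comparable between figures, since correlation coefficients always lie between -1 and 1.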

        In addition, clustermap draws hierarchical clustering diagrams. As shown below, we first remove the final target column from the original dataset and then pass in the feature data. You need some knowledge of hierarchical clustering to understand what the figure represents.

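# pop() removes the "species" target column from iris in place,
# leaving only the numeric feature columns for clustering.
iris.pop("species")
sns.clustermap(iris)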

        If you browse the official documentation, you will find that Seaborn also contains a number of classes whose names start with capital letters, such as JointGrid and PairGrid. The lowercase functions jointplot and pairplot are convenience wrappers around these classes. The two forms differ slightly in flexibility, but not in essence.

        In addition, the official Seaborn documentation introduces some auxiliary components, such as style control and color customization. These APIs are not difficult to use. The key is to practice.

Homework

Use Seaborn to explore the example dataset tips = sns.load_dataset("tips") through data visualization.

Answer: you can refer to the following two articles

Detailed data visualization library Seaborn tutorial (1) - relplot: relationship diagram (visualization of the relationship between statistics)

Detailed data visualization library Seaborn tutorial (2) - catplot: typed data for axis drawing

Experiment summary

        This chapter gave a brief introduction to using Seaborn. One point needs explaining: the relationship between Seaborn and Matplotlib. Seaborn is not intended to replace Matplotlib and is best seen as a complement to it. Matplotlib is highly customizable and can achieve almost any effect you want, while Seaborn is simple and fast: a few lines of code produce a decent figure. In short, Matplotlib is good at pure plotting, while Seaborn is mostly used for exploratory data visualization.

Origin blog.csdn.net/m0_69478345/article/details/130044341