Practical data analysis: drawing statistical graphics with Seaborn

1.1 Seaborn: drawing statistical graphics

Learning objectives

  • Know the basic use of seaborn
  • Plot univariate distribution graphs
  • Plot bivariate distribution graphs
  • Plot pairwise bivariate distributions

Matplotlib is a capable plotting library, but its API is a common source of frustration: with thousands of functions and parameters, it can do almost anything, yet it is hard to know where to start.

Seaborn wraps the Matplotlib core library in a higher-level API that makes it easy to draw attractive graphics. Its appeal lies mainly in more pleasing color palettes and more refined styling of graphic elements.

Before drawing charts with Seaborn, you need to install and import it. The specific code is as follows:

# Install (run in a terminal)
pip3 install seaborn
# Import
import seaborn as sns

Next, we begin studying the Seaborn library.

1 Visualizing the distribution of data

When working with a set of data, the first thing you usually need to do is understand how the variables are distributed.

  • For univariate data, histograms or kernel density curves are a good choice.
  • For bivariate data, multi-panel graphics can be used, such as scatter plots, two-dimensional histograms, and kernel density estimation plots.

For these cases, the Seaborn library provides functions for drawing univariate and bivariate distributions, such as the distplot() function and the jointplot() function. Their use is introduced below.

2 Plot univariate distribution

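The simplest way to describe the distribution of a single variable is a histogram. Seaborn provides the distplot() function, which by default draws a histogram together with a kernel density estimation curve. Note that distplot() has been deprecated since seaborn 0.11 in favor of histplot() and displot(), so recent versions emit a warning when it is called. The syntax of the distplot() function is as follows.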

seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False)

The meanings of the commonly used parameters of the above function are as follows:

  • (1) a: the data to be observed, which can be a Series, a one-dimensional array, or a list.
  • (2) bins: controls the number of bars.
  • (3) hist: a Boolean indicating whether to draw the histogram.
  • (4) kde: a Boolean indicating whether to draw a Gaussian kernel density estimation curve.
  • (5) rug: a Boolean indicating whether to draw a rug plot (a small tick for each observation) along the support axis.

An example of drawing a histogram with the distplot() function is as follows.

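import numpy as np
import seaborn as sns

sns.set()  # Set the style, colors, fonts, etc.; the defaults are used here
np.random.seed(0)  # Fix the random seed; without it, the figure differs on every run
arr = np.random.randn(100)  # Generate an array of 100 random numbers

ax = sns.distplot(arr, bins=10, hist=True, kde=True, rug=True)  # Draw the histogram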

In the above example, the numpy library is imported first to generate the array. seaborn's set() function then applies the default plot style, and np.random.seed() fixes the seed of the random number generator so that the same random numbers are generated on every run. Next, randn() generates an array of 100 random numbers, and finally distplot() draws the histogram.

The running results are shown in the figure below.

[Figure: histogram with kernel density estimation curve and rug plot]

As can be seen from the picture above:

  • The histogram has 10 bars in total, each colored blue, with a kernel density estimation curve drawn over them.
  • Judging by the height of the bars, most of the random values fall in the interval from -1 to 1, and few are less than -2.

A histogram usually gives an intuitive view of how sample data are distributed, but it has a drawback: its appearance can change considerably with the number of bars. To address this problem, a kernel density estimation curve can be drawn instead.

  • Kernel density estimation is used in probability theory to estimate an unknown density function. It is a non-parametric method that shows the distribution characteristics of the data sample itself.
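To make the idea concrete, a Gaussian kernel density estimate can be computed by hand with numpy: one small Gaussian bump is placed on every observation and the bumps are averaged. This is only an illustrative sketch; the helper name gaussian_kde_1d and the normal-reference bandwidth rule used here are assumptions of this example, not necessarily what seaborn uses internally.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at each distance
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale by the bandwidth
    return kernels.sum(axis=1) / (samples.size * bandwidth)

rng = np.random.default_rng(0)
samples = rng.standard_normal(200)

# Normal-reference rule of thumb for the bandwidth (an assumption of this sketch)
bandwidth = 1.06 * samples.std(ddof=1) * samples.size ** (-1 / 5)

grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid, bandwidth)

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = density.sum() * (grid[1] - grid[0])
```

A larger bandwidth gives a smoother curve and a smaller one follows the individual observations more closely, which plays a role similar to the number of bars in a histogram.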

An example of plotting a kernel density estimation curve through the distplot() function is as follows.

# Create an array of 500 random integers in [0, 100)
array_random = np.random.randint(0, 100, 500)
# Plot the kernel density estimation curve
sns.distplot(array_random, hist=False, rug=True)

In the above example, np.random.randint() first returns an array of 500 random integers that are at least 0 and less than 100, and then the distplot() function is called to draw the kernel density estimation curve.

The running results are shown in the figure.

[Figure: kernel density estimation curve for 500 random integers]

As can be seen from the figure above, the chart contains a kernel density estimation curve, and a small thin tick is drawn along the x-axis for each observed value.

3 Plot a bivariate distribution

It is also useful to visualize the joint distribution of two variables. The simplest way to do this in Seaborn is the jointplot() function, which creates a multi-panel figure (such as a scatter plot, a two-dimensional histogram, or a kernel density estimate) showing the bivariate relationship between the two variables together with the univariate distribution of each variable on separate axes.

The syntax format of the jointplot() function is as follows.

seaborn.jointplot(x, y, data=None, 
                  kind='scatter', color=None, height=6,
                  ratio=5, space=0.2)

The meanings of the commonly used parameters of the above function are as follows:

  • (1) x: the name of the column in data to plot on the x-axis.
  • (2) y: the name of the column in data to plot on the y-axis.
  • (3) data: the data set to be drawn.
  • (4) kind: the type of plot to draw.
  • (5) color: the color of the plot elements.
  • (6) height: the size of the figure, which is square.
  • (7) ratio: the ratio of the central plot's height to the marginal plots' height. The larger this value, the larger the share taken by the central plot.
  • (8) space: the amount of space between the central plot and the marginal plots.

The following takes scatter plots, two-dimensional histograms, and kernel density estimation curves as examples to introduce how to use Seaborn to draw these graphics.

3.1 Draw a scatter plot

An example of calling the seaborn.jointplot() function to draw a scatter plot is as follows.

import numpy as np
import pandas as pd
import seaborn as sns

# Create a DataFrame object
dataframe_obj = pd.DataFrame({
    "x": np.random.randn(500),
    "y": np.random.randn(500),
})
# Draw the scatter plot
sns.jointplot(x="x", y="y", data=dataframe_obj)

In the above example, a DataFrame object dataframe_obj is first created as the data for the scatter plot, with 500 random numbers for each of the x-axis and the y-axis. Then the jointplot() function is called to draw a scatter plot whose x-axis is named "x" and whose y-axis is named "y".

The running results are shown in the figure.

[Figure: scatter plot drawn with jointplot()]

3.2 Draw a two-dimensional histogram

A two-dimensional histogram is drawn here as a "hexbin" plot: it shows the number of observations that fall within each hexagonal bin, which makes it well suited to larger data sets. When calling the jointplot() function, simply pass kind="hex" to draw a two-dimensional histogram. The specific example code is as follows.

# Draw a two-dimensional histogram
sns.jointplot(x="x", y="y", data=dataframe_obj, kind="hex")

The running results are shown in the figure.

[Figure: two-dimensional (hexbin) histogram drawn with jointplot()]

The depth of each hexagon's color shows how densely the data are packed there. In addition, histograms of the individual variables are still drawn above and to the right of the central plot. Note that a two-dimensional histogram is best drawn on a white background.

3.3 Draw kernel density estimation graph

Kernel density estimation can also be used to view bivariate distributions, which are then represented by contour plots. When calling the jointplot() function, simply pass kind="kde" to draw the kernel density estimation graph. The specific example code is as follows.

sns.jointplot(x="x", y="y", data=dataframe_obj, kind="kde")

In the above example, a contour plot of the bivariate kernel density is drawn, and the kernel density curve of each variable is drawn above and to the right of the central plot.

The running results are shown in the figure.

[Figure: kernel density estimation contour plot drawn with jointplot()]

The color depth of the contour lines shows in which ranges the values are most concentrated and in which they are sparsest.

4 Plotting pairs of bivariate distributions

To plot multiple pairwise bivariate distributions in a data set, you can use the pairplot() function, which creates a matrix of axes and shows the relationship between each pair of variables in the DataFrame object. In addition, pairplot() draws the univariate distribution of each variable as a histogram on the diagonal axes.

Next, use the sns.pairplot() function to draw the relationships between the variables in a data set. The sample code is as follows:

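# Load the iris data set (here from a local CSV copy)
import pandas as pd
dataset = pd.read_csv("./data/iris.csv")

dataset.head()

[Figure: first five rows of the iris data set]

In the above example, the iris data set is read from a local CSV file with pd.read_csv(). The same data set is also available through sns.load_dataset("iris"), which downloads it from seaborn's online example-data repository. Multiple pairwise bivariate distributions are then drawn from it.

# Draw multiple pairwise bivariate distributions
sns.pairplot(dataset)

The results are shown below.

[Figure: pairwise bivariate distributions of the iris data set]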

5 Summary

  • Basic use of seaborn [Understanding]
  • Draw univariate distribution graph [Know]
    • seaborn.distplot()
  • Draw a bivariate distribution graph [Know]
    • seaborn.jointplot()
  • Plot paired bivariate distribution graphs [Know]
    • seaborn.pairplot()

1.2 Plotting with categorical data

Learning objectives

  • Use seaborn to draw box plots and violin plots

A data set can contain many types of data. Besides continuous feature variables, the most common type is categorical data, such as a person's gender, education, or hobbies. Such values cannot be represented as continuous variables and are represented as categorical data instead.

Seaborn provides dedicated visualization functions for categorical data. These functions fall roughly into the following two types:

  • Categorical scatter plots: swarmplot() and stripplot().
  • Categorical distribution plots: boxplot() and violinplot().

The following two sections briefly introduce the graphics that can be drawn for categorical data.

1 Categorical scatter plots

A categorical scatter plot can be drawn with the stripplot() function. The syntax of the stripplot() function is as follows.

seaborn.stripplot(x=None, y=None, hue=None, data=None, jitter=True)

The meanings of the commonly used parameters of the above function are as follows:

  • (1) x: the name of the column in data to plot on the x-axis.
  • (2) y: the name of the column in data to plot on the y-axis.
  • (3) hue: the name of the column in data that holds the category used for coloring.
  • (4) data: the data set used for drawing. If x and y are not given, it is interpreted as wide-form data; otherwise it is interpreted as long-form data.
  • (5) jitter: the amount of jitter to apply (only along the categorical axis). It is useful when many data points overlap; True, the default in current seaborn releases, applies a reasonable default amount, and False disables it.

To make this concrete, the following example draws a scatter plot with the stripplot() function.

# Load the tips data set
import pandas as pd
tips = pd.read_csv("./data/tips.csv")

sns.stripplot(x="day", y="total_bill", data=tips, jitter=False)

The running results are shown in the figure below.

[Figure: stripplot of total_bill by day with jitter=False]

As can be seen from the above figure, the x-axis of the chart holds categorical data, and some data points overlap each other, making them hard to distinguish. To solve this problem, pass jitter=True when calling the stripplot() function so that the points are spread out along the x-axis. The modified sample code is as follows.

sns.stripplot(x="day", y="total_bill", data=tips, jitter=True)

The running results are shown in the figure below.

[Figure: stripplot of total_bill by day with jitter=True]

In addition, the swarmplot() function can also be called to draw a categorical scatter plot. Its advantage is that it adjusts the points so that they do not overlap, which lets the distribution of the data be seen clearly. The sample code is as follows.

sns.swarmplot(x="day", y="total_bill", data=tips)

The running results are shown in the figure.

[Figure: swarmplot of total_bill by day]
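To compare the two functions directly, both plots can be drawn side by side on shared axes. The following sketch uses hypothetical tips-like data generated with numpy so that it runs without the CSV file; the numeric jitter value, the point size, and the Agg backend are assumptions of this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the script runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
day_order = ["Thur", "Fri", "Sat", "Sun"]

# Hypothetical stand-in for the tips data: 30 bills per day
tips_like = pd.DataFrame({
    "day": np.repeat(day_order, 30),
    "total_bill": rng.gamma(shape=4.0, scale=5.0, size=120),
})

fig, (ax_strip, ax_swarm) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# jitter also accepts a number that sets the width of the random horizontal spread
sns.stripplot(data=tips_like, x="day", y="total_bill",
              order=day_order, jitter=0.2, ax=ax_strip)
# swarmplot places the points so that they do not overlap
sns.swarmplot(data=tips_like, x="day", y="total_bill",
              order=day_order, size=3, ax=ax_swarm)

ax_strip.set_title("stripplot")
ax_swarm.set_title("swarmplot")
```

With many points per category, swarmplot() can become slow or run out of room, in which case a jittered stripplot() is the more practical choice.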

2 Data distribution within categories

If you want to view the distribution of the data within each category, a scatter plot is not enough because it is not intuitive enough. For this situation, the following two graphics can be drawn:

  • Box plot:
    • A box plot, also known as a box-and-whisker plot, is a statistical chart used to display the dispersion of a set of data. It is named after its box-like shape.
    • The box plot was introduced in 1977 by the famous American statistician John Tukey. It can display the maximum, minimum, median, and upper and lower quartiles of a set of data.
[Figure: structure of a box plot]
  • Violin plot:
    • A violin plot is used to display the distribution of data and its probability density.
    • This chart combines the characteristics of box plots and density plots, and is mainly used to display the shape of a distribution.
    • The thick black bar in the middle spans the 25th to the 75th percentile, the thin black line extending from it is described here as the 95% confidence interval (the 2.5th to the 97.5th percentile), and the white point is the median. Note that the miniature box drawn inside seaborn's violins follows box-plot whiskers, which by default reach up to 1.5 times the interquartile range beyond the quartiles.
    • Box plots are limited in what they can show, and their simple design often hides important details about the distribution of the data. For example, a box plot cannot reveal the shape of the distribution, such as whether it has more than one peak. Violin plots show more detail, but they can also carry more noise.
[Figure: structure of a violin plot]
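The statistics behind a box plot are easy to compute by hand. The sketch below works through them with numpy for a small made-up sample; it assumes the common rule, also the default in seaborn's boxplot(), that points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers.

```python
import numpy as np

# Small made-up sample with one unusually large value
values = np.array([7.0, 9.0, 10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 21.0, 48.0])

# Lower quartile, median, and upper quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Fences: points outside them are treated as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers end at the most extreme points that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3)             # 10.5 14.0 17.5
print(lower_fence, upper_fence)   # 0.0 28.0
print(outliers)                   # [48.]
print(whisker_low, whisker_high)  # 7.0 21.0
```

Here the value 48 lies above the upper fence of 28, so a box plot would draw it as a separate point rather than extend the whisker to it.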

Next, we will give a brief introduction to the drawing of box plots and violin plots in the Seaborn library.

2.1 Draw box plots

The function used to draw box plots in seaborn is boxplot(), and its syntax format is as follows:

seaborn.boxplot(x=None, y=None, hue=None, data=None)

A specific example of using the boxplot() function to draw a box plot is as follows.

sns.boxplot(x="day", y="total_bill", data=tips)

In the above example, a box plot is drawn from the tips data set loaded earlier. The x-axis in the figure is named day and ranges from Thur to Sun (Thursday to Sunday); the y-axis is named total_bill and ranges from about 10 to 50.

The running results are shown in the figure.

[Figure: box plot of total_bill by day]

As can be seen from the figure:

  • Most of the data in the Thur column are less than 30, but there are 5 outliers greater than 30.
  • Most of the data in the Fri column are less than 30, and there is only one outlier greater than 40.
  • There are 3 outliers greater than 40 in the Sat column.
  • There are 2 outliers greater than 40 in the Sun column.

2.2 Draw a violin plot

The function used to draw violin plots in seaborn is violinplot(), and its syntax is as follows:

seaborn.violinplot(x=None, y=None, hue=None, data=None)

The sample code for drawing a violin plot with the violinplot() function is as follows:

sns.violinplot(x="day", y="total_bill", data=tips)

In the above example, a violin plot is drawn from the same tips data set. The x-axis in the chart is named day and the y-axis is named total_bill.

The running results are shown in the figure.

[Figure: violin plot of total_bill by day]

As can be seen from the figure:

  • Most values in the Thur column lie between 5 and 25.
  • Most values in the Fri column lie between 5 and 30.
  • Most values in the Sat column lie between 5 and 35.
  • Most values in the Sun column lie between 5 and 40.
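Because a box plot and a violin plot summarize the same data in different ways, it can help to draw them next to each other. The sketch below does this with hypothetical tips-like data generated with numpy; the inner="box" argument draws a miniature box plot inside each violin, and the Agg backend is an assumption so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the script runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
day_order = ["Thur", "Fri", "Sat", "Sun"]

# Hypothetical stand-in for the tips data: 40 bills per day
tips_like = pd.DataFrame({
    "day": np.repeat(day_order, 40),
    "total_bill": rng.gamma(shape=4.0, scale=5.0, size=160),
})

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Box plot: quartiles, whiskers, and outliers per day
sns.boxplot(data=tips_like, x="day", y="total_bill",
            order=day_order, ax=ax_box)
# Violin plot: density shape per day with a miniature box plot inside
sns.violinplot(data=tips_like, x="day", y="total_bill",
               order=day_order, inner="box", ax=ax_violin)

ax_box.set_title("boxplot")
ax_violin.set_title("violinplot")
```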

3 Summary

  • Categorical scatter plots
    • seaborn.stripplot()
  • Distribution of data within categories
    • Box plot
      • seaborn.boxplot()
    • Violin plot
      • seaborn.violinplot()

1.3 Case: NBA player data analysis

1 Introduction to basic data

Every fan has their own Michael Jordan, Kobe Bryant, or LeBron James at heart. In this case, Jupyter Notebook will be used to complete a preliminary data analysis of NBA rookies.

The data used in the case are the basic data of NBA players in 2017. The data fields are explained below:

  • rk: rank
  • player: player name
  • position: position
  • age: age
  • TEAM: team the player belongs to

2 Basic analysis of the case

See [nba_data.ipynb] for detailed analysis

Origin blog.csdn.net/weixin_52733693/article/details/127932544