Types of univariate graphs and basics of histogram drawing

A univariate chart is drawn from a single variable of a dataset. Which chart is appropriate depends on the type of that variable; variables are divided into continuous variables and discrete variables.

Types of univariate plots

1. Histogram plot

The histogram is a statistical graphic that shows the distribution and dispersion of data. It looks similar to a bar chart, but its meaning is quite different.

First, the data are divided into groups (bins); then the number of observations in each bin is counted; finally, a series of rectangles with equal width and varying heights represents the count in each bin. This idea of plotting frequency counts is also commonly used in some charts with color mapping.
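The three steps just described can be sketched with NumPy alone. This is a minimal illustration, not the author's code: the sample is randomly generated and the choice of 8 bins is arbitrary.

```python
import numpy as np

# Hypothetical sample: 1,000 draws from a normal distribution (seeded for repeatability).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.5, scale=0.15, size=1000)

# Steps 1 and 2: split the data range into 8 equal-width bins
# and count how many observations fall into each bin.
counts, edges = np.histogram(sample, bins=8)

# Step 3: each rectangle spans one bin and its height is that bin's count.
widths = np.diff(edges)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.3f} to {right:.3f}: {count}")
```

Every observation lands in exactly one bin, so the counts add up to the sample size.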

2. Density plot

As a variant of the histogram, the density plot (also known as the density curve) uses a curve to show the level of the values. The curve is smooth in most cases, but a step-like shape may appear depending on the kernel function. Its main purpose is to show the distribution of data over a continuous interval (such as a period of time).

Compared with histograms, density plots do not lose detail because of the choice of the number of bins, which helps users judge the overall shape of the data. Different kernel functions, however, produce different kernel density estimates. In scientific papers, the vertical axis of a density plot may show either count or density.
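As a rough sketch of how a density curve is computed, the snippet below uses scipy.stats.gaussian_kde on a randomly generated sample. Note the assumption: gaussian_kde only offers a Gaussian kernel, so instead of switching kernels the example varies the bandwidth, which changes the curve in a comparable way.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical sample (seeded for repeatability).
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

# Evaluate the estimates on a grid slightly wider than the data range.
grid = np.linspace(sample.min() - 1.0, sample.max() + 1.0, 400)

kde_default = gaussian_kde(sample)                 # default bandwidth (Scott's rule)
kde_narrow = gaussian_kde(sample, bw_method=0.1)   # smaller factor, wigglier curve

density = kde_default(grid)
density_narrow = kde_narrow(grid)

# A density curve encloses an area of about 1 (simple Riemann sum over the grid).
step = grid[1] - grid[0]
area = density.sum() * step
print(f"area under the default curve: {area:.3f}")
```

Either array can be drawn with ax.plot(grid, density) on a Matplotlib Axes.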

3. QQ plot (Quantile-Quantile plot, also known as quantile plot)

The QQ plot is essentially a probability plot, used to check whether data follow a given distribution. It compares two probability distributions by plotting their quantiles against each other. First, a set of quantile levels is chosen (equivalently, the number of intervals). Each point (x, y) on the plot then pairs a quantile of the first distribution (X-axis) with the same quantile of the second distribution (Y-axis), so the plot is a curve parameterized by the quantile level. If the two distributions are similar, the points fall close to the line y = x. If the two distributions are linearly related, the points still fall close to a straight line, though not necessarily y = x.

For example, the normal QQ plot is a scatter plot with the quantiles of the standard normal distribution on the horizontal axis and the sorted sample values on the vertical axis. To check a sample for normality, you only need to observe whether the points lie approximately on a straight line; for normal data, the slope of that line is approximately the standard deviation and the intercept is approximately the mean.
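The slope and intercept reading of a normal QQ plot can be checked numerically. The sketch below uses scipy.stats.probplot on synthetic data; the mean of 10 and standard deviation of 2 are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample with known mean 10 and standard deviation 2.
rng = np.random.default_rng(2)
sample = rng.normal(loc=10.0, scale=2.0, size=2000)

# probplot returns the QQ points plus a least-squares line fitted through them.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(
    sample, dist="norm"
)

print(f"slope     = {slope:.2f}  (close to the standard deviation)")
print(f"intercept = {intercept:.2f}  (close to the mean)")
print(f"r         = {r:.4f}  (near 1 when the points are almost collinear)")
```

Plotting ordered_values against theoretical_q as a scatter plot gives the QQ plot itself.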

The QQ plot can not only check whether sample data follow a given distribution, but also reveal properties of the data such as location, scale and skewness through the shape of the plotted points.

In general academic research, the frequency of using histograms or density plots to observe data distribution is much higher than that of QQ plots.

4. P-P plot (Probability-Probability plot)

The PP plot is drawn from the relationship between the cumulative probability of a variable and the cumulative probability of a specified theoretical distribution. It is used to check visually whether sample data follow a given probability distribution. When the sample follows the expected distribution, the points in the PP plot fall approximately on a straight line. Both the PP plot and the QQ plot check whether sample data follow a given distribution, but they compare different quantities: the PP plot compares cumulative probabilities, while the QQ plot compares quantiles.
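The points of a PP plot can be computed by hand, as in the sketch below. The data are synthetic, and the plotting positions (i - 0.5) / n are one common convention assumed here; other conventions exist.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample, sorted in ascending order (seeded for repeatability).
rng = np.random.default_rng(3)
sample = np.sort(rng.normal(loc=5.0, scale=1.5, size=1000))
n = sample.size

# Empirical cumulative probabilities at each sorted observation.
empirical_p = (np.arange(1, n + 1) - 0.5) / n

# Cumulative probabilities under a normal distribution fitted to the sample.
mu, sigma = norm.fit(sample)
theoretical_p = norm.cdf(sample, loc=mu, scale=sigma)

# When the sample matches the fitted distribution, the points
# (theoretical_p, empirical_p) stay close to the diagonal.
max_dev = np.max(np.abs(theoretical_p - empirical_p))
print(f"largest deviation from the diagonal: {max_dev:.3f}")
```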

5. Empirical Distribution Function (EDF)

In statistics, the empirical distribution function is also called the empirical cumulative distribution function (ECDF). It is the distribution function associated with the empirical measure of a sample. For a given value of the measured variable, the function gives the proportion of sample observations that are less than or equal to that value. A plot of the empirical distribution function can be used to check whether sample data follow an expected distribution.
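The definition translates directly into a few lines of NumPy. The helper names ecdf and ecdf_at, and the small data list, are hypothetical and only serve this illustration.

```python
import numpy as np

def ecdf(sample):
    """Return sorted values x and cumulative proportions y for an ECDF step plot."""
    x = np.sort(np.asarray(sample, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def ecdf_at(sample, value):
    """Proportion of observations that are less than or equal to value."""
    return np.mean(np.asarray(sample) <= value)

data = [3, 1, 4, 1, 5, 9, 2, 6]
x, y = ecdf(data)

# Five of the eight observations (1, 1, 2, 3, 4) are <= 4.
print(ecdf_at(data, 4))  # 0.625
```

The pair (x, y) can be drawn as a step function, for example with ax.step(x, y, where="post").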

Histogram

In Matplotlib, we can plot histograms using the axes.Axes.hist() function.

In axes.Axes.hist(), the parameter x is the sample data to be plotted, and the parameter bins defines the bins. The value of bins can be an integer, a sequence of numbers, or a string; the default is the integer 10. When bins is an integer, it defines the number of equal-width bins within the range. When bins is a sequence of numbers, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin.

Note that in this case the bins may not be equally spaced.

When bins is a string, it names a binning strategy; values such as "auto", "fd", "rice" and "sqrt" are available. The density parameter of axes.Axes.hist() is a Boolean that determines whether the histogram is normalized to a probability density. The default value is False.
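The forms of bins and the effect of density can be compared side by side. This is a small sketch on randomly generated data; the custom edges are arbitrary values chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample on [0, 1) (seeded for repeatability).
rng = np.random.default_rng(4)
data = rng.uniform(0.0, 1.0, size=500)

fig, axes = plt.subplots(1, 4, figsize=(12, 3))

# Integer: 5 equal-width bins, so 6 edges are returned.
counts_int, edges_int, _ = axes[0].hist(data, bins=5)

# Sequence: explicit bin edges, which may be unequally spaced.
custom_edges = [0.0, 0.1, 0.3, 0.6, 1.0]
counts_seq, edges_seq, _ = axes[1].hist(data, bins=custom_edges)

# String: let a named rule choose the bins.
counts_auto, edges_auto, _ = axes[2].hist(data, bins="auto")

# density=True: bar heights are densities, so the bar areas sum to 1.
dens, edges_dens, _ = axes[3].hist(data, bins=5, density=True)
total_area = (dens * np.diff(edges_dens)).sum()

fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards to view the four panels.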

The following are examples of histograms drawn using Matplotlib, ProPlot and SciencePlots respectively:

Panels (a) and (c) are both drawn with Matplotlib; (c) additionally uses the plotting theme from the SciencePlots package. The core drawing code of (a) is given below.

import numpy as np
import pandas as pd

hist_data = pd.read_excel(r"柱形图绘制数.xlsx")

# (a) Histogram drawn with Matplotlib
import matplotlib.pyplot as plt

plt.rcParams["font.family"] = "Times New Roman"
plt.rcParams["axes.linewidth"] = 1
plt.rcParams["axes.labelsize"] = 15
plt.rcParams["xtick.minor.visible"] = True
plt.rcParams["ytick.minor.visible"] = True
plt.rcParams["xtick.direction"] = "in"
plt.rcParams["ytick.direction"] = "in"
plt.rcParams["xtick.labelsize"] = 12
plt.rcParams["ytick.labelsize"] = 12
plt.rcParams["xtick.top"] = False
plt.rcParams["ytick.right"] = False

hist_x_data = hist_data["hist_data"].values
bins = np.arange(0.0,1.5,0.1)

fig,ax = plt.subplots(figsize=(4,3.5),dpi=100,facecolor="w")
hist = ax.hist(x=hist_x_data, bins=bins,color="#3F3F3F",
          edgecolor ='black',rwidth = 0.8)

ax.tick_params(axis="x",which="minor",top=False,bottom=False)
ax.set_xticks(np.arange(0,1.4,0.1))
ax.set_yticks(np.arange(0.,2500,400))
ax.set_xlim(-.05,1.3)
ax.set_ylim(0.0,2500)

ax.set_xlabel('Values', )
ax.set_ylabel('Frequency')

plt.show()

The core drawing code of (b) is as follows:

# (b) Histogram drawn with ProPlot

import proplot as pplt
from proplot import rc
rc["axes.labelsize"] = 15
rc['tick.labelsize'] = 12
rc["suptitle.size"] = 15

hist_x_data = hist_data["hist_data"].values
bins = np.arange(0.0,1.5,0.1)


fig = pplt.figure(figsize=(3.5,3))
ax = fig.subplot()
ax.format(abc='a.', abcloc='ur',abcsize=16,
          xlabel='Values', ylabel='Frequency',
          xlim = (-.05,1.3),ylim=(0,2500))
hist = ax.hist(x=hist_x_data, bins=bins,color="#3F3F3F",
               edgecolor ='black',rwidth = 0.8)

plt.show()

(c) To use the plotting theme in SciencePlots, the user only needs to wrap the drawing script in the following context manager (SciencePlots 2.0 and later also require import scienceplots so that the styles are registered).

with plt.style.context(['science']):

The core code is as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scienceplots  # registers the 'science' style (required for SciencePlots >= 2.0)

hist_x_data = hist_data["hist_data"].values
bins = np.arange(0.0,1.5,0.1)


with plt.style.context(['science']):
    fig,ax = plt.subplots(figsize=(4,3.5),dpi=100,facecolor="w")
    hist = ax.hist(x=hist_x_data, bins=bins,color="#3F3F3F",
                   edgecolor ='black',rwidth = 0.8)
    ax.set_xlim(-.05,1.4)
    ax.set_ylim(0.0,2500)
    ax.set_xlabel('Values', )
    ax.set_ylabel('Frequency')

plt.show()

Sometimes, to display necessary statistical information, we add a normal distribution curve, a mean line, a median line, and so on to the histogram, or mark the data points along the X-axis with short vertical lines.
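The mean line, median line and short vertical data marks can be sketched as follows. The data are synthetic, and the mark height of 3% of the tallest bar is an arbitrary choice made for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample (seeded for repeatability).
rng = np.random.default_rng(6)
data = rng.normal(loc=0.0, scale=1.0, size=200)

fig, ax = plt.subplots(figsize=(5, 3.5))
counts, edges, _ = ax.hist(data, bins=15, color="gray",
                           edgecolor="black", lw=0.5)

# Vertical reference lines at the mean and the median.
mean_line = ax.axvline(np.mean(data), ls="-", lw=1.2, color="r", label="Mean")
median_line = ax.axvline(np.median(data), ls="--", lw=1.2, color="b",
                         label="Median")

# One short vertical mark per observation, starting at the X-axis.
rug_height = 0.03 * counts.max()
rug = ax.vlines(data, ymin=0, ymax=rug_height, colors="k", linewidth=0.5)

ax.set_xlabel("Values")
ax.set_ylabel("Count")
ax.legend(frameon=False)
```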

An example of a histogram drawn by Matplotlib with a normal distribution curve and a median line added is as follows:

The difficulty in drawing a histogram with statistical information lies in computing and drawing the normal distribution curve. We can use scipy.stats.norm to fit a normal distribution to the plotted data and compute the probability density function (PDF).

The probability density function is normalized, that is, the area under the curve is 1, whereas the total area of a count histogram is the number of samples multiplied by the width of each bin. Multiplying the PDF values by the number of samples and the bin width therefore scales the curve to the height of the histogram.
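The scaling argument can be verified numerically without drawing anything. The sketch below uses a synthetic sample and compares the area of the count histogram with the area under the scaled PDF.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical sample (seeded for repeatability).
rng = np.random.default_rng(5)
sample = rng.normal(loc=50.0, scale=8.0, size=3000)
n = sample.size
bins = 15

# Count histogram: total area = number of samples * bin width.
counts, edges = np.histogram(sample, bins=bins)
bin_width = edges[1] - edges[0]
hist_area = counts.sum() * bin_width

# Fitted normal PDF, scaled from unit area to the histogram's area.
mu, std = norm.fit(sample)
x = np.linspace(edges[0], edges[-1], 200)
scaled_pdf = norm.pdf(x, mu, std) * n * bin_width

# Area under the scaled curve over the data range (trapezoidal rule).
curve_area = np.sum((scaled_pdf[1:] + scaled_pdf[:-1]) / 2 * np.diff(x))

print(f"histogram area: {hist_area:.1f}")
print(f"curve area:     {curve_area:.1f}")
```

The two areas agree closely; the small gap comes from the tails of the normal curve that lie outside the observed data range.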

The code for drawing the above picture is as follows:

import numpy as np
import pandas as pd

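# The data file is an Excel workbook, so it is read with read_excel;
# the name hist_data02 matches the references below.
hist_data02 = pd.read_excel(r"直方图绘制02.xlsx")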

hist_x_data = hist_data02["hist_data"].values
X_mean = np.mean(hist_x_data)


# Figure 3-2-2: example of a histogram with statistical information
from scipy.stats import norm
import matplotlib.pyplot as plt

bins=15
hist_x_data = hist_data02["hist_data"].values

Median = np.median(hist_x_data)

mu, std = norm.fit(hist_x_data)

fig,ax = plt.subplots(figsize=(5,3.5),dpi=100,facecolor="w")
hist = ax.hist(x=hist_x_data, bins=bins,color="gray",
               edgecolor ='black',lw=.5)
# Plot the PDF.
xmin, xmax = min(hist_x_data),max(hist_x_data)
x = np.linspace(xmin, xmax, 100) # 100 points is an arbitrary choice; more points give a smoother curve
p = norm.pdf(x, mu, std)
N = len(hist_x_data)
bin_width = (x.max() - x.min()) / bins
ax.plot(x, p*N*bin_width,linewidth=1,color="r",label="Normal Distribution Curve")

# Add the median line
ax.axvline(x=Median,ls="--",lw=1.2,color="b",label="Median Line")
ax.set_xlabel('Values')
ax.set_ylabel('Count')
ax.legend(frameon=False)

plt.show()

Below are examples of histograms with statistical information drawn using ProPlot and SciencePlots.

The "a." label in panel (a) is a subplot index, which can be added as needed. Besides the methods above, we can also draw histograms with the histplot() function in Seaborn, which is more flexible to use.

# (a) Example of a histogram with statistical information drawn with ProPlot
from scipy.stats import norm
from proplot import rc

rc["axes.labelsize"] = 15
rc['tick.labelsize'] = 12
rc["suptitle.size"] = 15


bins=15
hist_x_data = hist_data["hist_data"].values
Median = np.median(hist_x_data)
mu, std = norm.fit(hist_x_data)

fig = pplt.figure(figsize=(3.5,3))
ax = fig.subplot()
ax.format(abc='a.', abcloc='ur',abcsize=16,
          xlabel='Values', ylabel='Count')

hist = ax.hist(x=hist_x_data, bins=bins,color="gray",
               edgecolor ='black',lw=.5)
# Plot the PDF.
xmin, xmax = min(hist_x_data),max(hist_x_data)
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
N = len(hist_x_data)
bin_width = (x.max() - x.min()) / bins
ax.plot(x, p*N*bin_width,linewidth=1,color="r",label="Normal Distribution Curve")
# Add the median line
ax.axvline(x=Median,ls="--",lw=1.2,color="b",label="Median Line")
ax.legend(ncols=1,frameon=False,loc="ur")
plt.show()

# (b) Example of a histogram with statistical information drawn with SciencePlots

from scipy.stats import norm

bins=15
hist_x_data = hist_data["hist_data"].values
Median = np.median(hist_x_data)
mu, std = norm.fit(hist_x_data)
xmin, xmax = min(hist_x_data),max(hist_x_data)
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
N = len(hist_x_data)
bin_width = (x.max() - x.min()) / bins

with plt.style.context(['science']):
    fig,ax = plt.subplots(figsize=(4,3.5),dpi=100,facecolor="w")
    hist = ax.hist(x=hist_x_data, bins=bins,color="gray",
                   edgecolor ='black',lw=.5)
    ax.plot(x, p*N*bin_width,linewidth=1,color="r",label="Normal Distribution Curve")

    # Add the median line
    ax.axvline(x=Median,ls="--",lw=1.2,color="b",label="Median Line")
    ax.set_xlabel('Values')
    ax.set_ylabel('Count')
    ax.legend(frameon=False)

plt.show()

Reference books: Ning Haitao. Guide to illustrating scientific research papers - based on Python [M]. Beijing: People's Posts and Telecommunications Press, 2023: 47-49.

Origin blog.csdn.net/m0_52316372/article/details/132570799