[Python Data Science Quick Start Series | 10] Matplotlib Data Distribution Chart Application Summary

This is the 59th post of Machine Future

Original address: https://robotsfutures.blog.csdn.net/article/details/127484292


Preface:

  • Blog introduction: Focused on the AIoT field, following the pulse of the coming era and recording technical growth along the way!
  • Blogger community: AIoT machine intelligence, welcome to join!
  • Column introduction: Master the commonly used data science libraries NumPy, Matplotlib, and Pandas from 0 to 1.
  • Target audience: AI beginners

1. Overview

This article summarizes commonly used data distribution charts. Data distribution charts emphasize the values in a data set and their frequency or distribution. Common ones include histograms, kernel density curves, box plots, and violin plots.

2. Commonly used data distribution chart applications

2.1 Statistical histogram

A histogram, also known as a quality distribution chart, is a statistical chart in which a series of vertical bars of varying heights represents the distribution of data. The horizontal axis generally represents the data range and the vertical axis the frequency. It is used for continuous data, to show the distribution of one or more data sets.

Building a histogram follows a few statistical steps. First find the maximum and minimum of the data and choose an interval that contains all the measurements. Then divide this interval into several smaller intervals (bins) and count the measurements that fall into each one. In the coordinate system, the horizontal axis marks the endpoints of each bin, the vertical axis represents the frequency, and the height of each rectangle is the frequency of its bin. Such a chart is called a frequency distribution histogram.
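The construction steps described above can be sketched with NumPy. This is a minimal illustration, not the author's code: the helper name frequency_table and the ten sample values are made up for the example, and the bins are assumed to have equal width.

```python
import numpy as np

def frequency_table(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()        # step 1: find the data range
    edges = np.linspace(lo, hi, n_bins + 1)    # step 2: split it into equal intervals
    # step 3: find the interval of each value; the maximum is put in the last interval
    idx = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)  # step 4: count per interval
    return edges, counts

# Hypothetical measurements, made up for illustration only
sample = [4.9, 5.0, 5.1, 5.4, 5.8, 6.0, 6.1, 6.4, 6.9, 7.7]
edges, counts = frequency_table(sample, n_bins=4)
print(edges)   # interval endpoints (horizontal axis)
print(counts)  # frequency of each interval (bar heights)
```

For these values the counts agree with np.histogram(sample, bins=4), which performs the same equal-width binning.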

The main uses of the histogram are:

  • It displays the distribution of the data, i.e. the frequency of each group;
  • It makes differences in frequency or count between groups easy to see. A histogram also lets you observe and estimate where the data is concentrated and where abnormal or isolated values lie.

Compared with the bar chart:
The bar chart uses the length of each rectangle to represent the frequency or quantity of a category, and its width is fixed and carries no meaning; it suits the analysis of smaller data sets.
The histogram uses the length of each rectangle to represent the frequency or quantity of a group, and the width represents the bin width of that group. Both height and width are therefore meaningful, which suits displaying statistics for large data sets.
Because grouped data is continuous, the rectangles of a histogram are usually drawn adjacent to one another, while the bars of a bar chart are drawn apart.

import numpy as np
from matplotlib import pyplot as plt

"""
    Load the iris dataset
"""

data = []
column_name = []
with open(file='iris.txt', mode='r') as f:
    # Read the header line to get the column names
    line = f.readline()
    if line:
        column_name = np.array(line.strip().split(','))

    # Read the remaining lines as data rows, skipping blank lines
    for line in f:
        if line.strip():
            data.append(line.strip().split(','))

data = np.array(data, dtype=float)

# Slice out the first 4 columns as the feature data
X_data = data[:, :4]  # or X_data = data[:, :-1]

# Slice out the last column as the label data
y_data = data[:, -1]

print(data.shape, X_data.shape, y_data.shape)
# Output: (150, 5) (150, 4) (150,)

"""
Show how the different iris features are distributed
"""

# Font setup, only needed when labels contain Chinese characters:
# use SimHei on Windows, WenQuanYi Micro Hei on Ubuntu
plt.rcParams["font.sans-serif"]=["WenQuanYi Micro Hei"]  # set the font
plt.rcParams["axes.unicode_minus"]=False  # render the minus sign "-" correctly

fig, ax = plt.subplots(figsize=(12,9))
ax.hist(X_data[:, 0], bins=16, alpha=0.7, density=True, label="Sepal length")
ax.hist(X_data[:, 1], bins=16, alpha=0.7, density=True, label="Sepal width")
ax.hist(X_data[:, 2], bins=16, alpha=0.7, density=True, label="Petal length")
ax.hist(X_data[:, 3], bins=16, alpha=0.7, density=True, label="Petal width")
ax.legend()

plt.show()



Parameters of ax.hist:
x - the data set
bins - the number of groups (bins); together with the data range it determines the bin width
alpha - the transparency of the bars; with a value below 1, several overlapping histograms stay visible at the same time
density - when True, the vertical axis shows density instead of frequency, so that the bar areas (height times bin width) sum to 1
label - the name of the data series shown in the legend
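The effect of density=True can be checked numerically. The sketch below uses np.histogram, which follows the same binning and density convention; the data is a synthetic stand-in generated for the example, not the iris measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=5.8, scale=0.8, size=150)  # synthetic stand-in data

counts, edges = np.histogram(values, bins=16)             # plain frequencies
density, _ = np.histogram(values, bins=16, density=True)  # density heights
widths = np.diff(edges)                                   # bin widths

# density = count / (n * bin width), so the bar areas add up to 1
manual_density = counts / (counts.sum() * widths)
total_area = float(np.sum(density * widths))
print(total_area)
```

Because the heights are rescaled per data set, histograms of features with different sample sizes or ranges become comparable on one axis.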

2.2 Kernel density estimation curve

The kernel density estimation (KDE) chart shows the distribution of data over a continuous range on the x-axis. It is a variant of the histogram that draws a smooth curve instead of bars, giving a smoother picture of the distribution. Its advantage over the histogram is that it does not depend on the number of bins, so it describes the shape of the distribution better. The smoothness is controlled by the kernel bandwidth instead.
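To make the idea concrete, here is a minimal NumPy sketch of a Gaussian kernel density estimate. It is an illustration under stated assumptions, not how seaborn implements it: the function name gaussian_kde_curve, the two bandwidth values, and the synthetic sample are all chosen for the example.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(5.8, 0.8, size=150)  # synthetic stand-in data
grid = np.linspace(samples.min() - 3, samples.max() + 3, 500)

smooth = gaussian_kde_curve(samples, grid, bandwidth=0.6)  # wide kernel, smooth curve
wiggly = gaussian_kde_curve(samples, grid, bandwidth=0.1)  # narrow kernel, bumpy curve

# Both curves are densities: the area under each is approximately 1
dx = grid[1] - grid[0]
area_smooth = float(smooth.sum() * dx)
area_wiggly = float(wiggly.sum() * dx)
print(area_smooth, area_wiggly)
```

The bandwidth plays a role similar to the bin width of a histogram: too small and the curve follows noise, too large and real structure is smoothed away.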

import seaborn as sns

fig, ax = plt.subplots(figsize=(12, 9))

"""
Show how the different iris features are distributed
"""
# Font setup, only needed when labels contain Chinese characters
plt.rcParams["font.sans-serif"]=["WenQuanYi Micro Hei"]  # set the font
plt.rcParams["axes.unicode_minus"]=False  # render the minus sign "-" correctly

ax.hist(X_data[:, 0], bins=16, alpha=0.7, density=True, color='hotpink', label="Sepal length")
ax.hist(X_data[:, 1], bins=16, alpha=0.7, density=True, color='m', label="Sepal width")
ax.hist(X_data[:, 2], bins=16, alpha=0.7, density=True, color='green', label="Petal length")
ax.hist(X_data[:, 3], bins=16, alpha=0.7, density=True, color='b', label="Petal width")

sns.kdeplot(X_data[:, 0], ax=ax, color='hotpink')
sns.kdeplot(X_data[:, 1], ax=ax, color='m')
sns.kdeplot(X_data[:, 2], ax=ax, color='green')
sns.kdeplot(X_data[:, 3], ax=ax, color='b')

ax.legend()
plt.show()



2.3 Box plot

The biggest advantage of the box plot is that it is built on quartiles and is therefore little affected by outliers. It describes the dispersion of the data accurately and stably, and it also helps with data cleaning.

Box plots (also known as box-and-whisker plots) were invented in 1977 by the famous American statistician John Tukey. A box plot displays the maximum, minimum, median, and upper and lower quartiles of a set of data.

In a box plot, a box is drawn from the upper quartile to the lower quartile, and a vertical line (vividly called a "whisker") runs through the middle of the box. The upper whisker extends to the upper edge (maximum) and the lower whisker to the lower edge (minimum). With Matplotlib's default settings the whiskers stop at the most extreme data points within 1.5 times the interquartile range of the box, and points beyond them are drawn separately as outliers.

The structure of a box plot is as follows:

Because the box runs from the lower quartile to the upper quartile, it contains the middle 50% of the data. The height of the box therefore reflects the volatility of the data to some extent.
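The quantities that define the box and the whiskers can be computed by hand. The sketch below assumes the convention Matplotlib uses by default (whiskers limited to 1.5 times the interquartile range); the helper name box_stats and the ten sample values are made up for illustration.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outliers as a box plot would draw them."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1                                  # height of the box
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside.min(),    # lowest point still inside the fences
        "whisker_high": inside.max(),   # highest point still inside the fences
        "outliers": np.sort(outliers),  # drawn as separate circles
    }

# Hypothetical data with one obviously extreme value
demo = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_stats(demo)
print(summary)
```

Here the value 30 lies above the upper fence, so the upper whisker ends at 9 and 30 is reported as an outlier, while the median and quartiles are barely influenced by it.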

Relationship between Boxplot and Normal Distribution
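For a normal distribution this relationship can be computed directly. The sketch below uses only the Python standard library and assumes the usual 1.5 times interquartile range rule for the whisker limits; it shows where the box and the fences fall in units of the standard deviation.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = std_normal.inv_cdf(0.25)   # lower quartile, about -0.674 sigma
q3 = std_normal.inv_cdf(0.75)   # upper quartile, about +0.674 sigma
iqr = q3 - q1                   # box height, about 1.349 sigma

low_fence = q1 - 1.5 * iqr      # about -2.698 sigma
high_fence = q3 + 1.5 * iqr     # about +2.698 sigma

inside_box = std_normal.cdf(q3) - std_normal.cdf(q1)
inside_fences = std_normal.cdf(high_fence) - std_normal.cdf(low_fence)

print(f"box covers {inside_box:.1%} of the data")
print(f"fences cover {inside_fences:.2%}, so about {1 - inside_fences:.2%} are flagged as outliers")
```

In other words, for normally distributed data well under 1% of the points are expected beyond the whiskers, which is why many outliers hint at a heavy-tailed or skewed distribution.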

Application scenario of box plots: draw box plots grouped by a qualitative variable for comparison.

2.3.1 Example analysis

A qualitative variable is simply a categorical one. As a simple example, take the iris data set: for a single feature, the box plot shows the value distribution, median, degree of fluctuation, and outliers within each category of the data set.

# Box plot
# Inspect the value distribution of each feature within each class; circles mark outliers

print(X_data[y_data==0].shape, column_name.shape, y_data.shape)
# Output: (50, 4) (5,) (150,)

# Setosa, Versicolor, Virginica
label_class_name = np.array(['Setosa', 'Versicolor', 'Virginica'])

# One fill color per box, same length as the number of classes
colors = ['pink', 'lightblue', 'lightgreen']

fig, ax = plt.subplots(1, 4, figsize=(15, 6))

for i in range(4):
    # Label the y-axis with the feature name
    ax[i].set_ylabel(column_name[i])
    # Group the values of feature i by class label
    X_data_p = [X_data[y_data == 0][:, i], X_data[y_data == 1][:, i], X_data[y_data == 2][:, i]]
    # Draw the box plot; patch_artist=True makes the boxes fillable
    bplot = ax[i].boxplot(X_data_p, patch_artist=True)
    # Set the x tick labels afterwards (the labels= keyword of boxplot is deprecated in recent Matplotlib)
    ax[i].set_xticklabels(label_class_name)

    # zip pairs each box with its color
    for patch, color in zip(bplot['boxes'], colors):
        patch.set_facecolor(color)

# Adjust the horizontal spacing between subplots
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.3, hspace=0.0)
plt.show()


In the figure above, the data is grouped qualitatively by iris class, with one panel per feature. The boxes of the different classes sit at clearly different positions. Each box covers the 50% of the feature values closest to the median, and its upper and lower edges are the upper and lower quartiles. The distance between the quartiles gives an intuitive view of how much each feature fluctuates, and outliers appear as hollow circles.

Looking at the leftmost panel, which relates sepal length (sepal_length) to the iris class label:

  • Sepal length has the smallest median in Setosa, a larger median in Versicolor, and the largest median in Virginica;
  • The length of the box shows how concentrated each group is. Sepal length is relatively concentrated in Setosa and more scattered in Versicolor and Virginica;
  • The positions of the median and the box show the shape of each group's distribution. Sepal length looks roughly symmetric (close to normal) in Setosa and right-skewed in Versicolor and Virginica.

2.3.2 Value of Box Plots

  • Intuitively identify outliers in data batches.
  • Use boxplots to determine the skewness and tail weight of a data batch.

For a sample from a standard normal distribution, only very few values are outliers. The more outliers there are, the heavier the tail and the smaller the degrees of freedom (that is, the number of quantities free to vary). Skewness indicates the degree of deviation from symmetry: if the outliers are concentrated on the side of smaller values, the distribution is left-skewed; if they are concentrated on the side of larger values, it is right-skewed.

  • Use box plots to compare the shape of several batches of data.

On the same number axis, the box plots of several batches of data are arranged in parallel, and the shape information such as the median, tail length, outliers, and distribution intervals of several batches of data can be seen at a glance.
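The link between the side on which outliers pile up and the direction of skew can be illustrated numerically. This is a sketch on synthetic data: the helper outlier_sides is hypothetical, the lognormal sample merely stands in for a right-skewed batch, and the 1.5 times interquartile range rule is assumed.

```python
import numpy as np

def outlier_sides(values, whis=1.5):
    """Return (n_low, n_high): counts of points beyond the lower and upper fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    n_low = int(np.sum(values < q1 - whis * iqr))
    n_high = int(np.sum(values > q3 + whis * iqr))
    return n_low, n_high

rng = np.random.default_rng(0)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # synthetic, long right tail
left_skewed = -right_skewed                                   # mirrored, long left tail

n_low, n_high = outlier_sides(right_skewed)
m_low, m_high = outlier_sides(left_skewed)
print("right-skewed sample:", n_low, "low outliers,", n_high, "high outliers")
print("left-skewed sample: ", m_low, "low outliers,", m_high, "high outliers")
```

Outliers on the high side only point to a right-skewed batch, and the mirrored sample shows the opposite pattern.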

2.3.3 Selection of box plots and histograms

Suppose we want to compare the teaching evaluation scores of male and female teachers. Which tool is best? The answer is the box plot. A side-by-side comparison makes it clear that the box plot is the more effective tool here: it compares the scores of the two groups in terms of typical level (median), degree of fluctuation (box height), and outliers, which a histogram cannot do as directly.

2.3.4 Usage summary

  • The box plot is meant for continuous variables. When reading one, focus on the typical level (median), the degree of fluctuation, and the outliers.

  • When the box is squashed very flat or there are many outliers, try a logarithmic transformation. When there is only a single continuous variable, a box plot is not a good fit and a histogram is the more common choice.

  • The most effective use of the box plot is comparison: draw grouped box plots against one or more qualitative variables.
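The logarithmic transformation mentioned in the list above can be tried out on synthetic data. In this sketch the measure box_share_of_range (box height divided by the full data range) is a made-up indicator of how "squashed" the box looks, and the lognormal sample is generated for the example; the transformation requires strictly positive values.

```python
import numpy as np

def box_share_of_range(values):
    """Interquartile range divided by the full range; small means a flat box."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    return float((q3 - q1) / (values.max() - values.min()))

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.5, size=1000)  # synthetic, strictly positive
logged = np.log(raw)                                 # log is only defined for positive values

share_raw = box_share_of_range(raw)
share_logged = box_share_of_range(logged)
print(f"box share of range before log: {share_raw:.3f}")
print(f"box share of range after log:  {share_logged:.3f}")
```

After the transformation the box takes up a much larger share of the plotted range, so the quartiles and the median become readable again.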

Limitations of box plots:

  • They cannot precisely measure the degree of skewness or tail weight of a distribution;
  • For large batches of data the information they convey is coarser, and using the median to represent the overall level has its limits.

2.4 Violin plot

The violin plot can be understood as a box plot plus a kernel density estimation curve.

A violin plot is a chart that shows both the distribution of the data and its probability density. It combines the features of box plots and density plots: it is similar to a box plot, except that it also shows the probability density of the data at different values.

The violin plot uses kernel density estimation (KDE) to compute the distribution of the sample. Elements of the chart can include the median, the interquartile range, and a confidence interval. It is especially suitable when the amount of data is very large and showing individual points is impractical. Matplotlib's violinplot draws only the density outline and the extrema by default; showmedians=True adds the median line.

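# Violin plot
# Inspect the value distribution of each feature within each class

print(X_data[y_data==0].shape, column_name.shape, y_data.shape)
# Output: (50, 4) (5,) (150,)

# Setosa, Versicolor, Virginica
label_class_name = np.array(['Setosa', 'Versicolor', 'Virginica'])

fig, ax = plt.subplots(1, 4, figsize=(15, 6))

for i in range(4):
    # Label the y-axis with the feature name
    ax[i].set_ylabel(column_name[i])
    # Group the values of feature i by class label
    X_data_p = [X_data[y_data == 0][:, i], X_data[y_data == 1][:, i], X_data[y_data == 2][:, i]]
    # Draw the violin plot with a line at each median
    vplot = ax[i].violinplot(X_data_p, showmedians=True)
    # violinplot places the violins at x = 1, 2, 3; label them with the class names
    ax[i].set_xticks([1, 2, 3])
    ax[i].set_xticklabels(label_class_name)

# Adjust the horizontal spacing between subplots
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=0.3, hspace=0.0)
plt.show()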
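Since a violin plot is described here as a box plot plus a density curve, one way to see both at once is to draw a narrow box plot on top of each violin. The sketch below is one possible approach, not the author's method: it uses three synthetic groups instead of the iris file, and it selects the non-interactive Agg backend so that it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Three synthetic groups standing in for one feature split by class
groups = [rng.normal(5.0, 0.35, 50),
          rng.normal(5.9, 0.50, 50),
          rng.normal(6.6, 0.60, 50)]

fig, ax = plt.subplots(figsize=(6, 4))

# Density outline only: the box plot below supplies median and quartiles
parts = ax.violinplot(groups, showextrema=False)
for body in parts["bodies"]:
    body.set_alpha(0.4)  # keep the box visible through the violin

# Narrow box plot at the same x positions (1, 2, 3 by default)
box = ax.boxplot(groups, widths=0.15)

ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["group A", "group B", "group C"])
ax.set_ylabel("value")
# fig.savefig("violin_with_box.png") would write the figure to disk
```

The violin shows where values are dense, while the overlaid box marks the median, the quartiles, and any outliers.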


3. Summary

All of the data distribution charts above are suited to showing the distribution of continuous data.

The histogram shows the distribution of the data, i.e. the frequency of each group, and makes differences in frequency or count between groups easy to see. It also lets you observe and estimate where the data is concentrated and where abnormal or isolated values lie.

The kernel density estimation curve has the advantage over the histogram that it does not depend on the number of bins, so it describes the shape of the distribution better.

The box plot describes the dispersion of the data accurately and stably, makes outliers, data fluctuation, and distribution patterns easy to identify, and is typically used for comparative analysis across qualitative groups.

The violin plot can be understood as a box plot combined with a kernel density estimation curve. Besides the advantages of the box plot, it shows the probability density of the data at different values, which makes it especially suitable for large amounts of data.


Origin blog.csdn.net/RobotFutures/article/details/127484292