Introduction: Histograms and bar charts are both very common charts in data analysis. Because the two look so similar, they are easily confused. In the earlier article "What is the difference between a column chart, a stacked column chart, and a waterfall chart? How do you draw them with Python?" we covered the bar chart; today we turn to the histogram.
Author: Qu Xifeng, senior Python engineer, Zhihu columnist
Source: Big Data DT (ID: hzdashuju)
01 Overview
A histogram is similar in shape to a bar chart but has a completely different meaning. It rests on a statistical idea: the data is first divided into groups, and then the number of data elements in each group is counted. In a rectangular coordinate system, the horizontal axis marks the endpoints of each group, the vertical axis represents the frequency (defined below as a group's count divided by its group distance), and the height of each rectangle is the frequency of the corresponding group. Such a chart is called a frequency distribution histogram.
To read the number of elements in each group from a frequency distribution histogram, the frequency has to be multiplied by the group distance. Because the group distance is constant within one histogram, the vertical axis can show the count directly instead, so that the height of each rectangle is the number of data elements in that group. This leaves the shape of the distribution unchanged and makes the count of each group easy to read, as shown in Figure 2-58.
▲Figure 2-58 Histogram
A histogram also helps you see where the data is concentrated and where abnormal or isolated values lie.
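As a rough illustration of reading concentration and isolated values from group counts, here is a minimal sketch. The data values and the "isolated group" rule are assumptions made up for this example, not a method from the book.

```python
import numpy as np

# Hypothetical measurements: most values lie near 5, one lies far away.
data = np.array([4.8, 5.0, 5.1, 5.2, 5.3, 5.5, 5.6, 6.0, 6.2, 11.9])

# Eight groups of width 1 covering the interval [4, 12].
counts, edges = np.histogram(data, bins=8, range=(4, 12))

# The group with the most elements shows where the data is concentrated.
peak = int(np.argmax(counts))
peak_interval = (edges[peak], edges[peak + 1])

# Crude heuristic (an assumption, not a standard rule): a non-empty group
# whose neighbouring groups are all empty is treated as isolated.
padded = np.concatenate(([0], counts, [0]))
isolated = [i for i in range(len(counts))
            if counts[i] > 0 and padded[i] == 0 and padded[i + 2] == 0]

print("counts per group:", counts.tolist())
print("densest group:", peak_interval)
print("isolated group indices:", isolated)
```

On a real data set the choice of group count and range changes what looks isolated, so a result like this is best treated as a prompt to look at the chart, not as a test for outliers.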
First, understand the following basic concepts.
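Number of groups: When summarizing data, we divide it into several groups by value range; the number of groups obtained is called the number of groups.
Group distance: The difference between the two endpoints of a group (also called the class width or bin width).
Frequency: As used here, the number of data elements in a group divided by the group distance. (This quantity is often called the frequency density; the plain count of elements in a group is what is more commonly called the frequency.)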
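The three concepts above can be sketched with NumPy. The sample data and variable names below are assumptions chosen for illustration; np.histogram is used only as a convenient way to do the grouping and counting.

```python
import numpy as np

# Hypothetical sample data, chosen so the group endpoints are round numbers.
data = np.array([0, 1, 2, 3, 3.5, 4, 5, 5.5, 7, 8])

num_groups = 4                                   # number of groups
counts, edges = np.histogram(data, bins=num_groups)

group_distance = np.diff(edges)                  # right endpoint minus left endpoint
frequency = counts / group_distance              # "frequency" as defined in this article

# Multiplying the frequency by the group distance recovers the counts.
recovered = frequency * group_distance

print("endpoints:     ", edges.tolist())
print("counts:        ", counts.tolist())
print("group distance:", group_distance.tolist())
print("frequency:     ", frequency.tolist())
```

Because the group distance is the same for every group, the counts and the frequencies differ only by a constant factor, which is why plotting either one gives the same shape.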
02 Examples
A sample histogram code is shown below.
Code Example 2-45
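from bokeh.plotting import figure, show

# Bokeh 3.x renamed plot_width/plot_height to width/height.
plot = figure(plot_width=300, plot_height=300)
# Each rectangle is given by its four edges: top, bottom, left, and right.
plot.quad(top=[2, 3, 4], bottom=[1, 2, 3], left=[1, 2, 3],
          right=[1.2, 2.5, 3.7], color="#B3DE69")
show(plot)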
The running result is shown in Figure 2-59.
▲Figure 2-59 Code example 2-45 Running result
In Code Example 2-45, the quad() method draws the histogram by specifying the four edges (left, right, top, and bottom) of each rectangle. The parameters are described below.
Parameters of p.quad(left, right, top, bottom, **kwargs):
left (NumberSpec): x-coordinates of the left edges of the rectangles
right (NumberSpec): x-coordinates of the right edges of the rectangles
top (NumberSpec): y-coordinates of the top edges of the rectangles
bottom (NumberSpec): y-coordinates of the bottom edges of the rectangles
Other parameters (**kwargs):
alpha (float): sets the transparency of all lines and fills at once
color (Color): sets the color of all lines and fills at once
source (ColumnDataSource): Bokeh's own data container (similar to a pandas DataFrame)
legend (str): legend label for the glyph (newer Bokeh releases use legend_label instead)
x_range_name (str): name of the x-axis range to use
y_range_name (str): name of the y-axis range to use
level (Enum): rendering level of the glyph
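To show how the source parameter fits together with the four edge parameters, here is a minimal sketch that puts the group endpoints and counts in a ColumnDataSource and passes column names to quad(). The sample data and the column names left, right, and count are assumptions made up for this example.

```python
import numpy as np
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Hypothetical sample data, grouped into four groups.
data = np.array([0, 1, 2, 3, 3.5, 4, 5, 5.5, 7, 8])
counts, edges = np.histogram(data, bins=4)

# One row per rectangle; the column names are arbitrary.
source = ColumnDataSource(data=dict(
    left=edges[:-1],
    right=edges[1:],
    count=counts,
))

p = figure(title="Counts per group")
# Strings refer to columns of the source; bottom=0 is a fixed value.
renderer = p.quad(top="count", bottom=0, left="left", right="right",
                  source=source, fill_color="#B3DE69", line_color="white")
# Pass p to show() from bokeh.plotting to display the figure.
```

Keeping the data in a ColumnDataSource means the same columns can later be shared with other glyphs or tools on the plot.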
Code Example 2-46
import numpy as np
import scipy.special
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show

# Note: legend=, plot_width= and plot_height= are the names used by the Bokeh
# release this example targets; Bokeh 3.x uses legend_label=, width= and height=.

# Plotting function
def make_plot(title, hist, edges, x, pdf, cdf):
    p = figure(title=title, tools='', background_fill_color="#fafafa")
    # Histogram bars: one rectangle per bin, from 0 up to the bin's density
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="navy", line_color="white", alpha=0.5)
    p.line(x, pdf, line_color="#ff8888", line_width=4, alpha=0.7, legend="PDF")
    p.line(x, cdf, line_color="orange", line_width=2, alpha=0.7, legend="CDF")

    p.y_range.start = 0
    p.legend.location = "center_right"
    p.legend.background_fill_color = "#fefefe"
    p.xaxis.axis_label = 'x'
    p.yaxis.axis_label = 'Pr(x)'
    p.grid.grid_line_color = "white"
    return p

# Normal distribution
mu, sigma = 0, 0.5
measured = np.random.normal(mu, sigma, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)
x = np.linspace(-2, 2, 1000)
# Fitted curves (theoretical PDF and CDF)
pdf = 1/(sigma * np.sqrt(2*np.pi)) * np.exp(-(x-mu)**2 / (2*sigma**2))
cdf = (1+scipy.special.erf((x-mu)/np.sqrt(2*sigma**2)))/2
p1 = make_plot("Normal Distribution (μ=0, σ=0.5)", hist, edges, x, pdf, cdf)

# Log-normal distribution
mu, sigma = 0, 0.5
measured = np.random.lognormal(mu, sigma, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)
x = np.linspace(0.0001, 8.0, 1000)
pdf = 1/(x * sigma * np.sqrt(2*np.pi)) * np.exp(-(np.log(x)-mu)**2 / (2*sigma**2))
cdf = (1+scipy.special.erf((np.log(x)-mu)/(np.sqrt(2)*sigma)))/2
p2 = make_plot("Log Normal Distribution (μ=0, σ=0.5)", hist, edges, x, pdf, cdf)

# Gamma distribution
k, theta = 7.5, 1.0
measured = np.random.gamma(k, theta, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)
x = np.linspace(0.0001, 20.0, 1000)
pdf = x**(k-1) * np.exp(-x/theta) / (theta**k * scipy.special.gamma(k))
cdf = scipy.special.gammainc(k, x/theta)
p3 = make_plot("Gamma Distribution (k=7.5, θ=1)", hist, edges, x, pdf, cdf)

# Weibull distribution (sampled by inverse transform from uniform values)
lam, k = 1, 1.25
measured = lam*(-np.log(np.random.uniform(0, 1, 1000)))**(1/k)
hist, edges = np.histogram(measured, density=True, bins=50)
x = np.linspace(0.0001, 8, 1000)
pdf = (k/lam)*(x/lam)**(k-1) * np.exp(-(x/lam)**k)
cdf = 1 - np.exp(-(x/lam)**k)
p4 = make_plot("Weibull Distribution (λ=1, k=1.25)", hist, edges, x, pdf, cdf)

# Display the four plots in a 2x2 grid
show(gridplot([p1, p2, p3, p4], ncols=2, plot_width=400, plot_height=400, toolbar_location=None))
The running result is shown in Figure 2-60.
▲Figure 2-60 Code example 2-46 Running result
Code Example 2-46 defines a custom plotting function, make_plot(title, hist, edges, x, pdf, cdf). Its parameters are the title of the plot, the top edges of the histogram rectangles (hist), the left and right edges of the rectangles (edges), the x-coordinates of the fitted curves, the values of the probability density function (PDF), and the values of the cumulative distribution function (CDF). As before, the histogram itself is drawn with quad() by specifying the four edges of each rectangle. The final statement uses the gridplot() method to display four plots at once (normal, log-normal, gamma, and Weibull distributions).
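The PDF curve can be drawn on the same axes as the histogram because np.histogram is called with density=True, which scales the bar heights so that the total area of the rectangles is 1. The following minimal sketch checks this; the fixed random seed is an assumption added only to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, an assumption for repeatability
measured = rng.normal(0, 0.5, 1000)

# Bar heights as densities, as in Code Example 2-46.
hist, edges = np.histogram(measured, density=True, bins=50)
widths = np.diff(edges)
area = np.sum(hist * widths)            # total area of all rectangles

# The same grouping with plain counts on the vertical axis.
counts, _ = np.histogram(measured, bins=50)

# density=True is equivalent to count / (total count * bin width).
manual_density = counts / (counts.sum() * widths)

print("total area:", area)
print("heights match:", np.allclose(hist, manual_density))
```

With plain counts on the vertical axis, the bars would be far taller than the PDF curve, and the two could not share a y-axis without rescaling.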
About the author: Qu Xifeng is a senior Python engineer and a practitioner and evangelist of Bokeh, which he has studied in depth. He is skilled in Flask, MongoDB, Sklearn, and related technologies and has extensive practical experience. He writes several Zhihu columns (Python Chinese community, Python programmer, big data analysis and mining), which together have more than 100,000 followers.
This article is excerpted from "Python Data Visualization: Visual Drawing Based on Bokeh", published with the permission of the publisher.