How to draw a histogram with Python: it is actually quite different from a bar chart


Introduction: Histograms and bar charts are both very common charts in data analysis. Because the two look so similar, they are easily confused. In the earlier article "What is the difference between a column chart, a stacked column chart, and a waterfall chart? How to draw them with Python?" we covered the bar chart; today we turn to the histogram.

Author: Qu Xifeng, senior Python engineer, Zhihu columnist

Source: Big Data DT (ID: hzdashuju)

01 Overview

A histogram looks similar to a bar chart but means something quite different. It is a statistical concept: the data is first divided into groups, and the number of data elements in each group is then counted. In a rectangular coordinate system, the horizontal axis marks the endpoints of each group, the vertical axis represents the frequency, and the height of each rectangle is the frequency of the corresponding group. A chart built this way is called a frequency distribution histogram.

To read the number of elements in each group from a frequency distribution histogram, the frequency has to be multiplied by the group distance. Within one histogram the group distance is a fixed value, so the vertical axis can show the number of elements directly: the height of each rectangle is then the number of data elements in the group. This keeps the shape of the distribution unchanged and makes the size of each group directly readable, as shown in Figure 2-58.

▲Figure 2-58 Histogram

A histogram can also be used to see where the data is concentrated and where abnormal or isolated values lie.
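One way to make this concrete is to group a small data set and look for a group that is separated from the rest by empty groups. The sketch below assumes NumPy is available; the sample values, the number of groups, and the "isolated group" rule are illustrative assumptions, not a method from the book.

```python
import numpy as np

# Hypothetical measurements: most values lie near 10, one lies far away.
data = np.array([9.1, 9.4, 9.8, 10.1, 10.2, 10.3, 10.6, 10.9, 11.2, 19.5])

# Count the elements in 6 equal groups covering the range 8 to 20.
counts, edges = np.histogram(data, bins=6, range=(8.0, 20.0))

# The group with the largest count shows where the data is concentrated.
peak = int(np.argmax(counts))
peak_range = (edges[peak], edges[peak + 1])

# Illustrative rule: a non-empty group whose neighbours are both empty
# (or missing, at either end) is treated as isolated.
padded = np.concatenate(([0], counts, [0]))
isolated = [i for i in range(len(counts))
            if counts[i] > 0 and padded[i] == 0 and padded[i + 2] == 0]
isolated_ranges = [(edges[i], edges[i + 1]) for i in isolated]
```

With this data the counts are 3, 6, 0, 0, 0, 1: the data is concentrated between 10 and 12, and the single value in the last group stands apart from everything else.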

First, understand the following basic concepts.

  • Number of groups: when compiling statistics, the data is divided into several groups according to value ranges; how many groups there are is called the number of groups.

  • Group distance: the difference between the two endpoints of a group (also known as the class width or bin width).

  • Frequency: the number of data elements in a group divided by the group distance (statistics texts often call this quantity the frequency density).
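The three concepts above can be worked through on a small data set. This is a minimal sketch, assuming NumPy is available; the ten sample values and the choice of three groups over the range 1 to 7 are made up for illustration.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
data = np.array([1.0, 1.2, 1.5, 1.9, 2.2, 2.8, 3.1, 3.9, 4.4, 6.7])

# Number of groups: split the range 1 to 7 into 3 equal groups.
number_of_groups = 3
counts, edges = np.histogram(data, bins=number_of_groups, range=(1.0, 7.0))

# Group distance: the difference between the two endpoints of each group.
group_distance = np.diff(edges)

# Frequency as defined in this article: count divided by group distance.
frequency = counts / group_distance

# Multiplying the frequency by the group distance recovers the counts.
recovered_counts = frequency * group_distance
```

Here the group endpoints are 1, 3, 5 and 7, the groups hold 6, 3 and 1 elements, and the group distance is 2 everywhere, so the frequencies are 3, 1.5 and 0.5. Because the group distance is constant, plotting counts instead of frequencies only rescales the vertical axis.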

02 Examples

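Sample code for drawing the rectangles of a histogram is shown below.

  • Code Example 2-45

from bokeh.plotting import figure, show

# width/height are the current argument names; older Bokeh releases used plot_width/plot_height.
plot = figure(width=300, height=300)
# Each rectangle is defined by its four boundaries: top, bottom, left, right.
plot.quad(top=[2, 3, 4], bottom=[1, 2, 3], left=[1, 2, 3],
          right=[1.2, 2.5, 3.7], color="#B3DE69")
show(plot)

The running result is shown in Figure 2-59.

▲Figure 2-59 Running result of Code Example 2-45

In Code Example 2-45, the plot.quad() call draws the rectangles that make up a histogram by defining the four boundaries of each rectangle. The parameters are described below.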

Parameters of p.quad(left, right, top, bottom, **kwargs):

  • left (NumberSpec) : x-coordinates of the left edges of the rectangles

  • right (NumberSpec) : x-coordinates of the right edges of the rectangles

  • top (NumberSpec) : y-coordinates of the top edges of the rectangles

  • bottom (NumberSpec) : y-coordinates of the bottom edges of the rectangles

Other parameters (**kwargs):

  • alpha (float) : sets the transparency of all fill and line properties at once

  • color (Color) : sets the color of all fill and line properties at once

  • source (ColumnDataSource) : Bokeh's own data container (similar to a Pandas DataFrame)

  • legend (str) : legend entry for the glyph (named legend_label in newer Bokeh releases)

  • x_range_name (str) : name of the x-axis range to use

  • y_range_name (str) : name of the y-axis range to use

  • level (Enum) : render level of the glyph
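The source parameter changes how the four boundary arguments are read: instead of lists of numbers, they become column names looked up in the data source. The following is a minimal sketch, assuming NumPy and Bokeh are installed; the sample values and the column names are arbitrary choices made for illustration.

```python
import numpy as np
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Hypothetical sample values, chosen only for illustration.
data = np.array([1.0, 1.2, 1.5, 1.9, 2.2, 2.8, 3.1, 3.9, 4.4, 6.7])
counts, edges = np.histogram(data, bins=3, range=(1.0, 7.0))

# One column per rectangle boundary; the column names are arbitrary.
source = ColumnDataSource(data=dict(
    left=edges[:-1],
    right=edges[1:],
    top=counts,
    bottom=np.zeros(len(counts)),
))

p = figure(title="Counts per group")
# With source=..., each boundary argument names a column in the source.
renderer = p.quad(left="left", right="right", top="top", bottom="bottom",
                  source=source, fill_color="#B3DE69", line_color="white",
                  alpha=0.8)
```

Calling show(p), imported from bokeh.plotting, would render the figure. Keeping the boundaries in a ColumnDataSource is convenient when the same data drives several glyphs or is updated later.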

  • Code Example 2-46

import numpy as np
import scipy.special
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show

# width/height and legend_label are the current argument names;
# older Bokeh releases used plot_width/plot_height and legend.

# Plotting function
def make_plot(title, hist, edges, x, pdf, cdf):
    p = figure(title=title, tools='', background_fill_color="#fafafa")
    # Histogram bars: one rectangle per pair of neighbouring edges
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="navy", line_color="white", alpha=0.5)
    p.line(x, pdf, line_color="#ff8888", line_width=4, alpha=0.7, legend_label="PDF")
    p.line(x, cdf, line_color="orange", line_width=2, alpha=0.7, legend_label="CDF")

    p.y_range.start = 0
    p.legend.location = "center_right"
    p.legend.background_fill_color = "#fefefe"
    p.xaxis.axis_label = 'x'
    p.yaxis.axis_label = 'Pr(x)'
    p.grid.grid_line_color = "white"
    return p

# Normal distribution
mu, sigma = 0, 0.5
measured = np.random.normal(mu, sigma, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)
x = np.linspace(-2, 2, 1000)
# Theoretical curves
pdf = 1/(sigma * np.sqrt(2*np.pi)) * np.exp(-(x-mu)**2 / (2*sigma**2))
cdf = (1+scipy.special.erf((x-mu)/np.sqrt(2*sigma**2)))/2
p1 = make_plot("Normal Distribution (μ=0, σ=0.5)", hist, edges, x, pdf, cdf)

# Log-normal distribution
mu, sigma = 0, 0.5
measured = np.random.lognormal(mu, sigma, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)
x = np.linspace(0.0001, 8.0, 1000)
pdf = 1/(x * sigma * np.sqrt(2*np.pi)) * np.exp(-(np.log(x)-mu)**2 / (2*sigma**2))
cdf = (1+scipy.special.erf((np.log(x)-mu)/(np.sqrt(2)*sigma)))/2
p2 = make_plot("Log Normal Distribution (μ=0, σ=0.5)", hist, edges, x, pdf, cdf)

# Gamma distribution
k, theta = 7.5, 1.0
measured = np.random.gamma(k, theta, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)
x = np.linspace(0.0001, 20.0, 1000)
pdf = x**(k-1) * np.exp(-x/theta) / (theta**k * scipy.special.gamma(k))
cdf = scipy.special.gammainc(k, x/theta)
p3 = make_plot("Gamma Distribution (k=7.5, θ=1)", hist, edges, x, pdf, cdf)

# Weibull distribution (sampled by inverse transform from uniform values)
lam, k = 1, 1.25
measured = lam*(-np.log(np.random.uniform(0, 1, 1000)))**(1/k)
hist, edges = np.histogram(measured, density=True, bins=50)
x = np.linspace(0.0001, 8, 1000)
pdf = (k/lam)*(x/lam)**(k-1) * np.exp(-(x/lam)**k)
cdf = 1 - np.exp(-(x/lam)**k)
p4 = make_plot("Weibull Distribution (λ=1, k=1.25)", hist, edges, x, pdf, cdf)

# Display the four plots in a 2 x 2 grid
show(gridplot([p1, p2, p3, p4], ncols=2, width=400, height=400, toolbar_location=None))

The running result is shown in Figure 2-60.

▲Figure 2-60 Running result of Code Example 2-46

Code Example 2-46 defines a custom plotting function, make_plot(title, hist, edges, x, pdf, cdf). Its parameters are the title of the plot, the top boundaries of the histogram rectangles (hist), their left and right boundaries (edges), the x-coordinates of the curves (x), the values of the probability density function (pdf), and the values of the cumulative distribution function (cdf). Inside the function, quad() draws the histogram by defining the four boundaries of each rectangle. The final line uses the gridplot() method to display four plots at once (normal, log-normal, gamma, and Weibull distributions).
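Code Example 2-46 passes density=True to np.histogram(), which is what allows the bars to share a vertical axis with the probability density function. The sketch below, assuming only NumPy, shows what that option does; the random seed and the use of np.random.default_rng are arbitrary choices for reproducibility and are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
mu, sigma = 0, 0.5
measured = rng.normal(mu, sigma, 1000)

# Plain counts and density-normalised heights over the same 50 groups.
counts, edges = np.histogram(measured, bins=50)
density, _ = np.histogram(measured, bins=50, density=True)

# With density=True, each height is count / (total count * group distance),
# so the areas of all rectangles add up to 1.
widths = np.diff(edges)
manual_density = counts / (counts.sum() * widths)
total_area = np.sum(density * widths)
```

Since a probability density function also integrates to 1, the normalised bars and the PDF curve are on the same scale, and the CDF, which runs from 0 to 1, fits in the same figure.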

About the author: Qu Xifeng is a senior Python engineer and a Bokeh practitioner and evangelist with in-depth knowledge of the library, skilled in Flask, MongoDB, Sklearn and other technologies, with extensive practical experience. Author of several Zhihu columns (Python Chinese Community, Python Programmer, Big Data Analysis and Mining), which together have more than 100,000 followers.

This article is excerpted from "Python Data Visualization: Visual Drawing Based on Bokeh", published with the permission of the publisher.

