This series collects my notes on the introductory book "Python Programming: From Getting Started to Practice" and covers beginner-level material. The section titles follow those of the book.
This is the second article on data processing in Python. It visualizes data downloaded from the Internet.
1. Introduction
This article accesses and visualizes data stored in two common formats, CSV and JSON:
- Use Python's csv module to process weather data stored in CSV (comma-separated values) format and find the maximum and minimum temperatures over a period of time in two different regions;
- Use the json module to access closing price data stored in JSON format.
The data in this article can be downloaded from the official website of the book ( http://www.ituring.com.cn/book/1861 ).
2. CSV file format
Create a new project, copy the file death_valley_2014.csv to the project root directory, and create a new file highs_lows.py. The program reads the 2014 temperature data for Death Valley, California, extracts the daily maximum and minimum temperatures, and draws a line graph:
import csv
from datetime import datetime

from matplotlib import pyplot as plt

filename = "death_valley_2014.csv"
with open(filename) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    dates, highs, lows = [], [], []
    for row in reader:
        try:
            current_date = datetime.strptime(row[0], "%Y-%m-%d")
            high = int(row[1])
            low = int(row[3])
        except ValueError:
            print(current_date, "missing data")
        else:
            dates.append(current_date)
            highs.append(high)
            lows.append(low)

fig = plt.figure(dpi=141, figsize=(10, 6))
# Plot the daily high temperatures
plt.plot(dates, highs, c="red")
# Plot the daily low temperatures
plt.plot(dates, lows, c="blue")
# Fill the area between the two lines; alpha is the opacity
# (0 is fully transparent, 1 is fully opaque)
plt.fill_between(dates, highs, lows, facecolor="blue", alpha=0.1)

plt.title("Daily high and low temperatures - 2014\nDeath Valley, CA", fontsize=20)
plt.xlabel("", fontsize=16)
# Lay out the x-axis date labels automatically so they do not overlap
fig.autofmt_xdate()
plt.ylabel("Temperature(F)", fontsize=16)
plt.tick_params(axis="both", which="major", labelsize=16)

plt.show()
The code opens the file, then the csv.reader() function creates a CSV reader object, taking the just-opened file as its argument. The next() function reads one line from the reader, which returns it as a list, and the for loop then reads all the remaining data. Error checking is added to the loop so that the program does not terminate on problems such as missing data in the file. We also use the fill_between() function to color the area between the two lines. The resulting image is as follows:
The program also prints one message:
2014-02-16 00:00:00 missing data
That is, the data for that day is missing from the file.
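To see the reader, next(), and the try/except/else pattern in isolation, here is a minimal sketch that parses a small in-memory sample instead of the real file. The sample rows and the helper name parse_weather are made up for illustration; only the column positions (date in column 0, high in column 1, low in column 3) follow the program above.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample data; the middle row has empty temperature fields
SAMPLE = """Date,Max TemperatureF,Mean TemperatureF,Min TemperatureF
2014-02-15,72,60,48
2014-02-16,,,
2014-02-17,70,59,47
"""


def parse_weather(text):
    """Return the header, the parsed rows, and the dates that were skipped."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first line holds the column names
    rows, skipped = [], []
    for row in reader:
        try:
            date = datetime.strptime(row[0], "%Y-%m-%d")
            high, low = int(row[1]), int(row[3])
        except ValueError:
            # int("") raises ValueError, so incomplete rows end up here
            skipped.append(row[0])
        else:
            rows.append((date, high, low))
    return header, rows, skipped


header, rows, skipped = parse_weather(SAMPLE)
print(header[0], len(rows), skipped)  # Date 2 ['2014-02-16']
```

Recording the raw date string in the except branch avoids relying on a variable that may not have been assigned when the exception was raised.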
3. Making a closing price chart: JSON format
Now copy btc_close_2017.json to the project root directory. In this section, five charts are drawn: a line chart of closing prices, the logarithmic transformation of closing prices, the monthly average of closing prices, the weekly average of closing prices, and the average closing price for each day of the week. All of them are drawn with Pygal.
3.1 Draw the closing price line chart
import json

import pygal

# Load the data into a list whose elements are dictionaries
filename = "btc_close_2017.json"
with open(filename) as f:
    btc_data = json.load(f)

dates, months, weeks, weekdays, close = [], [], [], [], []
for btc_dict in btc_data:
    dates.append(btc_dict["date"])
    months.append(int(btc_dict["month"]))
    weeks.append(int(btc_dict["week"]))
    weekdays.append(btc_dict["weekday"])
    close.append(int(float(btc_dict["close"])))

# Rotate the x-axis labels 20 degrees clockwise
line_chart = pygal.Line(x_label_rotation=20, show_minor_x_labels=False)
line_chart.title = "收盘价(¥)"  # "Closing price (¥)"
line_chart.x_labels = dates
N = 20  # show an x-axis label every 20 days
line_chart.x_labels_major = dates[::N]
line_chart.add("收盘价", close)  # "Closing price"
line_chart.render_to_file("收盘价折线图(¥).svg")  # "Closing price line chart (¥)"
The resulting image is as follows:
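The loading step can be tried without the data file. The sketch below parses a one-record JSON string whose field names match the ones the program reads; the values themselves are invented for illustration. It also shows why the closing price is converted with int(float(...)): a string with a decimal point cannot be passed to int() directly.

```python
import json

# Hypothetical record; only the field names are taken from the program above
SAMPLE = """[
  {"date": "2017-01-01", "month": "01", "week": "52",
   "weekday": "Sunday", "close": "6928.6492"}
]"""

records = json.loads(SAMPLE)  # a list of dictionaries, like json.load(f)
record = records[0]

month = int(record["month"])          # "01" -> 1
close = int(float(record["close"]))   # "6928.6492" -> 6928.6492 -> 6928

try:
    int(record["close"])              # int() rejects a decimal string
    direct_conversion_failed = False
except ValueError:
    direct_conversion_failed = True

print(month, close, direct_conversion_failed)  # 1 6928 True
```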
3.2 Logarithmic transformation of closing price
As the chart above shows, the closing price grows roughly exponentially overall, but there are some similar-looking fluctuations (March, June, September). Although these fluctuations are masked by the growth trend, they may be periodic. To test the periodicity hypothesis, the nonlinear trend needs to be removed first. Logarithmic transformation is one of the commonly used methods. We use the math module from the Python standard library to do this.
-- snip --
import math

line_chart = pygal.Line(x_label_rotation=20, show_minor_x_labels=False)
line_chart.title = "收盘价对数变换(¥)"  # "Log-transformed closing price (¥)"
line_chart.x_labels = dates
N = 20  # show an x-axis label every 20 days
line_chart.x_labels_major = dates[::N]
# Logarithmic transformation
close_log = [math.log10(price) for price in close]
line_chart.add("log收盘价", close_log)  # "log closing price"
line_chart.render_to_file("收盘价对数变换折线图(¥).svg")
This produces the following image:
It can be seen that there were sharp fluctuations in March, June and September. Next, let's look at the daily average closing price per month and per week.
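Why does the logarithm remove an exponential trend? If a price grows as c * r**t, then log10 of it equals log10(c) + t * log10(r), which is a straight line in t. A minimal sketch with made-up numbers (a series that doubles at every step, not the real closing prices):

```python
import math

# Hypothetical series with pure exponential growth: doubles each step
prices = [1000 * 2 ** i for i in range(6)]

logs = [math.log10(p) for p in prices]

# Differences between consecutive log values; constant steps mean a straight line
steps = [later - earlier for earlier, later in zip(logs, logs[1:])]

print(prices)
print([round(s, 4) for s in steps])
```

Every step comes out as log10(2), about 0.301, so on the log scale the exponential trend becomes a line and any remaining wiggles are easier to see.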
3.3 Average closing price
3.3.1 Monthly Daily Average
Before continuing with the new code, some background is needed. The zip() function combines multiple lists position by position into a new sequence whose elements are tuples, stopping at the end of the shortest list, as follows:
# Code
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9, 10]
zipped_1 = zip(a, b)
zipped_2 = zip(a, b, c)
print(zipped_1)
print(list(zipped_1))
print(list(zipped_2))
# Output
<zip object at 0x0000021D732DCDC8>
[(1, 4), (2, 5), (3, 6)]
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
In Python 2, zip() returns a list directly, but in Python 3, zip() returns an iterable zip object, which we convert to a list here. A zip object can also be "unpacked" by preceding it with an asterisk:
# Code:
zipped_1 = zip(a, b)  # create it again: the earlier list() call used it up
print(*zipped_1)
# Output:
(1, 4) (2, 5) (3, 6)
The asterisk can unpack not only zip objects but also lists and other iterable types.
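One detail is easy to miss: a zip object is an iterator, so it can be read only once. The following sketch shows that behavior, along with asterisk unpacking of a plain list into function arguments. The function name volume is hypothetical and only serves the example.

```python
a = [1, 2, 3]
b = [4, 5, 6]

zipped = zip(a, b)
first_pass = list(zipped)   # consumes the iterator
second_pass = list(zipped)  # nothing is left to read

print(first_pass)   # [(1, 4), (2, 5), (3, 6)]
print(second_pass)  # []


def volume(length, width, height):
    """Hypothetical helper taking three positional arguments."""
    return length * width * height


# The asterisk spreads the three list elements over the three parameters
print(volume(*a))  # 6
```

If the pairs are needed more than once, keep the list (first_pass here) rather than the zip object.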
We will also use the groupby() function, but before using it we need to sort the list, which we do with the sorted() function. In Python 3, sorted() compares elements in order by default. For example, the elements of the list here are tuples, so sorted() compares the first element of each tuple first and then the second, as follows:
# Code:
test = [(1, 5), (1, 4), (1, 3), (1, 2), (2, 3)]
print(sorted(test))
# Output:
[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3)]
This data is then grouped with the groupby() function. The keyword argument key=itemgetter(0) groups by the first value of each list element (i.e. of each tuple). itemgetter() can also be replaced with a lambda expression; the equivalent here is lambda x: x[0]. In Python 3, groupby() returns an iterable groupby object. If you convert it to a list, the second value of each element in the list is also an iterable object:
# Code:
from itertools import groupby
from operator import itemgetter

test = [(1, 5), (1, 4), (1, 3), (1, 2), (2, 4), (2, 3), (3, 5)]
temp = groupby(sorted(test), key=itemgetter(0))
print(temp)
print(list(temp))
# list() used up the iterator above, so group again before looping
temp = groupby(sorted(test), key=itemgetter(0))
for a, b in temp:
    print(list(b))
# Output:
<itertools.groupby object at 0x0000013CD9A4D458>
[(1, <itertools._grouper object at 0x0000013CE8AAE160>),
 (2, <itertools._grouper object at 0x0000013CE8AAE128>),
 (3, <itertools._grouper object at 0x0000013CE8AAE198>)]
[(1, 2), (1, 3), (1, 4), (1, 5)]
[(2, 3), (2, 4)]
[(3, 5)]
From the results of the for loop above, the object returned by groupby() can be thought of as a dictionary whose keys are the values produced by key, and whose values are the elements of the pre-grouping list that share that key. Each value is an iterable of those elements rather than a ready-made list or tuple.
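Two points from the discussion above can be checked with a short sketch: groupby() only merges neighboring elements with the same key, which is why the list is sorted first, and a lambda can stand in for itemgetter(0). The sample data is made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

data = [(1, 5), (2, 4), (1, 3)]

# Without sorting, the key 1 appears twice because its elements are not adjacent
unsorted_keys = [key for key, _ in groupby(data, key=itemgetter(0))]

# After sorting, each key appears once; here a lambda replaces itemgetter(0)
grouped = {
    key: [value for _, value in group]
    for key, group in groupby(sorted(data), key=lambda x: x[0])
}

print(unsorted_keys)  # [1, 2, 1]
print(grouped)        # {1: [3, 5], 2: [4]}
```

Building a real dictionary like this makes the "dictionary" analogy concrete, at the cost of reading every group into memory.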
Now, back to the main topic.
We plot the daily average for the first 11 months of 2017, the daily average for the first 49 weeks, and the daily average for each day of the week (Monday to Sunday). First we need to wrap some code in a function:
from itertools import groupby
from operator import itemgetter

def draw_line(x_data, y_data, title, y_legend):
    xy_map = []
    # This part is explained below
    for x, y in groupby(sorted(zip(x_data, y_data)), key=itemgetter(0)):
        y_list = [v for _, v in y]
        xy_map.append([x, sum(y_list) / len(y_list)])
    x_unique, y_mean = [*zip(*xy_map)]
    line_chart = pygal.Line()
    line_chart.title = title
    line_chart.x_labels = x_unique
    line_chart.add(y_legend, y_mean)
    line_chart.render_to_file(title + ".svg")
    return line_chart
This code is a little convoluted. As the earlier examples show, the loop variable y in the for loop behaves like a list whose elements are tuples. The first element of each tuple is a value from x_data, which is already available as x and does not need to be repeated, so only the second elements are collected into a list; that is the y_list line. xy_map is a list whose elements are also lists, in other words a two-dimensional array. Pay attention to the x_unique, y_mean assignment: *xy_map unpacks the list, and zip() packs the unpacked elements into a zip object which, viewed as a list, contains two tuples. That zip object is unpacked in turn and wrapped in an outer list, giving a list of two tuples, which are finally assigned in parallel. To make the operation concrete, here it is simulated with some simple data:
# Code:
temp = [[1, 2], [3, 4], [5, 6]]
x, y = [*zip(*temp)]
print(x)
print(y)
# Output:
(1, 3, 5)
(2, 4, 6)
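The averaging logic inside draw_line() can also be tried on its own, without Pygal. The sketch below copies just that part into a hypothetical helper named mean_by_key and feeds it a few invented values.

```python
from itertools import groupby
from operator import itemgetter


def mean_by_key(x_data, y_data):
    """Average the y values that share the same x value (hypothetical helper)."""
    xy_map = []
    for x, group in groupby(sorted(zip(x_data, y_data)), key=itemgetter(0)):
        values = [v for _, v in group]  # keep only the y part of each pair
        xy_map.append([x, sum(values) / len(values)])
    # Transpose [[x1, m1], [x2, m2], ...] into (x1, x2, ...) and (m1, m2, ...)
    x_unique, y_mean = [*zip(*xy_map)]
    return x_unique, y_mean


# Made-up data: two prices in month 1, three in month 2
months = [1, 1, 2, 2, 2]
close = [100, 200, 300, 400, 500]

print(mean_by_key(months, close))  # ((1, 2), (150.0, 400.0))
```

Note that the helper assumes non-empty input; with empty lists the final unpacking would fail because there is nothing to assign.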
At last, it's time to draw:
-- the code that reads the file is the same as before --
idx_month = dates.index("2017-12-01")
line_chart_month = draw_line(months[:idx_month], close[:idx_month],
                             "收盘价月日均值(¥)", "月日均值")  # "Monthly daily average"
The result obtained is as follows:
3.3.2 Weekly Average
The first full week of 2017 starts on Monday, January 2, and the 49th week ends on Sunday, December 10. The slice therefore skips January 1 (index 0) and stops just before December 11.
-- the code that reads the file is the same as before --
idx_week = dates.index("2017-12-11")
line_chart_week = draw_line(weeks[1:idx_week], close[1:idx_week],
                            "收盘价周日均值(¥)", "周日均值")  # "Weekly daily average"
The result is as follows:
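The two boundary dates can be double-checked with the standard library. This sketch only verifies the calendar; whether the "week" field in the JSON file uses the same numbering is an assumption that should be checked against the data.

```python
from datetime import date

first_monday = date(2017, 1, 2)
last_sunday = date(2017, 12, 10)

# weekday() returns 0 for Monday and 6 for Sunday
print(first_monday.weekday(), last_sunday.weekday())

# Number of complete Monday-to-Sunday weeks covered by the two dates
full_weeks = ((last_sunday - first_monday).days + 1) // 7
print(full_weeks)

# ISO week number of the last day
print(last_sunday.isocalendar()[1])
```

Both counts agree on 49, which matches the range used for the weekly chart.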
3.3.3 Average value for each day of the week
If you use the weekdays list directly to generate the chart, the weekday order in the resulting chart will be wrong, because the list stores strings and they are sorted by character code (that is, alphabetically). The weekday names are therefore converted to numbers first:
idx_week = dates.index("2017-12-11")
wd = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
      "Sunday"]
# Map each weekday name to 1..7 so that sorting keeps calendar order
weekdays_int = [wd.index(w) + 1 for w in weekdays[1:idx_week]]
line_chart_weekday = draw_line(weekdays_int, close[1:idx_week],
                               "收盘价星期均值(¥)", "星期均值")  # "Weekday average"
# Replace the numeric labels with weekday names (Monday to Sunday) and render again
line_chart_weekday.x_labels = ["周一", "周二", "周三", "周四", "周五", "周六", "周日"]
line_chart_weekday.render_to_file("收盘价星期均值(¥).svg")
The final result is as follows:
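The ordering problem is easy to reproduce. The sketch below sorts the weekday names as plain strings and then by their position in the calendar; the dictionary name order is introduced only for this example.

```python
wd = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday",
      "Sunday"]

# Plain string sorting is alphabetical, not chronological
print(sorted(wd))
# ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']

# Map each name to its position (1..7), as the article does with wd.index(w) + 1
order = {name: position for position, name in enumerate(wd, start=1)}
shuffled = ["Sunday", "Friday", "Monday", "Wednesday"]
print(sorted(shuffled, key=order.get))
# ['Monday', 'Wednesday', 'Friday', 'Sunday']
```

A dictionary lookup is used here instead of wd.index() only because it avoids scanning the list for every element; for seven names either approach works.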
3.4 Closing Price Data Dashboard
Finally, we combine the five charts into one HTML file to make a dashboard:
with open('收盘价Dashboard.html', 'w', encoding='utf8') as html_file:
    title = '<html><head><title>收盘价Dashboard</title><meta charset="utf-8"></head><body>\n'
    html_file.write(title)
    for svg in [
            '收盘价折线图(¥).svg', '收盘价对数变换折线图(¥).svg', '收盘价月日均值(¥).svg',
            '收盘价周日均值(¥).svg', '收盘价星期均值(¥).svg'
    ]:
        html_file.write(
            '    <object type="image/svg+xml" data="{0}" height=500></object>\n'.format(svg))
    html_file.write('</body></html>')
The effect is as follows:
This screenshot was taken with the browser zoomed in. At the default 100% zoom, the five charts all sit on one line and are very small.
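If the charts look too small or too large, the height attribute of each object element is the value to adjust. One possible way to make that easier, sketched here under my own naming (build_dashboard_html and the file names a.svg and b.svg are hypothetical), is to build the page as a string with the height as a parameter:

```python
def build_dashboard_html(svg_files, title="Dashboard", height=500):
    """Return an HTML page embedding each SVG file in an <object> element."""
    lines = [
        '<html><head><title>{}</title><meta charset="utf-8"></head><body>'.format(title)
    ]
    for svg in svg_files:
        lines.append(
            '    <object type="image/svg+xml" data="{}" height={}></object>'.format(
                svg, height))
    lines.append("</body></html>")
    return "\n".join(lines)


page = build_dashboard_html(["a.svg", "b.svg"], height=300)
print(page.count("<object "))  # 2
```

The returned string can then be written to a file with open(..., "w", encoding="utf8"), exactly as in the code above.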
4. Summary
The main contents of this article are:
- How to use datasets from the web;
- How to process CSV and JSON files, and how to extract the data you are interested in;
- How to use matplotlib to process historical weather data, including how to use the datetime module and how to plot multiple data series in the same chart;
- How to use the json module to access closing price data stored in JSON format, how to use Pygal to plot charts that explore the cyclicality of price changes, and how to combine Pygal charts into a data dashboard.
The next article will collect data from the web and visualize it.