Python crawler course project - crawl 3000 records and visualize the data

A job I took on a while ago; sharing it here.

Assignment requirements

"Python and Data Analysis" final homework requirements (2020-2021 academic year 2nd semester)
1. Final homework requirements:
1. On the basis of the data crawled in the previous homework, write code in Python to conduct comprehensive data analysis on the crawled data And visualization, encourage the establishment of a measurement model for analysis;
2. Write the final work document: the overall idea, the analysis of the crawling website, what aspects of the data analysis and data visualization, and conclusions.
3. You must write your own crawler program. It is not allowed to use crawler frameworks (such as scrapy) to crawl data, and plagiarism is strictly prohibited.
2. Submit:
1. The crawler program code (preliminary work) file
(.ipynb), plus the necessary notes or notes;
2. The data analysis, visualization code file (.ipynb), plus the necessary notes or notes ;
3. Captured data files and intermediate files generated by data analysis and visualization;
4. Final work documents.

Data scraping

# For fetching pages
import requests
# For parsing HTML
from bs4 import BeautifulSoup
# For regex matching of the target fields
import re
# For writing the csv file
import csv

# Open the output file
# "a+" appends to the file
# newline="" avoids blank lines between csv rows
fp = open("data.csv", "a+", newline="")
# Wrap the file object in a csv writer
csv_fp = csv.writer(fp)
# Column headers for the csv file
head = ['日期', '最高气温', '最低气温']
# Write the header row
csv_fp.writerow(head)

# Fake a browser User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"
}

# Holds all rows
data = []

# Build the url for each year and month
# Years 2011 to 2020
for i in range(2011, 2021):
    # Months 1 to 12
    for j in range(1, 13):
        # Convert the year to a string
        i = str(i)
        # Zero-pad months below 10
        if j < 10:
            j = "0" + str(j)
        else:
            # Convert the month to a string
            j = str(j)
        # Assemble the url
        url = "http://www.tianqihoubao.com/lishi/beijing/month/" + i + j + ".html"

        # Send the request
        response = requests.get(url=url, headers=headers)
        # The site is encoded in gbk
        response.encoding = 'gbk'
        # Response body as text
        page = response.text
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(page, 'lxml')
        # Get all tr tags
        tr_list = soup.find_all('tr')

        # Process each tr tag
        for tr in tr_list:
            # Holds one day's data
            one_day = []
            # Convert to a string for regex matching
            tr = str(tr)
            # Remove all spaces
            tr = tr.replace(" ", "")
            # Extract the date
            date = re.findall(r'title="(.*?)北京天气预报">', tr)
            # If found, append it to one_day
            if date:
                one_day.append(date[0])
            # Extract the maximum and minimum temperatures
            tem = re.findall(r'(.*?)℃', tr)
            # If both were found, append them to one_day
            if len(tem) >= 2:
                one_day.append(tem[0])
                one_day.append(tem[1])
            # Only keep rows where date, max and min were all found
            if len(one_day) == 3:
                data.append(one_day)
                print(one_day)
                # Write the row to the csv file
                csv_fp.writerow(one_day)

# Close the file
fp.close()

Crawl results

(Figure: screenshot of the crawled data.)

Data processing

# Read the csv file
import csv
# Plotting library
from matplotlib import pyplot as plt

# Dates
x = []
# Maximum temperatures
h = []
# Minimum temperatures
l = []
# Read the data crawled earlier
with open("data.csv") as f:
    reader = csv.reader(f)
    for i, rows in enumerate(reader):
        # Skip the header row
        if i:
            print(rows)
            x.append(rows[0])
            h.append(int(rows[1]))
            l.append(int(rows[2]))
# Set the figure size
fig = plt.figure(dpi=128, figsize=(20, 6))
# Display Chinese labels correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Plot the maximum temperatures
plt.plot(x, h, c="red", alpha=0.5)
# Plot the minimum temperatures
plt.plot(x, l, c="blue", alpha=0.5)
# Shade the area between the two curves
plt.fill_between(x, h, l, facecolor="blue", alpha=0.2)
# Title
plt.title("北京市过去3658天的气温变化")
# y-axis label
plt.ylabel("气温")
# x-axis label
plt.xlabel("日期")
# Show only every 300th date on the x-axis to keep it readable
plt.xticks(x[::300])
plt.show()

Data visualization

(Figure: line chart of Beijing's daily maximum and minimum temperatures, 2011-2020.)

Final assignment document

Overall idea
Crawl the daily maximum and minimum temperatures in Beijing over the past ten years (2011-2020, 3658 days) from the Tianqihoubao weather-history website, then use matplotlib to draw a line chart to analyze the temperature trend.
Website analysis
1. The site can only be queried one month at a time, so a url is built for every year-month combination.
2. The data sits in an HTML table whose tr tags have no attributes, so each tr is processed with regular expressions. A row only counts as successfully crawled, and is appended to the overall data list, when the date, the maximum temperature, and the minimum temperature are all found (a minimal parsing sketch is shown below).
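For illustration, here is a sketch of that row-validation logic applied to a hand-written tr fragment. The fragment is an assumption made up for demonstration (the real page markup is more verbose); only the regular expressions and the length check are taken from the crawler above.

import re

# A simplified, hand-written fragment shaped the way the regexes expect (assumed for illustration)
sample_tr = '''<tr>
<td><a title="2011年01月01日北京天气预报">01月01日</a></td>
<td>
2℃ /
-7℃</td>
</tr>'''
# Remove spaces, exactly as the crawler does
sample_tr = sample_tr.replace(" ", "")

one_day = []
# Same date pattern as the crawler
date = re.findall(r'title="(.*?)北京天气预报">', sample_tr)
if date:
    one_day.append(date[0])
# Same temperature pattern: whatever precedes each ℃ on a line
tem = re.findall(r'(.*?)℃', sample_tr)
if len(tem) >= 2:
    one_day.append(tem[0])
    one_day.append(tem[1])
# The row is accepted only when all three fields were found
if len(one_day) == 3:
    print(one_day)  # ['2011年01月01日', '2', '-7'] for this fragment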
Data analysis
1. From the temperature changes over the past 3658 days, the yearly range between the winter low and the summer high in Beijing is fairly stable at around 50 degrees, while in 2015 the range exceeded 60 degrees (a sketch of how the yearly range can be computed is given below).
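A minimal sketch of how that yearly range could be computed, assuming the data.csv written by the crawler above (a header row plus date, maximum and minimum columns, with dates in the form 2011年01月01日 so the year is the first four characters):

import csv
from collections import defaultdict

# Track each year's highest maximum and lowest minimum
year_high = defaultdict(lambda: -100)
year_low = defaultdict(lambda: 100)

with open("data.csv") as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        # Skip the header row
        if i == 0:
            continue
        # The year is the first four characters of the date string
        year = row[0][:4]
        year_high[year] = max(year_high[year], int(row[1]))
        year_low[year] = min(year_low[year], int(row[2]))

# Print the winter-summer range for each year
for year in sorted(year_high):
    print(year, year_high[year] - year_low[year])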
Conclusion
1. The temperature change in Beijing basically follows the expected seasonal pattern.

Origin blog.csdn.net/qq_50216270/article/details/119876947