1. Website Analysis
This article uses the requests library to fetch jokes from the Qiushibaike (糗事百科) website. Readers can open the text section of qiushibaike.com to follow along. The page is shown below:
At the bottom of the page is a navigation bar with numeric links for switching between pages; each page displays 25 jokes. So to crawl jokes across multiple pages, the crawler must analyze not only the current page's HTML but also fetch the HTML of the other pages.
Now switch between pages and observe the URL pattern. Pages 1, 2, and 3 correspond to the following URLs:
https://www.qiushibaike.com/text/page/1/
https://www.qiushibaike.com/text/page/2/
https://www.qiushibaike.com/text/page/3/
From these URLs it is clear that the page is selected by the last number in the URL: page 1 ends in 1, page 12 ends in 12, so the URL for any page is easy to construct. The main task now is to analyze the HTML of each page. Readers can press F12 to open the browser's developer tools and track down the relevant HTML, as shown below:
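The pattern above maps a page number straight to its URL, so all 13 page URLs can be generated with a single list comprehension (the same approach the full sample code below uses):

```python
# Build the URLs for pages 1 through 13 from the observed pattern
base = "https://www.qiushibaike.com/text/page/{}/"
urls = [base.format(n) for n in range(1, 14)]
print(urls[0])   # https://www.qiushibaike.com/text/page/1/
print(urls[-1])  # https://www.qiushibaike.com/text/page/13/
```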
This article uses regular expressions for data parsing. Readers who are not yet familiar with regular expressions can refer to the blogger's earlier article, Python Crawler Data Extraction (III): Regular Expressions. Regular expressions can also be validated in PyCharm.
Qiushibaike's HTML is fairly well structured, so the relevant elements are easy to locate. For example, a user's gender can be identified from the following HTML:
<div class="articleGender manIcon">34</div>
The manIcon class marks the user as male; for female users the class is womenIcon.
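As a quick sanity check, a single pattern can pull both the gender class and the level number out of that div; a minimal sketch using Python's re module:

```python
import re

# The gender icon class and the user level live in the same div,
# so one pattern with two capture groups extracts both at once.
html = '<div class="articleGender manIcon">34</div>'
m = re.search(r'<div class="articleGender (\w+)">(\d+)</div>', html)
sex_class, level = m.group(1), m.group(2)
print(sex_class, level)  # manIcon 34
```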
2. Sample code
Based on the analysis above, we now write a crawler that fetches 13 pages of Qiushibaike jokes and saves the results to a file named jokes.txt. The sample code is as follows:
# -*- coding: UTF-8 -*-
"""
@author:AmoXiang
@file:3.抓取糗事百科网的段子.py
@time:2020/09/11
"""
import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'
}
joke_list = []  # stores all collected jokes


# Determine gender from the icon class name ('女' = female, '男' = male)
def verify_sex(class_name):
    if class_name == "womenIcon":
        return '女'
    else:
        return '男'


def get_joke(url):
    res = requests.get(url=url, headers=headers)
    text = res.text  # HTML code of the page
    # print(text)  # uncomment to inspect the raw HTML
    # Extract the user IDs
    # \s matches any whitespace character: spaces, tabs, form feeds, etc.
    # id_list = re.findall(r'<h2>\s(.*?)\s</h2>', text)  # first way to write the regex
    # re.S: let (.) match every character, including newlines; ?: non-greedy mode
    id_list = re.findall(r'<h2>(.*?)</h2>', text, re.S)  # second way to write the regex
    # Extract the user levels
    level_list = re.findall(r'<div class="articleGender .*?">(\d+)</div>', text)
    # Extract the genders
    sex_list = re.findall(r'<div class="articleGender (.*?)">\d+</div>', text)
    # Extract the joke content
    content_list = re.findall(r'<div class="content">.*?<span>(.*?)</span>', text, re.S)
    # Extract the "funny" counts
    laugh_list = re.findall(r'<span class="stats-vote"><i class="number">(\d+)</i>', text, re.S)
    # Extract the comment counts ('评论' is the literal text "comments" in the page HTML)
    comment_list = re.findall(r'<i class="number">(\d+)</i> 评论', text)
    # zip groups the elements at matching indexes of the lists above,
    # e.g. [1, 2] and ['a', 'b'] become [(1, 'a'), (2, 'b')]
    for id, level, sex, content, laugh, comment in zip(id_list, level_list, sex_list,
                                                       content_list, laugh_list, comment_list):
        id = id.strip()
        sex = verify_sex(sex)
        content = content.strip().replace('<br/>', '')
        # Collect the data belonging to one joke
        info = {
            'id': id,
            'level': level,
            'sex': sex,
            'content': content,
            'laugh': laugh,
            'comment': comment
        }
        joke_list.append(info)


if __name__ == '__main__':
    # 1. Generate the URLs of pages 1-13
    url_list = ["https://www.qiushibaike.com/text/page/{}/".format(i) for i in range(1, 14)]
    # 2. Request each of the 13 URLs to collect the jokes on those pages
    for url in url_list:
        get_joke(url)
    # Save the results to jokes.txt in the current directory
    with open("./jokes.txt", 'a', encoding="utf8") as f:
        for joke in joke_list:
            try:
                f.write(joke['id'] + "\n")
                f.write(joke['level'] + "\n")
                f.write(joke['sex'] + "\n")
                f.write(joke['content'] + "\n")
                f.write(joke['laugh'] + "\n")
                f.write(joke['comment'] + "\n\n")
            except Exception:
                pass
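The zip call in get_joke simply pairs up the elements at matching indexes of the six result lists, so each iteration of the loop sees all the fields of one joke together. A minimal illustration with two lists:

```python
# zip pairs elements with the same index across the input lists
ids = ["Alice", "Bob"]
levels = ["34", "27"]
pairs = list(zip(ids, levels))
print(pairs)  # [('Alice', '34'), ('Bob', '27')]
```

Note that zip stops at the shortest input, so if one regex misses an item on the page, the lists silently fall out of alignment; that is a known weakness of parsing HTML with regular expressions.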
After running the program, a jokes.txt file appears in the current directory with contents like the following:
The above content is for technical learning and exchange only. Please do not use the collected data commercially; you do so at your own risk, and it has nothing to do with the blogger. In case of infringement, contact the blogger for removal.