Hello everyone, I'm wangzirui32. Today we will learn how to extract all the code blocks from a CSDN blog post.
1. Analyze the source code of the web page
As shown in the figure:
2. Write the code
The code is as follows (if anything is unclear, read the comments):
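# Import the required modules
from requests import get
from bs4 import BeautifulSoup as bs
import re  # regular expression module
# Spoof the request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36",
    "Host": "blog.csdn.net"
}
# Ask for the article link (the prompt means "Please enter the article link:")
article_link = input("请输入文章链接:")
try:
    # Request the page
    r = get(article_link, headers=headers)
except Exception:
    # The message means "Error! Please try again!"
    print("错误!请重新再试!")
else:
    soup = bs(r.text, "html.parser")
    # Get all code tags
    codes = soup.find_all("code")
    # Counter used to number the output files
    code_id = 1
    # Loop over codes
    for code in codes:
        # re.sub replaces every match of a pattern
        # First argument: the pattern; here it matches <...>, where (.*?) means any content
        # Second argument: the replacement; "" effectively deletes the match
        # Third argument: the string to search
        code_content = re.sub(r"<(.*?)>", "", code.text)
        # Open the file, write the code, and declare UTF-8 as the encoding
        with open("code_" + str(code_id) + ".txt", "w", encoding="UTF-8") as f:
            f.write(code_content)
        code_id += 1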
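To try the extraction step without a network request, here is a minimal sketch that collects the text of every code tag from an HTML string. It swaps in the standard library's html.parser for BeautifulSoup so that it runs with no third-party packages; the names CodeBlockCollector and extract_code_blocks and the sample page are made up for illustration and are not part of the script above.

```python
from html.parser import HTMLParser


class CodeBlockCollector(HTMLParser):
    """Collect the text content of every outermost <code> element."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so entities are decoded
        self.blocks = []    # finished code blocks, in document order
        self._depth = 0     # how many <code> tags are currently open
        self._parts = []    # text pieces of the block being read

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._depth += 1
            if self._depth == 1:
                self._parts = []

    def handle_endtag(self, tag):
        if tag == "code" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._parts))

    def handle_data(self, data):
        # Only text is passed here; nested tags such as <span> are not included.
        if self._depth > 0:
            self._parts.append(data)


def extract_code_blocks(html):
    """Return a list with the text of each <code> element in html."""
    parser = CodeBlockCollector()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hypothetical page fragment: one highlighted block and one block showing HTML source.
page = (
    '<pre><code><span class="token keyword">print</span>(1)</code></pre>'
    "<p>text</p>"
    "<pre><code>&lt;p&gt;Hello&lt;/p&gt;</code></pre>"
)
blocks = extract_code_blocks(page)
print(blocks)
```

In this sketch the highlighting span disappears on its own because only text nodes are collected, while the escaped tags in the second block come back as literal text.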
However, this code only works for code blocks that do not contain HTML source. This is because "<(.*?)>" deletes every HTML tag it finds, including tags that are part of the displayed code. One solution is to replace
code_content = re.sub(r"<(.*?)>", "", code.text)
with
code_content = re.sub(r"<span class='token(.*?)'>", "", code.text)
All the syntax-highlighting styles are set on span tags whose class attribute mostly starts with token, which is why the pattern is written this way. If you have a better solution, please leave a message in the comment area!
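To see the difference between the two patterns concretely, here is a small sketch that uses only the re module. The sample string is made up for illustration; it stands for the text of a code block that displays HTML source.

```python
import re

# Hypothetical sample: the text of a code block that shows HTML source.
sample = '<div class="box">\n  <p>Hello</p>\n</div>'

# Pattern 1: removes anything that looks like a tag, including the displayed HTML.
strip_all_tags = re.sub(r"<(.*?)>", "", sample)

# Pattern 2: removes only opening span tags whose class starts with "token"
# (as written, it matches single-quoted class attributes only).
strip_token_spans = re.sub(r"<span class='token(.*?)'>", "", sample)

print(repr(strip_all_tags))
print(repr(strip_token_spans))
```

The first pattern leaves only the word Hello and the surrounding whitespace, while the second leaves the sample untouched. Since BeautifulSoup's .text returns only the strings inside a tag and not the markup of nested tags, it may be worth checking whether any substitution is needed at all for a given article.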
3. Run the code
Run the script and enter the article link at the prompt:
请输入文章链接:https://blog.csdn.net/wangzirui32/article/details/113871177
When the script finishes, open the directory where the code file is located. You should see two text files, code_1 and code_2, and opening them shows the code extracted from the article.
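The numbering and file-writing step that produces these files can also be isolated into a small helper. This is a sketch under assumptions: save_blocks is a hypothetical name, the two sample strings are made up, and a temporary directory is used so the demo leaves no files behind.

```python
from pathlib import Path
import tempfile


def save_blocks(blocks, directory):
    """Write each block to code_<n>.txt (n starts at 1) and return the paths."""
    directory = Path(directory)
    paths = []
    # enumerate(..., start=1) replaces the manual code_id counter.
    for code_id, content in enumerate(blocks, start=1):
        path = directory / f"code_{code_id}.txt"
        path.write_text(content, encoding="UTF-8")
        paths.append(path)
    return paths


# Demo in a temporary directory that is deleted when the block ends.
with tempfile.TemporaryDirectory() as tmp:
    written = save_blocks(["print(1)\n", "<p>Hello</p>\n"], tmp)
    names = [p.name for p in written]
    first = written[0].read_text(encoding="UTF-8")
    print(names)
```

Returning the paths makes it easy to confirm how many files were written without opening the directory by hand.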
That's all for today. Did you follow everything? If you liked this article, please like and bookmark it. Bye!