Python extra: scraping all the code blocks from a CSDN blog post

Hello everyone, I am wangzirui32. Today we will learn how to scrape all the code blocks from a CSDN blog post.

1. Analyze the source code of the web page

As shown in the figure:
[Figure: analysis]
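
As a rough illustration of the structure the later code relies on, the sketch below parses a small fragment of markup and collects the text inside each code tag. The sample markup is an assumption based on the description later in this post (code lives in code tags, highlighting is done with span tags whose class starts with token); it is not copied from a real CSDN page. The CodeTextCollector class is a hypothetical helper, and it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, assumed for illustration only.
SAMPLE_HTML = (
    '<div><p>Some prose</p>'
    '<pre><code class="language-python">'
    '<span class="token keyword">print</span>'
    '<span class="token punctuation">(</span>1'
    '<span class="token punctuation">)</span>'
    '</code></pre></div>'
)


class CodeTextCollector(HTMLParser):
    """Collect the text found inside each <code> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a <code> element
        self.blocks = []  # one string per top-level <code> element

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            if self.depth == 0:
                self.blocks.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears inside a <code> element
        if self.depth > 0:
            self.blocks[-1] += data


collector = CodeTextCollector()
collector.feed(SAMPLE_HTML)
collector.close()
print(collector.blocks)  # ['print(1)']
```

The span tags used for highlighting disappear on their own here, because only text nodes are collected and the tags themselves are never copied.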

2. Write the code

The code is as follows (if anything is unclear, read the comments):

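# Import the required modules
from requests import get
from bs4 import BeautifulSoup as bs
import re  # regular expression module

# Spoof the request headers so the request looks like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36",
    "Host": "blog.csdn.net"
}

# Ask for the article link (the prompt reads "Please enter the article link:")
article_link = input("请输入文章链接:")

try:
    # Request the page
    r = get(article_link, headers=headers)
except Exception:
    # The message reads "Error! Please try again!"
    print("错误!请重新再试!")
else:
    soup = bs(r.text, "html.parser")

    # Find all code tags
    codes = soup.find_all("code")
    # Counter used to number the output files
    code_id = 1

    # Iterate over the code tags
    for code in codes:
        # re.sub replaces every match of a pattern.
        # The pattern matches <...>; the (.*?) in the middle matches any content.
        # The second argument is the replacement; "" amounts to deletion.
        # The third argument is the string to search in.
        code_content = re.sub(r"<(.*?)>", "", code.text)

        # Open the file and write the code, using UTF-8 encoding
        with open("code_" + str(code_id) + ".txt", "w", encoding="UTF-8") as f:
            f.write(code_content)

        code_id += 1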
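
The manual code_id counter in the loop above can also be written with enumerate. The following is a minimal sketch, assuming the code strings have already been extracted into a list; save_code_blocks and its out_dir parameter are hypothetical names that do not appear in the original script.

```python
from pathlib import Path


def save_code_blocks(blocks, out_dir="."):
    """Write each string in blocks to code_<n>.txt and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    # enumerate(..., start=1) replaces the manual code_id counter
    for code_id, content in enumerate(blocks, start=1):
        target = out_path / f"code_{code_id}.txt"
        target.write_text(content, encoding="UTF-8")
        written.append(target)
    return written
```

For example, save_code_blocks(["a = 1", "print(a)"]) would write code_1.txt and code_2.txt to the current directory, matching the file names used by the script.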

However, this code only works for posts whose code does not itself contain HTML source. This is because "<(.*?)>" deletes everything that looks like an HTML tag, including tags that are part of the displayed code. One solution is to replace

code_content = re.sub(r"<(.*?)>", "", code.text)

with

code_content = re.sub(r"<span class='token(.*?)'>", "", code.text)

All of the highlighting styles are applied through span tags whose class attribute mostly starts with token, so this pattern targets only those tags and leaves HTML that belongs to the displayed code untouched. If you have a better solution, feel free to leave a comment!
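
To make the difference between the two patterns concrete, here is a small sketch that uses only the re module. The sample string is hypothetical; it stands in for what code.text might return for a post that displays HTML source and a comparison.

```python
import re

# Hypothetical text of a displayed code block, assumed for illustration.
displayed_code = '<p class="intro">Hello</p>\nif a < b and b > c:\n    pass'

# Broad pattern from the first version of the script
broad = re.sub(r"<(.*?)>", "", displayed_code)
# Narrow pattern from the suggested fix
narrow = re.sub(r"<span class='token(.*?)'>", "", displayed_code)

print(repr(broad))   # 'Hello\nif a  c:\n    pass'
print(repr(narrow))  # same as displayed_code
```

The broad pattern removes the displayed p tags and also damages the comparison, since "< b and b >" looks like a tag to it. The narrow pattern leaves this sample unchanged, so in this example writing code.text directly to the file would give the same result.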


3. Run the code

Run the code and follow the prompts to enter:

请输入文章链接:https://blog.csdn.net/wangzirui32/article/details/113871177

When the script finishes, open the directory it was run from. You should see two text files, code_1.txt and code_2.txt, and each one contains an extracted code block.
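
Because the request headers hard-code Host as blog.csdn.net, the script is only meant for links on that host. A minimal sketch of a check that could be run on the input before the request is sent follows; is_csdn_article_link is a hypothetical helper and not part of the original script.

```python
from urllib.parse import urlparse


def is_csdn_article_link(link):
    """Return True if link is an http(s) URL on blog.csdn.net."""
    parts = urlparse(link.strip())
    return parts.scheme in ("http", "https") and parts.hostname == "blog.csdn.net"


print(is_csdn_article_link("https://blog.csdn.net/wangzirui32/article/details/113871177"))  # True
print(is_csdn_article_link("https://example.com/article"))  # False
```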


That's all for today. Did you follow along? If you enjoyed this article, feel free to like and bookmark it. Bye!

Origin blog.csdn.net/wangzirui32/article/details/115052489