Fixing garbled content returned by a Python crawler (solution)

I promised a friend that I would crawl a certain paper, but the crawled content came back garbled. So I asked my teacher and took these notes to summarize what I learned.

Handling garbled characters in Python

There is always some gap between what we expect our code to do and what the computer actually does when it runs it. Anyone who has learned programming knows the feeling.

Now for the solution. First we have to work out what causes the garbled content. The most likely cause is that requests decodes `.text` with the wrong encoding, so the first step is to find out which character encoding the HTML page really uses. Press F12 -> open the Console -> enter document.charset. For this page the encoding is "GBK".

Next, look up the documentation of the requests library, linked from pypi.org. (My teacher said that the official site is the authoritative source. If the official site is hard to follow, read the blogs written by experts on CSDN.) requests guesses the encoding from the HTTP response headers. When the headers declare a text content type without a charset, it falls back to ISO-8859-1, so the GBK bytes of this page are decoded with the wrong table.
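The mismatch described above can be reproduced without any network access. The following is a minimal sketch, not taken from the original post: the sample string is made up for illustration, and it assumes the page bytes are GBK while the client decodes them as ISO-8859-1.

```python
# Sample text invented for this illustration (not from the crawled page).
original = "论文标题"

# What the server sends: the text encoded as GBK bytes.
raw_bytes = original.encode("gbk")

# What a client sees if it wrongly decodes those bytes as ISO-8859-1.
garbled = raw_bytes.decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so encoding the
# garbled string back gives the original bytes, which can then be read as GBK.
recovered = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))   # unreadable Latin-1 characters
print(recovered)       # the original Chinese text
```

The round trip is lossless only because ISO-8859-1 assigns a character to every possible byte value. If the wrong decoding had used a different codec, the bytes might not be recoverable this way.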
So we need to manually set the encoding method:

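'''
	Author: ls富
	Date: 2021/1/9
'''
import requests
from bs4 import BeautifulSoup  # import the modules


url = "https://www.unjs.com/lunwen/f/20191111001204_2225087.html"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47"
}  # request header that imitates a browser

html = requests.get(url, headers=headers)

# Alternative: set html.encoding = 'gbk' here and then read html.text directly.
# (html.content holds the raw bytes and is read-only, so assigning to it does not work.)

# requests decoded the body as ISO-8859-1, so encode it back to the raw bytes
# and decode those bytes as GBK
Html = html.text.encode('iso-8859-1').decode('gbk')
soup = BeautifulSoup(Html, 'html.parser')  # parse the page with BeautifulSoup
soupl = soup.select(".title")  # use a CSS selector to pick the content we need
print(soupl)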

Result: [screenshot of the printed output]
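Checking document.charset in the browser console can also be approximated in code by reading the charset that the page declares in its own `<meta>` tag. The sketch below is one possible way to do that with only the standard library. The helper names `sniff_charset` and `decode_html` are hypothetical, and the sample HTML is made up. With requests, the raw bytes would come from `html.content`. requests also exposes `html.apparent_encoding`, which guesses the encoding from the bytes themselves.

```python
import re

# Hypothetical helper pattern (not from the original post): matches the charset
# declared in <meta charset="..."> or <meta ... content="...; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([\w.:-]+)', re.IGNORECASE
)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in the first 4096 bytes, or the default."""
    match = _META_CHARSET.search(raw[:4096])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


def decode_html(raw: bytes) -> str:
    """Decode page bytes with the declared charset, replacing bad bytes."""
    return raw.decode(sniff_charset(raw), errors="replace")


# Made-up sample page, encoded as GBK the way the server would send it.
sample = '<html><head><meta charset="gbk"><title>论文标题</title></head></html>'.encode("gbk")
print(sniff_charset(sample))
print(decode_html(sample))
```

This assumes the declared charset is a name Python recognizes. An unknown name would raise LookupError in `decode_html`, so real code may want to fall back to a default in that case.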

Origin blog.csdn.net/weixin_47514459/article/details/112390388