I promised a friend I would crawl a certain paper for him, but the crawled content came back garbled. So I asked my teacher and took these notes to summarize what I learned.
Handling garbled characters in Python
There is always a gap between the code you think you wrote and what the computer actually executes. Anyone who has learned programming will know the feeling.
Phenomenon:
Now for the solution. First we have to work out what causes the garbled content. The likely culprit is that requests decodes the response's .text with the wrong encoding; the correct one is whatever character encoding the HTML page itself uses. To check, press F12, open the Console, and enter document.charset. As shown in the figure, the encoding of this page is "GBK". A common cause of the mismatch is that when the server's Content-Type header names no charset for a text response, requests falls back to ISO-8859-1, so GBK bytes are turned into nonsense characters.
At this point it is worth looking at the requests library on pypi.org. (The teacher said that the official site is the authoritative source. If the official site is hard to follow, read the blogs written by experts on CSDN.)
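To see why the wrong decoding produces garbage, and why it can be undone, here is a small offline sketch. The sample string is my own placeholder rather than text from the page, and the sketch assumes the mismatch is exactly "GBK bytes read as ISO-8859-1".

```python
# Offline demonstration of the mismatch; no network access is needed.
original = "论文标题"  # placeholder sample text (an assumption, not from the page)

# The server sends the text as GBK bytes ...
raw_bytes = original.encode("gbk")

# ... but the client decodes those bytes as ISO-8859-1, which yields mojibake.
garbled = raw_bytes.decode("iso-8859-1")

# ISO-8859-1 maps every byte value to exactly one character, so the wrong
# decoding loses nothing: re-encoding gives back the original bytes,
# which can then be decoded properly as GBK.
recovered = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))   # unreadable accented characters
print(recovered)       # the original Chinese text again
```

The round trip only works because ISO-8859-1 is lossless over bytes. If the text had been decoded with some other wrong encoding, the bytes might not be recoverable this way.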
So we need to manually set the encoding method:
'''
Author: ls富
Date: 2021/1/9
'''
import requests
from bs4 import BeautifulSoup  # import the modules
url = "https://www.unjs.com/lunwen/f/20191111001204_2225087.html"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47"
}  # request header that imitates a browser
html = requests.get(url, headers=headers)
# html.encoding = "utf-8"  # would decode the page as UTF-8, which is wrong for this GBK page
# Assumes requests decoded the body as ISO-8859-1: undo that, then decode the bytes as GBK
Html = html.text.encode('iso-8859-1').decode('gbk')
soup = BeautifulSoup(Html, 'html.parser')  # parse the page with BeautifulSoup
soupl = soup.select(".title")  # pick out the needed content with a CSS selector
print(soupl)
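Checking document.charset by hand works for a single page. As a rough sketch of doing the same check in code, the snippet below looks for the charset declared in the page's meta tag and decodes the raw bytes with it. The names META_CHARSET, sniff_charset and decode_html are hypothetical helpers I made up for illustration, the sample HTML is invented, and the regular expression is a simplification that will not cover every way a page can declare its encoding.

```python
import re

# Hypothetical helper pattern: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def sniff_charset(raw_html: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    # Only the start of the page is scanned; declarations normally sit in <head>.
    match = META_CHARSET.search(raw_html[:2048])
    return match.group(1).decode("ascii").lower() if match else default


def decode_html(raw_html: bytes) -> str:
    """Decode page bytes using the charset the page itself declares."""
    return raw_html.decode(sniff_charset(raw_html), errors="replace")


# Invented sample page, encoded as GBK the way the real site is assumed to be.
sample = '<html><head><meta charset="gbk"><title>论文</title></head></html>'.encode("gbk")
print(sniff_charset(sample))
print(decode_html(sample))
```

With requests, the raw bytes would come from the response's content attribute instead of the sample. A simpler route, if the encoding is already known, is to assign it to the response's encoding attribute (for example "gbk") before reading .text, which avoids the encode and decode round trip.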
Phenomenon: