Python crawler: garbled text in the response

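This problem tormented me for almost a whole day. Luckily, I stubbornly kept searching for a fix.

“At last I have found you; I'm glad I never gave up.”

On to the point. Thanks to the expert who shared the solution; here is the link: https://www.cnblogs.com/leomo/p/6869230.html

Below is the code. It requests the seal-script form of the Chinese character “一” and returns the page source: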

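import requests
from urllib.parse import quote  # Python 3 replacement for urllib2.quote

# Fetch the page with a POST request

url = 'http://www.diyiziti.com/Builder'
# Sort holds the site's font ID and must be defined beforehand; its value is not shown in this post
data = {'Content': quote('一'),
        'FontInfoId': Sort}
# Declare the charset in the request header (the fix this post is about)
headers = {'content-type': 'charset=utf8'}
response = requests.post(url=url, data=data, headers=headers)
print(response.content)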

How I got there:

First, I sent a GET request with no parameters and printed the encoding that requests detected for the page.

url = 'http://www.diyiziti.com/Builder'
response = requests.get(url)
print(response.encoding)

>>>utf-8

Result: utf-8

But when I sent a POST request with parameters and printed the encoding of the response, the result was different.

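from urllib.parse import quote  # Python 3 replacement for urllib2.quote

url = 'http://www.diyiziti.com/Builder'
# wd is the character to look up and Sort is the font ID; both must be defined beforehand
data = {'Content': quote(wd), 'FontInfoId': Sort}
response = requests.post(url=url, data=data)
print(response.encoding)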

>>>None

Result: None

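I was baffled until I read the expert's solution and realized that the response returned after I submitted data did not declare any encoding.

The article https://blog.csdn.net/sentimental_dog/article/details/52661974 points out:

“What the official documentation means is that if requests finds no charset in the HTTP headers, it falls back to the default ISO-8859-1 (what we usually call latin-1, even though most web pages actually use utf-8). What does that lead to?”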
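The fallback described in the quote can be sketched roughly as follows. This is only an approximation of header-based charset detection as I understand it, not the actual requests source; the function name encoding_from_content_type is hypothetical, and the parsing uses the standard library's email.message.Message.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Guess a text encoding from a Content-Type header value (hypothetical helper)."""
    if not content_type:
        return None  # no header at all: nothing to go on

    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_content_charset()  # lower-cased charset parameter, or None
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"  # the latin-1 default for text/* mentioned in the quote
    return None  # non-text type with no charset: no guess


print(encoding_from_content_type("text/html; charset=utf-8"))
print(encoding_from_content_type("text/html"))
print(encoding_from_content_type(None))
```

Under this reading, an encoding of None would suggest that the response carried no usable Content-Type information at all, which is consistent with what the POST request printed.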

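See the link above for the detailed explanation. In short, the bytes end up being decoded with a different character set from the one they were encoded with, and that mismatch produces the garbled text.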
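The mismatch is easy to reproduce offline. The small sketch below (my own illustration, not from the linked article) encodes “一” as UTF-8, decodes the bytes as ISO-8859-1, and shows that the result is three unrelated characters.

```python
# UTF-8 bytes decoded with the wrong codec (ISO-8859-1 / latin-1)
original = '一'
raw_bytes = original.encode('utf-8')       # what a UTF-8 server would send: 3 bytes

garbled = raw_bytes.decode('iso-8859-1')   # wrong codec: one character per byte
correct = raw_bytes.decode('utf-8')        # right codec: the original character

# latin-1 maps every byte to a character, so the damage is reversible in this case
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(ascii(garbled))        # '\xe4\xb8\x80'
print(repaired == original)  # True
```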

So the key point of this post is:

use headers = {'content-type': 'charset=utf8'},

that is, set the encoding through the request header. That solved the problem for me.
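The header above is the fix this post settled on. A different approach that is often used for the same symptom (not taken from the original post) is to choose the encoding on the response side when none was declared. The sketch below assumes the body is UTF-8 whenever response.encoding is None; decode_response is a hypothetical helper, and a SimpleNamespace stands in for a real requests response so the example runs without network access.

```python
from types import SimpleNamespace


def decode_response(response, fallback='utf-8'):
    """Return the body as text, using `fallback` when no encoding was declared.

    `response` is any object with `.content` (bytes) and `.encoding`
    (str or None), such as the object returned by requests.post().
    """
    encoding = response.encoding or fallback
    # errors='replace' keeps going if the assumed encoding turns out to be wrong
    return response.content.decode(encoding, errors='replace')


# Stand-in for a response whose headers declared no charset
fake_response = SimpleNamespace(content='一'.encode('utf-8'), encoding=None)
text = decode_response(fake_response)
print(text == '一')
```

With requests, the same helper could be called as decode_response(requests.post(url=url, data=data)), assuming the site really does send UTF-8.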
