Python crawlers with the Requests module: decoding response.content and response.text

Today let's talk about how to decode web content crawled with the Requests module.

The difference between response.text and response.content

The get and post methods of requests return a Response object. This object stores everything the server sent back, including the response headers and the response status code. The returned page body is available through two attributes, .content and .text.
The difference between the two: content holds the raw byte stream, while text holds that same content decoded into a Unicode string, using the encoding that the requests module guesses for it.
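To make the bytes versus string distinction concrete, here is a minimal sketch that needs no network access. The variables raw_bytes and decoded_text are hypothetical stand-ins for response.content and response.text; they are not produced by requests itself.

```python
# Stand-in for response.content: the raw bytes a server might send.
raw_bytes = "巴基斯坦".encode("utf-8")

# Stand-in for response.text: the same bytes decoded into a Unicode string.
decoded_text = raw_bytes.decode("utf-8")

# bytes are counted in bytes, str is counted in characters.
print(type(raw_bytes), len(raw_bytes))
print(type(decoded_text), len(decoded_text))

# Printing the bytes shows escape sequences; printing the str shows readable text.
print(raw_bytes)
print(decoded_text)
```

The same four characters take up more than four bytes on the wire, which is why printing content directly looks like a mess.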
 
We usually have to decode the output of response.content. The page content arrives encoded as bytes, whereas Python strings are Unicode. We want to read those strings, not a mess of bytes, so whatever we crawl needs to be decoded.
 
So how do we decode it? (This is the key point.)
Before writing any code, we need to find out the page's encoding.
 

First, find the encoding of the page you want to crawl

Steps: press F12 => Elements (page source) => look for the charset declaration inside <head>. Its value is the page's encoding.
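The same lookup can also be done in code instead of by eye. The sketch below is only an illustration: find_meta_charset, META_CHARSET and sample_page are hypothetical names, and the regular expression only covers the two common forms of the meta tag, so treat it as a rough helper rather than a full HTML parser.

```python
import re

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def find_meta_charset(html_bytes):
    """Return the charset declared in a <meta> tag, or None if there is none."""
    match = META_CHARSET.search(html_bytes)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


# A tiny made-up page; with requests you would pass response.content instead.
sample_page = b'<html><head><meta charset="UTF-8"><title>demo</title></head></html>'
print(find_meta_charset(sample_page))
```

The search runs on bytes on purpose: at this point the encoding is still unknown, so the page cannot be safely decoded yet.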

 

Once you know the page's encoding, you can go on to write the code.

1. Output using content

print(response.content.decode('utf-8'))  # decode('utf-8') decodes the UTF-8 encoded bytes into a Unicode string

2. Output using text

response.encoding = 'utf-8'  # tell requests the page's encoding, so that text does not guess the encoding and produce garbled output
print(response.text)
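One thing to be aware of with method 1: a strict decode raises UnicodeDecodeError if the page contains bytes that are not valid in the chosen encoding. The sketch below shows one possible way to cope with that. The helper name decode_page is hypothetical, and falling back to errors="replace" is my own assumption about what is acceptable, not something the requests library prescribes.

```python
def decode_page(raw_bytes, encoding="utf-8"):
    """Decode page bytes strictly, falling back to replacement characters."""
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        # Invalid bytes become U+FFFD instead of aborting the whole crawl.
        return raw_bytes.decode(encoding, errors="replace")


# With requests, raw_bytes would be response.content.
clean = decode_page("中文".encode("utf-8"))
damaged = decode_page(b"\xff\xfeabc")
print(clean)
print(damaged)
```

Whether to replace bad bytes or to stop and investigate depends on the crawl; for a quick look at a page, replacement is usually good enough.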

Reference code (the response used above comes from this code)

import requests


kw = {'wd':'巴基斯坦'}


headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}


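# params accepts a dict or a string of query parameters; a dict is URL-encoded automatically, so urlencode() is not needed
# this is one of the conveniences of the requests library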

response = requests.get("http://www.baidu.com/s", params = kw, headers = headers)


# View the response body; response.text returns the data as a Unicode string
print(response.text)


# View the response body; response.content returns the raw byte stream
print(response.content)


# View the full URL
print(response.url)


# View the character encoding taken from the response headers
print(response.encoding)


# View the response status code
print(response.status_code)
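The comment on params in the reference code says a dict is URL-encoded automatically. To see roughly what that means without sending a request, the standard library's urlencode can be applied to the same dict. This is a sketch of the idea only; I am assuming the query string requests builds is equivalent, and query_string is a name made up for this example.

```python
from urllib.parse import parse_qs, urlencode

kw = {"wd": "巴基斯坦"}

# Each non-ASCII character is turned into percent-escaped UTF-8 bytes.
query_string = urlencode(kw)
print(query_string)

# Parsing the query string gives back the original value.
print(parse_qs(query_string))
```

After a real request, print(response.url) in the reference code shows the full address with the encoded query attached.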

 

Some of you reading this may have a question: can't I just use print(response.encoding) to see the page's encoding, and then decode with that? Why not?

 

Answer: of course you can, ha ha (this is a little Easter egg: there are in fact two ways to find a page's encoding).

But sometimes it is wrong. For example, when crawling Baidu, print(response.encoding) reports the single-byte encoding ISO-8859-1.

But when I opened the Baidu site and pressed F12, the page was indeed declared as 'utf-8', so I decided to decode with utf-8, because a single-byte encoding cannot represent Chinese characters.

So when one method gives the wrong answer, try the other (these two are the only ones I know of so far, since I am still a beginner).
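What goes wrong in the Baidu case can be reproduced offline. As far as I know, requests falls back to ISO-8859-1 when the Content-Type header names a text type without a charset. The sketch below simulates that situation with plain bytes: page_bytes is a hypothetical stand-in for response.content, and the sample string is made up.

```python
original = "百度一下"
page_bytes = original.encode("utf-8")  # what the server actually sends

# Decoding UTF-8 bytes as ISO-8859-1 never fails, it just yields garbage.
garbled = page_bytes.decode("iso-8859-1")

# Decoding with the encoding declared in <head> gives readable text.
correct = page_bytes.decode("utf-8")

# Garbled text can be repaired by reversing the wrong step.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

# A single-byte encoding has no room for Chinese characters.
try:
    original.encode("iso-8859-1")
    latin1_can_hold_chinese = True
except UnicodeEncodeError:
    latin1_can_hold_chinese = False

print(garbled)
print(correct)
print(recovered)
print(latin1_can_hold_chinese)
```

Because the wrong decode raises no error, garbled output is often the only sign that the reported encoding should not be trusted.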

You are all clever readers who love to think things through, hee hee (so how about giving me a like?)

----------------------------------------------------------------------------------------------

 

Of course, some of you will say: I still don't quite understand this encoding business, can you sort it out for me?

Ha ha, then give me a few more likes.

Bonus knowledge:

What encode() and decode() mean, and how they differ

decode means to decode; encode means to encode.

Python represents strings internally as Unicode. When converting between encodings, Unicode therefore usually serves as the intermediate form: first decode the bytes in the source encoding into Unicode, then encode from Unicode into the target encoding.

decode converts bytes in some encoding into a Unicode string. For example, str101.decode('utf-8') converts the UTF-8 encoded bytes str101 into a Unicode string. (In Python 3, decode is a method of bytes objects.)

encode converts a Unicode string into bytes in some encoding. For example, str101.encode('gb2312') converts the Unicode string str101 into GB2312 encoded bytes.
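Put together, converting from one encoding to another always passes through a Unicode string. Here is a small sketch of that round trip in Python 3; the variable names and the sample text are chosen only for illustration.

```python
# Bytes in the source encoding (GB2312), e.g. from an older Chinese site.
gb_bytes = "中文".encode("gb2312")

# Step 1: decode the source bytes into a Unicode string.
text = gb_bytes.decode("gb2312")

# Step 2: encode the Unicode string into the target encoding (UTF-8).
utf8_bytes = text.encode("utf-8")

print(gb_bytes, len(gb_bytes))
print(text, len(text))
print(utf8_bytes, len(utf8_bytes))
```

The text is identical in both cases; only the bytes used to store it differ between the two encodings.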

That is roughly what they mean, ha ha.

Finally, here is a treasure map for you (if it helped, remember to give me a like).


Origin blog.csdn.net/m0_46397094/article/details/105349772