Python: 3 solutions to [Chinese garbled characters] in a requests crawler

Requests is a simple, easy-to-use HTTP library and one of the most commonly used libraries for writing crawlers in Python.
[Chinese garbled characters] are among the most common problems with it, and they are especially troublesome for beginners.
This article explains in detail why [Chinese garbled characters] appear when writing crawlers with the requests library in Python, and presents three common solutions.

1. [Chinese garbled characters]: what they look like and why they occur

(1) Example of [Chinese garbled characters]

In this article, [Chinese garbled characters] means that the Chinese content of the original web page becomes completely unrecognizable after it is fetched with requests. This is different from text that is merely displayed as escape sequences such as \x and \u.
Note: the requests.get() method returns a response object, which stores the content of the server's response.
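
The symptom can be reproduced offline, without any network request. The following is a minimal sketch, assuming a page encoded as GB2312 whose bytes are decoded as ISO-8859-1; the sample string is illustrative and not taken from the original page.

```python
# Offline simulation of the symptom: no network request is made.
original = "中文网页"  # illustrative sample text (an assumption)

# Bytes as a server using GB2312 would send them (2 bytes per character here).
raw_bytes = original.encode("gb2312")

# Decoding with the wrong codec yields one Latin-1 character per byte.
garbled = raw_bytes.decode("iso-8859-1")

print(repr(garbled))                 # unreadable accented Latin characters
print(len(original), len(garbled))   # the garbled text is twice as long
```

The garbled string still carries every original byte, which is why the problem can be repaired afterwards instead of requiring a new download.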

(2) Why [Chinese garbled characters] appear

The [Chinese garbled characters] described above have the following cause:
The wrong way of reading the response text was chosen and no encoding was specified in the code, so the encoding that [response.text] inferred automatically does not match the actual encoding of the web page, which produces [Chinese garbled characters].
When using the requests library, you may have formed a habit: [response.text] for text responses, and [response.content] for pictures, videos, and other binary data.
The biggest difference between the two is:
1. [response.text] automatically infers the encoding of the web page from the HTTP headers, decodes the body, and returns the decoded text.
2. [response.content] is not decoded; the body is returned directly as bytes.
The two ways of reading the response are summarized in the following table:

| Attribute | Definition | Return type | Commonly used for |
|---|---|---|---|
| response.text | The response body, decoded automatically using the character encoding inferred from the HTTP response headers. | str | response text |
| response.content | The raw response body; it is not decoded and no encoding is inferred from the HTTP headers. | bytes (binary) | pictures, videos |
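
The relationship between the two attributes can be modeled in a few lines. This is a simplified sketch, not the library's actual source: `text_from_content` is a hypothetical helper, and the byte string stands in for `response.content`.

```python
def text_from_content(content: bytes, encoding: str) -> str:
    """Simplified model of response.text: decode response.content with one encoding."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return content.decode(encoding, errors="replace")


content = "编码".encode("gb2312")  # stand-in for response.content

print(type(content).__name__)                               # bytes
print(type(text_from_content(content, "gb2312")).__name__)  # str
print(text_from_content(content, "gb2312") == "编码")        # True
print(text_from_content(content, "iso-8859-1") == "编码")    # False
```

The bytes never change; only the encoding chosen for decoding decides whether the resulting str is readable.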

2. Three ways to deal with [Chinese garbled characters]

(1) Modify the method of obtaining web page text

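As explained above, the cause is that the text was obtained the wrong way. The simplest and most direct fix is to replace response.text with response.content.
Because response.content is raw bytes, either pass it to an HTML parser that detects the encoding from the document itself, or decode it explicitly with the page's actual encoding, for example response.content.decode('gb2312').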
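
One way to work with the raw bytes is to read the charset declared inside the HTML and decode with it. The sketch below assumes the page declares its charset near the top; `decode_html_bytes` is a hypothetical helper written for illustration, not part of requests, and the sample page is made up.

```python
import re

# Matches declarations such as <meta charset="gb2312"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def decode_html_bytes(content: bytes, default: str = "utf-8") -> str:
    """Decode raw HTML bytes using the charset declared in the page, if any."""
    match = _CHARSET_RE.search(content[:2048])  # declarations sit near the top
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return content.decode(encoding, errors="replace")
    except LookupError:  # the page declared a codec name Python does not know
        return content.decode(default, errors="replace")


page = '<html><head><meta charset="gb2312"><title>中文标题</title></head></html>'
raw = page.encode("gb2312")  # stand-in for response.content
print(decode_html_bytes(raw) == page)  # True
```

With a live response this would be called as `decode_html_bytes(response.content)`. HTML parsers that accept bytes perform a similar detection internally.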

(2) Manually specify the web page encoding and then extract the text

As explained above, [response.text] decodes the body before returning it, but the encoding it uses does not match the encoding of the original web page, which produces [Chinese garbled characters].
The response object also provides [response.encoding], which sets the encoding used to decode the returned web page.
So the solution is to specify the web page encoding manually to get normal text.
This method is more troublesome than the first one: you first need to confirm the actual encoding of the original web page, and then set the encoding accordingly.
The specific steps are as follows:
1. Check the web page encoding.
There are two ways to check the web page encoding:
(1) Open the web page source (HTML) directly with [Ctrl+U] and look at the value of [charset].
(2) Use the encoding and apparent_encoding attributes of the response to get the web page encoding.
The biggest difference between encoding and apparent_encoding:
encoding is extracted from the response headers, while apparent_encoding is detected from the bytes of the response body. When the headers declare no charset, apparent_encoding is usually the more accurate of the two, although it is still a guess.
Details are as follows:

| Attribute | Definition |
|---|---|
| response.encoding | The encoding taken from the charset field of the Content-Type response header. If the header declares a text type but has no charset field, the encoding defaults to ISO-8859-1. ISO-8859-1 cannot represent Chinese, which is why Chinese characters come out garbled. |
| response.apparent_encoding | The encoding detected by analyzing the content of the response body (the HTML source). It therefore usually reflects the actual encoding of the original web page more reliably than encoding does. |
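
The header rule in the table can be sketched with the standard library alone. This is a simplified model under stated assumptions, not the library's actual implementation: `encoding_from_content_type` is a hypothetical name, and special cases that real versions of requests may handle are left out.

```python
from email.message import Message
from typing import Optional


def encoding_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Simplified model of how response.encoding is derived from the headers."""
    if not content_type:
        return None
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset.strip("'\"")
    if header.get_content_maintype() == "text":
        return "ISO-8859-1"  # default for text types that declare no charset
    return None


print(encoding_from_content_type("text/html; charset=GB2312"))  # GB2312
print(encoding_from_content_type("text/html"))                  # ISO-8859-1
print(encoding_from_content_type("image/png"))                  # None
```

The second call is the case that produces [Chinese garbled characters]: the server sends a text type without a charset, so the default is used even though the body is GB2312.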

Taking the URL in (1) as an example, the real encoding of the web page is [GB2312].
For this page, encoding and apparent_encoding return different results, and apparent_encoding matches the actual encoding of the original web page.
2. Manually specify the text encoding.
Once the actual encoding of the original web page is known from the steps above, specify that encoding manually in the code to solve the [Chinese garbled characters] problem.
There are two ways to write this, and either one works: set response.encoding to the encoding name directly, or set it to response.apparent_encoding.
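
Both ways of setting the encoding can be sketched offline. The original screenshot is not available, so this is a reconstruction under assumptions: `FakeResponse` is a hypothetical stand-in that mimics only the attributes used here, and `text_with_encoding` is an illustrative helper, not a requests function.

```python
class FakeResponse:
    """Hypothetical offline stand-in for the attributes of requests.Response used here."""

    def __init__(self, content: bytes, encoding: str, apparent_encoding: str):
        self.content = content
        self.encoding = encoding                    # as derived from the headers
        self.apparent_encoding = apparent_encoding  # as detected from the body

    @property
    def text(self) -> str:
        return self.content.decode(self.encoding, errors="replace")


def text_with_encoding(response, encoding=None) -> str:
    """Set response.encoding explicitly, then read response.text."""
    # Way 1: a known encoding name. Way 2: fall back to apparent_encoding.
    response.encoding = encoding or response.apparent_encoding
    return response.text


body = "手动指定编码".encode("gb2312")
resp = FakeResponse(body, encoding="ISO-8859-1", apparent_encoding="GB2312")
print(text_with_encoding(resp, "gb2312") == "手动指定编码")  # True
```

With a live response, the same assignment is made on the object returned by requests.get() before response.text is read: either response.encoding = 'gb2312' or response.encoding = response.apparent_encoding.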

(3) Transcoding [Chinese garbled characters] after text acquisition

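In addition to the two solutions above, you can use Python's built-in string methods to transcode the [Chinese garbled characters] after the fact: encode the garbled text back into the bytes it came from, then decode those bytes with the actual encoding of the web page. This works when response.text was decoded as ISO-8859-1, because ISO-8859-1 maps every byte to exactly one character, so the original bytes can be recovered without loss.
Transcoding method: encode('iso-8859-1').decode('encoding format')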
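
The round trip can be checked offline. A minimal sketch, assuming the text was garbled by an ISO-8859-1 decode; `recode` is a hypothetical helper name and the sample string is illustrative.

```python
def recode(garbled: str, actual_encoding: str) -> str:
    """Undo an ISO-8859-1 mis-decode and decode again with the real encoding."""
    raw_bytes = garbled.encode("iso-8859-1")  # recover the original bytes
    return raw_bytes.decode(actual_encoding)


# Simulate a garbled response.text: GB2312 bytes decoded as ISO-8859-1.
garbled = "爬虫".encode("gb2312").decode("iso-8859-1")
print(recode(garbled, "gb2312") == "爬虫")  # True
```

If the text was not produced by an ISO-8859-1 decode, the encode step raises UnicodeEncodeError for any character outside that range, so this method applies only to that case.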
For the example page, whose actual encoding is "gb2312", the code becomes: response.text.encode('iso-8859-1').decode('gb2312')

The above covers the causes of [Chinese garbled characters] when writing a crawler in Python with the requests library, and three common ways to handle them, for reference.

-end

Origin blog.csdn.net/LHJCSDNYL/article/details/131755340