Understanding Python character encoding in one article (encodings, mojibake, and the causes of errors)

1. The emergence of character encoding

  • Computers can only process numbers, so to process text, the text must first be converted into numbers.

  • A character encoding is therefore a code table that maps characters to codes (which you can simply think of as numbers), with each character corresponding to exactly one code. Text is saved on the computer as these codes, and when the text is displayed, the codes are looked up in the table to recover the characters.
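In Python, this character-to-code mapping is exposed directly by the built-ins ord() and chr(); a minimal sketch:

```python
# ord() returns the code assigned to a character; chr() maps a code back to its character.
print(ord('A'))      # 65
print(chr(65))       # A
print(ord('你'))     # 20320, the code of this Chinese character in Unicode
print(chr(20320))    # 你
```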

2. Different encoding methods

Does the approach above look perfect? Unfortunately, there are thousands of languages in the world, and covering every language's characters in a single table is very difficult. As a result, many countries created their own codes: for example, gbk is the national standard code used in China, Japan encodes Japanese with Shift_JIS, and South Korea encodes Korean with Euc-kr. With each country keeping its own standard, text encoded one way and decoded another turns into garbled characters (mojibake). The most common encodings are:

  1. ASCII encoding

At first, only 128 characters (codes 0 to 127) were encoded into the computer: upper- and lowercase English letters, digits, and some symbols. This code table is called ASCII.

  2. GB2312 and GBK encoding

China formulated the GB2312 code to encode Chinese, and later released the GBK code. GB2312 is a simplified-Chinese encoding standard, while GBK is a much larger character set: it covers not only Simplified but also Traditional Chinese, as well as double-byte characters of other East Asian scripts such as Japanese and Korean.

  3. Unicode character set

To solve the mojibake caused by incompatible encodings, the Unicode character set came into being. Unicode unifies all languages into one character set, so garbled text can no longer arise from mixing national encodings. It assigns every character in every language a unified, unique code point, meeting the needs of cross-language, cross-platform text processing.
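In Python 3, a str is a sequence of Unicode code points, so the unique code of any character can be inspected directly; a quick sketch (the sample characters are arbitrary):

```python
# Each character has exactly one Unicode code point, conventionally written U+XXXX.
for ch in 'A你こ한':
    print(ch, hex(ord(ch)))   # A 0x41, 你 0x4f60, こ 0x3053, 한 0xd55c
```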

  4. UTF-8

If all text were stored as Unicode, the mojibake problem would disappear. However, if your text is almost entirely English, storing it as fixed-width two-byte Unicode takes twice as much space as ASCII, which is very uneconomical for storage and transmission. In the spirit of saving space, UTF-8 appeared: a "variable-length encoding" of Unicode that encodes each code point into 1 to 4 bytes depending on its value. Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters are encoded into 4 bytes.
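The variable length is easy to verify by encoding single characters from different scripts and counting the resulting bytes; a small sketch:

```python
# UTF-8 spends a different number of bytes per character.
print(len('A'.encode('utf-8')))     # 1 byte: ASCII range
print(len('é'.encode('utf-8')))     # 2 bytes: Latin-1 supplement
print(len('你'.encode('utf-8')))    # 3 bytes: common CJK
print(len('😀'.encode('utf-8')))    # 4 bytes: characters beyond U+FFFF, such as emoji
```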

3. Interpretation of the reasons for garbled characters

Among these different encodings, one point deserves our attention:

  • GB2312, GBK and UTF-8 are compatible with ASCII codes. That is to say, when we encode English characters and numbers, no matter which encoding method we use, they can be interpreted correctly.

This also shows that for a text file of pure English and numbers, no matter which encoding we use, there will be no garbled characters.

  • Therefore, most of the garbled characters we often encounter are due to the different encoding methods of GBK and UTF-8 for Chinese.

When we use GBK encoding to save a text file containing Chinese, and then decode the file through UTF-8, garbled characters will appear.

Similarly, when we use UTF-8 encoding to save a text file containing Chinese, and then decode the file through GBK, garbled characters will appear.

The reason can be clearly seen through simple Python code:

print('ABC'.encode('ascii'))    # encode 'ABC' with ASCII
print('ABC'.encode('gbk'))
print('ABC'.encode('utf-8'))

# print('你好'.encode('ascii'))    # raises an error: ASCII cannot encode Chinese
print('你好'.encode('gbk'))
print('你好'.encode('utf-8'))

The output is:

b'ABC'
b'ABC'
b'ABC'
b'\xc4\xe3\xba\xc3'
b'\xe4\xbd\xa0\xe5\xa5\xbd'

This short program shows that English characters produce identical bytes under ASCII, GBK, and UTF-8, so they can never be misread and never turn into mojibake. Chinese characters are different: GBK encodes each one into two bytes while UTF-8 encodes each into three, so the two encodings are incompatible. Decoding UTF-8-encoded text as GBK, or GBK-encoded text as UTF-8, therefore yields mojibake or an outright error.
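The two-versus-three-byte difference stated above can be checked directly:

```python
s = '你好'
gbk_bytes = s.encode('gbk')       # GBK: 2 bytes per Chinese character
utf8_bytes = s.encode('utf-8')    # UTF-8: 3 bytes per Chinese character
print(gbk_bytes, len(gbk_bytes))      # 4 bytes in total
print(utf8_bytes, len(utf8_bytes))    # 6 bytes in total
```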

4. Supplement (garbled code or error?)

You may have noticed that when Python interprets text with an encoding different from the one that produced it, sometimes you get mojibake, but sometimes an error is raised directly. Why is that?

Looking at the code above, you may already suspect the answer: a Chinese character is two bytes in GBK but three bytes in UTF-8, so decoding a single Chinese character with the wrong codec will almost always raise an error.

print('你'.encode('gbk').decode('utf-8'))    # encode '你' with gbk, then decode with utf-8
print('你'.encode('utf-8').decode('gbk'))

Both statements raise an error, and it is the decoding error that everyone runs into most often.

The first statement:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

The second statement:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa0 in position 2: incomplete multibyte sequence
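The position in the error message can also be read programmatically: a UnicodeDecodeError carries a start attribute marking the offending byte. A sketch for the second statement:

```python
# '你' is three bytes in UTF-8; GBK consumes them two at a time,
# leaving one byte that cannot start a valid GBK sequence.
data = '你'.encode('utf-8')
print(data)    # b'\xe4\xbd\xa0'
try:
    data.decode('gbk')
except UnicodeDecodeError as e:
    print(e.start, e.reason)    # 2 incomplete multibyte sequence
```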

A single Chinese character raises an error. So what happens with two Chinese characters?

print('你好'.encode('utf-8').decode('gbk'))    # encode '你好' with utf-8, then decode with gbk
print('你好'.encode('gbk').decode('utf-8'))

The result: the first statement prints 浣犲ソ, and the second raises a UnicodeDecodeError.

The first statement decodes into three characters completely different from the original two, while the second statement raises an error. Can you see why?

In UTF-8 each Chinese character is encoded as three bytes, so two characters make 6 bytes; in GBK each character is two bytes, so GBK splits those 6 bytes into 3 pairs and decodes them, without complaint, as 3 characters.

In the second statement, GBK encodes the two characters into 4 bytes, but those bytes are not valid UTF-8: the first byte 0xc4 announces a two-byte sequence, yet the byte that follows is not a valid continuation byte, hence the "invalid continuation byte" error.

From this we can draw a rough rule of thumb: UTF-8-encoded Chinese, whose byte count GBK can pair up evenly, will often decode under GBK without an error, but the content will be wrong; conversely, GBK-encoded bytes only occasionally happen to form valid UTF-8. Neither direction is guaranteed: a wrong-codec decode succeeds only when every byte group also happens to be a valid code in the other encoding, and otherwise it raises an error.
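One practical consequence: when the wrong decode does not raise an error, the mojibake still carries the original bytes, so re-encoding it with the wrong codec and then decoding with the right one recovers the original text. A sketch of this repair trick:

```python
original = '你好'
# Encode with UTF-8 but decode with GBK: no error, just mojibake.
mojibake = original.encode('utf-8').decode('gbk')
print(mojibake)    # three unrelated characters
# Reverse the mistake: re-encode with GBK to get the bytes back, then decode with UTF-8.
recovered = mojibake.encode('gbk').decode('utf-8')
print(recovered)   # 你好
```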


Origin blog.csdn.net/lyb06/article/details/129676450