[Python] In-depth analysis of Chinese garbled problems and solutions

For a long time, Chinese text in Python has been an enormous headache, with encoding-conversion exceptions thrown all the time. So what exactly are str and unicode in Python? (Everything below refers to Python 2, where str and unicode are separate types.)

Throughout this article, the character "哈" is used as the running example. Its various encodings are:
1. Unicode code point: U+54C8 (UTF-16LE bytes: C8 54);
2. UTF-8: E5 93 88;
3. GBK: B9 FE.
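These values are easy to verify in an interpreter session. A quick sketch (the byte strings are the encodings listed above; runs under Python 2 or 3):

```python
# The single character 哈 is code point U+54C8.
ha = u'\u54c8'

# UTF-16LE stores the low byte first: C8 54.
print(repr(ha.encode('utf-16-le')))

# UTF-8 uses three bytes for this character: E5 93 88.
print(repr(ha.encode('utf-8')))

# GBK uses two bytes: B9 FE.
print(repr(ha.encode('gbk')))
```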


1. str and unicode in python

When people talk about unicode in Python, they usually mean unicode objects. For example, the unicode object for '哈哈' is:

u'\u54c8\u54c8'

str, on the other hand, is a byte array: it is the stored form of a unicode object after encoding (which may be utf-8, gbk, cp936, GB2312, etc.). By itself it is just a byte stream with no further meaning; to make the content of that byte stream display as meaningful text, you must decode it with the correct encoding.
For example:
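(The original snippet did not survive here; this is a minimal reconstruction of the step described, using the variable name s_utf8 from the discussion.)

```python
s = u'\u54c8\u54c8'            # the unicode object for 哈哈
s_utf8 = s.encode('utf-8')     # a str (byte array) in UTF-8
print(repr(s_utf8))            # the bytes E5 93 88 E5 93 88 (shown as b'...' in Python 3)
```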
 

Encode the unicode object into a UTF-8 str, s_utf8. s_utf8 is a byte array storing '\xe5\x93\x88\xe5\x93\x88', and nothing more than a byte array. If you hope to output it as 哈哈 via a print statement, you will be disappointed. Why?

Because print simply hands the bytes to the operating system, and the OS decodes the incoming byte stream using the system encoding. That explains why the UTF-8 bytes of "哈哈" come out as "鍝堝搱": '\xe5\x93\x88\xe5\x93\x88' is being interpreted as GB2312/GBK, and so displays as "鍝堝搱". To repeat: a str records a byte array, a storage format under some particular encoding; what appears on screen or in a file depends entirely on which encoding is used to decode it.

One small supplementary note on print: when a unicode object is passed to print, it is first converted internally to the local default encoding (this is just my guess).

2. Converting between str and unicode objects

Conversion between str and unicode objects is done with encode and decode. For example, to convert the GBK bytes of '哈哈' to unicode and then to UTF-8:
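A sketch of that round trip, using the byte values from the table at the top:

```python
s_gbk = b'\xb9\xfe\xb9\xfe'    # 哈哈 encoded as GBK
u = s_gbk.decode('gbk')        # decode: str -> unicode object u'\u54c8\u54c8'
s_utf8 = u.encode('utf-8')     # encode: unicode -> UTF-8 str
print(repr(s_utf8))            # the bytes E5 93 88 E5 93 88
```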

3. setdefaultencoding

As in the demo code above: encoding s (a gbk str) directly into utf-8 throws an exception, but after executing the following code the conversion succeeds:

import sys

reload(sys)
sys.setdefaultencoding('gbk')

Why? In Python's encode/decode machinery for str and unicode, when a str is encoded directly into another encoding, the str is first decoded into unicode, using the default encoding. The default encoding is normally ascii, which is why the first conversion in the example above fails; once the default encoding is set to 'gbk', it no longer errors.
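That implicit decode step can be made visible explicitly. A sketch (the try/except imitates what Python 2's str.encode does internally with the default ascii codec; runs under Python 2 or 3):

```python
s = u'\u54c8\u54c8'.encode('gbk')   # GBK bytes b'\xb9\xfe\xb9\xfe'

# In Python 2, s.encode('utf-8') really does s.decode(defaultencoding).encode('utf-8').
# With the default ascii codec, that hidden first step is what blows up:
try:
    s.decode('ascii').encode('utf-8')
except UnicodeDecodeError as e:
    print('implicit decode failed: ' + type(e).__name__)

# Decoding explicitly with the right codec works without touching defaultencoding:
print(repr(s.decode('gbk').encode('utf-8')))
```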

As for reload(sys): since Python 2.5, sys.setdefaultencoding is deleted from the sys module once the interpreter finishes initializing, so we need to reload the module to get it back.

4. Working with files in different encodings

Create a file Test.txt in ANSI format (i.e. the system locale encoding, GBK on a Chinese Windows system), with the content:

abc中文

Use python to read:

# coding=gbk

print(open("Test.txt").read())

result:

abc中文

Change the file format to UTF-8:

result:

abc涓枃

Obviously, decoding is required here:

# coding=gbk

print(open("Test.txt").read().decode("utf-8"))

result:

abc中文

I edited the Test.txt above with EditPlus. But when I edited it with Windows' built-in Notepad and saved it as UTF-8, running the script raised an error:

Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    print open("Test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, the UTF-8 BOM) at the start of a file when saving it as UTF-8.

Therefore, we need to strip these bytes when reading. Python's codecs module defines a constant for them:

# coding=gbk

import codecs
data = open("Test.txt").read()

# Strip the BOM if present, then decode whatever remains.
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print(data.decode("utf-8"))

result:

abc中文
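As a side note not in the original: the codecs machinery also ships a 'utf-8-sig' codec that consumes the BOM automatically, so the manual slicing can be avoided:

```python
data = b'\xef\xbb\xbfabc'          # UTF-8 BOM followed by "abc"
print(data.decode('utf-8-sig'))    # the BOM is consumed, leaving just: abc
```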

5. The file's encoding format and the role of the coding declaration

What effect does a source file's encoding format have on the strings declared in it? This question bothered me for a long time, and I finally have some clue: the file's encoding determines the encoding of the string literals declared in that source file. For example:

str = '哈哈'
print(repr(str))

a. If the file format is utf-8, the value of str is '\xe5\x93\x88\xe5\x93\x88' (the UTF-8 encoding of 哈哈);

b. If the file format is gbk, the value of str is '\xb9\xfe\xb9\xfe' (the GBK encoding of 哈哈).

As section 1 explained, a Python str is just a byte array, so when the str from case (a) is printed to a GBK-encoded console it shows up as the mojibake 鍝堝搱; and when the str from case (b) is printed to a UTF-8-encoded console, it is also garbled, in fact nothing visible at all: '\xb9\xfe\xb9\xfe' is not valid UTF-8, so decoded that way it renders blank. >_<

Having covered the file format, let's talk about what the coding declaration does. At the top of each file, a line like # coding=gbk declares the encoding, but what is it actually for? As far as I can tell, it serves three purposes:

  1. It declares that non-ascii characters (usually Chinese) will appear in the source file;
  2. In an advanced IDE, the IDE will save your file in the encoding format you specify;
  3. Most confusingly, it determines the encoding used to decode unicode literals such as u'哈哈' in the source into unicode objects. See the example:
#coding:gbk

ss = u'哈哈'
print(repr(ss))
print('ss:%s' % ss)

Save this code as UTF-8 text and run it. What do you think it will print? Everyone's first instinct is:

u'\u54c8\u54c8'

ss:哈哈

But the actual output is:

u'\u935d\u581d\u6431'

ss:鍝堝搱

Why does this happen? The coding declaration is the culprit. When ss = u'哈哈' runs, the whole process breaks down into the following steps:

1) Obtain the bytes of '哈哈': determined by the file's encoding format, they are '\xe5\x93\x88\xe5\x93\x88' (the UTF-8 form of 哈哈).

2) Convert to unicode: during the conversion, '\xe5\x93\x88\xe5\x93\x88' is decoded not with utf-8 but with the encoding given in the coding declaration, GBK. Decoded as GBK, the bytes yield the three characters '鍝堝搱', whose unicode code points are u'\u935d\u581d\u6431'. That is exactly why print(repr(ss)) outputs u'\u935d\u581d\u6431'.
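Both steps can be reproduced without creating any source file (the expected code points are the ones from the output above):

```python
utf8_bytes = u'\u54c8\u54c8'.encode('utf-8')   # what a UTF-8 source file stores for 哈哈
ss = utf8_bytes.decode('gbk')                  # what '# coding:gbk' makes the compiler do
print(repr(ss))                                # u'\u935d\u581d\u6431', i.e. 鍝堝搱
```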

OK, that was a bit of a twist. Now let's analyze the next example:

#-*- coding:utf-8 -*-

ss = u'哈哈'
print(repr(ss))
print('ss:%s' % ss)

Save this example in GBK encoding form this time, and the running result turns out to be:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xb9 in position 0: unexpected code byte

Why a utf8 decode error here? Think back to the previous example: in the first step, because the file encoding is GBK, the bytes obtained for "哈哈" are its GBK encoding, '\xb9\xfe\xb9\xfe'. In the second step, converting to unicode, UTF-8 is used to decode '\xb9\xfe\xb9\xfe'; checking the UTF-8 encoding rules (for an explanation of UTF-8, see the character-encoding notes on ASCII, UTF-8 and UNICODE) shows that no such byte sequence is valid, so the error above is raised.
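This failure is easy to reproduce directly, mirroring the two steps just described:

```python
gbk_bytes = u'\u54c8\u54c8'.encode('gbk')   # b'\xb9\xfe\xb9\xfe', from a GBK source file
try:
    gbk_bytes.decode('utf-8')               # what '# coding:utf-8' asks the compiler to do
except UnicodeDecodeError as e:
    print('decode failed: ' + type(e).__name__)   # 0xb9 is not a valid UTF-8 start byte
```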

I hope this is helpful to you; if you have other questions, feel free to discuss them in the comment section~


Origin blog.csdn.net/Xuange_Aha/article/details/130441906