Understanding the Python file-reading error UnicodeDecodeError: 'gbk' codec can't decode byte

Problem Description:

Here is a very simple example of reading and printing the contents of a file:

with open('test.txt', 'r') as f:
    contents = f.read()

print(contents)

The test.txt text file contains a single Chinese character, 你 ("you").

However, when we run this code, we get the following error:

Error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa0 in position 2: incomplete multibyte sequence
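The same failure can be reproduced on any platform, without relying on the system's default encoding, by decoding the bytes directly. This is a minimal sketch that assumes test.txt holds the single character 你 saved as UTF-8 (an assumption inferred from the byte 0xa0 at position 2 in the error); the helper name `try_decode_as_gbk` is hypothetical.

```python
# Assumed contents of test.txt: the character 你 saved as UTF-8 (3 bytes).
raw = "你".encode("utf-8")


def try_decode_as_gbk(data: bytes) -> str:
    """Decode data as gbk; on failure, return a summary of the error instead."""
    try:
        return data.decode("gbk")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        bad_byte = data[exc.start]
        return f"byte {bad_byte:#04x} at position {exc.start}: {exc.reason}"


message = try_decode_as_gbk(raw)
print(message)
```

The first two bytes are consumed as one gbk character, which leaves the third byte with no partner.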

Problem Analysis:

  1. First of all, we need to know what this error means.

In plain language, the error message says:

Unicode decoding error: the 'gbk' codec cannot decode byte 0xa0 at position 2, because it is an incomplete multibyte sequence.

  2. From the error message we know that this is a decoding error. To analyze it, we first need a basic understanding of character encoding in Python.

For background on Python character encoding, see my blog post:

One article to understand Python character encoding (encoding method, garbled characters and error reasons) - Programmer Sought

That post introduces character encoding and also analyzes the specific causes of this error. Here we focus on the solution, so I won't go into detail; see Sections 3 and 4 of that post.

  3. Now we know that the error occurs because, when gbk is used to decode the text (on Windows with a Simplified Chinese locale, the default encoding is gbk), a leftover byte cannot be decoded. gbk encodes a Chinese character as 2 bytes, so every two bytes decode to one Chinese character, but a single leftover byte cannot be decoded on its own. That is why the error message ends with: incomplete multibyte sequence.

  4. This problem generally occurs because the text file is encoded in utf-8 (which encodes most Chinese characters as 3 bytes) but we decode it with gbk. Because the two encodings use different byte lengths for Chinese characters, a leftover byte that cannot be decoded happens to appear during decoding, and the error is raised.
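The byte counts in points 3 and 4 can be checked directly. The sketch below encodes the same character both ways and also reports the encoding that open() falls back to when no encoding argument is given; the variable names are illustrative, and the reported default depends on your system and Python settings.

```python
import locale

char = "你"

gbk_bytes = char.encode("gbk")     # gbk: 2 bytes for this character
utf8_bytes = char.encode("utf-8")  # utf-8: 3 bytes for this character

print("gbk  :", len(gbk_bytes), "bytes ->", gbk_bytes.hex())
print("utf-8:", len(utf8_bytes), "bytes ->", utf8_bytes.hex())

# The encoding used by open() in text mode when encoding= is omitted.
# On Simplified Chinese Windows this is typically a gbk-compatible code page.
default_encoding = locale.getpreferredencoding(False)
print("default encoding:", default_encoding)
```

Three bytes cannot be split evenly into 2-byte pairs, which is why one byte is left over.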

Why do we say that a leftover byte "happens to" appear?

Because there is a special case in which no error is raised. Two Chinese characters encoded in utf-8 take 6 bytes, and gbk can decode those 6 bytes as 3 Chinese characters (the 6 bytes split into 3 groups of 2 bytes, each corresponding to one character). No error is reported in this case, but the displayed text differs from the original, which is what we usually call garbled characters. For details, see the blog post mentioned above.

You can change the contents of test.txt to the two Chinese characters 你好 ("hello"). After running the code, you will find that no error is raised, but the printed text is not 你好.
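The garbled-text case can also be tried without touching a file. This sketch assumes the two characters are 你好 encoded as UTF-8 and decodes the resulting 6 bytes as gbk; it also shows that the garbling is reversible here, since no bytes were lost.

```python
original = "你好"
utf8_bytes = original.encode("utf-8")  # 2 characters -> 6 bytes

# 6 bytes split evenly into three 2-byte gbk sequences, so no error is raised.
garbled = utf8_bytes.decode("gbk")
print(len(utf8_bytes), "bytes decoded into", len(garbled), "characters:", garbled)

# The text is wrong, but the bytes are intact: re-encoding as gbk and
# decoding as utf-8 recovers the original text.
recovered = garbled.encode("gbk").decode("utf-8")
print(recovered == original)
```

A silent mismatch like this is harder to notice than the exception, which is one more reason to state the encoding explicitly.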

  5. To solve this problem, we need to make Python decode the file as utf-8.

Solution:

When calling open(), add the parameter encoding='utf-8'. This tells Python that the file is encoded in utf-8, so it should decode the file with utf-8 instead of gbk.

with open('test.txt', 'r', encoding='utf-8') as f:
    contents = f.read()

print(contents)

Output:

Success! Problem solved.
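To check the fix end to end on any platform, the sketch below writes a temporary file as utf-8, then reads it back once with gbk and once with utf-8. The temporary directory is a stand-in for the real test.txt, and the file contents (你) are an assumption based on the error message.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")

    # Create the file with a known encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write("你")

    # Reading with the wrong encoding reproduces the error.
    try:
        with open(path, "r", encoding="gbk") as f:
            f.read()
        gbk_failed = False
    except UnicodeDecodeError:
        gbk_failed = True

    # Reading with the matching encoding works.
    with open(path, "r", encoding="utf-8") as f:
        contents = f.read()

print("gbk read failed:", gbk_failed)
print("utf-8 read ok:", contents == "你")
```

Passing encoding explicitly when writing as well as when reading keeps the result independent of the platform default.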

Origin blog.csdn.net/lyb06/article/details/129675526