Encoding when reading and writing files in Python

As mentioned in "Reading Files in Python" and "Writing Files in Python", a file object is read and written by calling its read() and write() methods. These methods handle English text correctly without extra care, but when the content to be read or written is Chinese, you need to consider the encoding.

1 Read existing data

1.1 Create a file

Create a txt file and enter the Chinese text for "Hello World" in it. The editor shows that the file is encoded as "UTF-8", as shown in Figure 1.

Figure 1 Create a new txt file

Related links 1 UTF-8 stands for 8-bit Unicode Transformation Format, a variable-length character encoding. In UTF-8, English letters are represented by one byte each, and common Chinese characters are represented by three bytes each. The UTF-8 encoding of the four Chinese characters for "Hello World" is shown in Figure 2.

Figure 2 UTF-8 encoding of "Hello World"
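The byte counts described above can be checked directly in Python. This is a minimal sketch that assumes the sample text is 你好世界, the four Chinese characters for "Hello World"; the variable names are illustrative only.

```python
# Assumed sample text: the four Chinese characters for "Hello World".
text = "你好世界"

utf8_bytes = text.encode("utf-8")

# An ASCII letter takes one byte in UTF-8.
print(len("A".encode("utf-8")))  # 1

# Each of these four Chinese characters takes three bytes.
print(len(utf8_bytes))           # 12
print(utf8_bytes.hex(" "))       # e4 bd a0 e5 a5 bd e4 b8 96 e7 95 8c
```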

1.2 Reading files

1.2.1 Unspecified encoding method

Use the code shown in Figure 3 to read the file.

Figure 3 The code to read the file without specifying the encoding method

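Here the open() function is not told which encoding to use, so Python falls back to the platform's default encoding (GBK on a Simplified Chinese Windows system, for example). That default does not match the file's UTF-8 encoding, so the printed content is garbled.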
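To see why the output is garbled, it can help to look at the default encoding and at what happens when UTF-8 bytes are decoded with the wrong codec. The sketch below is an illustration rather than the code from Figure 3: it assumes the sample text is 你好世界 and simulates a GBK default by decoding explicitly, since the real default depends on the machine.

```python
import locale

# Encoding that open() typically falls back to when no encoding argument
# is given; it depends on the operating system and its language settings.
print(locale.getpreferredencoding(False))

# Assumed sample text: the four Chinese characters for "Hello World".
text = "你好世界"

# Simulate reading a UTF-8 file with a GBK default: same bytes, wrong decoder.
garbled = text.encode("utf-8").decode("gbk", errors="replace")
print(garbled)
```

The bytes on disk never change; only the interpretation applied to them does.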

1.2.2 Specify the encoding method

The encoding parameter of the open() function specifies the encoding method to read the file. The code is shown in Figure 4.

Figure 4 The code to read the file by specifying the encoding method

The code above reads the file content as "UTF-8", so the output is now the original "Hello World" text.
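A minimal sketch of reading with an explicit encoding might look like the following. The file name hello.txt, the temporary directory, and the sample text 你好世界 are assumptions made so the example is self-contained; the UTF-8 file is created in code here instead of in a text editor.

```python
import tempfile
from pathlib import Path

# Assumed sample text: the four Chinese characters for "Hello World".
text = "你好世界"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "hello.txt"  # hypothetical file name
    # Stands in for the UTF-8 file saved from the editor.
    path.write_bytes(text.encode("utf-8"))

    # Tell open() explicitly how the bytes in the file are encoded.
    with open(path, "r", encoding="utf-8") as f:
        content = f.read()

print(content)
```

If the editor saved the file with a byte order mark, passing encoding="utf-8-sig" instead strips the mark while decoding.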

2 Read data written by Python code

When reading data written by Python code, you should use the same encoding that was used to write it. For example, if you use the "gbk" method when writing, you must also use the "gbk" method when reading, instead of using the "UTF-8" method.

Related links 2 GBK is the Chinese Internal Code Extension Specification. The name abbreviates the pinyin "Guobiao Kuozhan", where K stands for "Kuozhan", meaning "extended" in Chinese. In GBK, English letters are represented by one byte and Chinese characters by two bytes, as shown in Figure 5.

Figure 5 GBK encoding of "Hello World"
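The GBK byte layout can be inspected the same way as any other encoding. This sketch assumes the sample text is 你好世界; Python's built-in "gbk" codec does the conversion.

```python
# Assumed sample text: the four Chinese characters for "Hello World".
text = "你好世界"

gbk_bytes = text.encode("gbk")

# ASCII letters stay one byte in GBK.
print(len("A".encode("gbk")))  # 1

# Each Chinese character takes two bytes.
print(len(gbk_bytes))          # 8
print(gbk_bytes.hex(" "))      # c4 e3 ba c3 ca c0 bd e7
```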

2.1 Write data using a specified method

Use the "GBK" encoding to write "Hello World" to the data.txt file. The code is shown in Figure 6.

Figure 6 The code to write data to the file with a specified encoding
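Writing with a chosen encoding could be sketched as follows. The temporary directory and the sample text 你好世界 are assumptions added to keep the example self-contained; data.txt is the file name used in the text.

```python
import tempfile
from pathlib import Path

# Assumed sample text: the four Chinese characters for "Hello World".
text = "你好世界"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "data.txt"

    # The encoding argument controls how characters are turned into bytes.
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # Read the raw bytes back to confirm what was stored on disk.
    raw = path.read_bytes()

print(len(raw))  # 8: two bytes per Chinese character in GBK
```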

Open data.txt, and you can see that the encoding method of the file is "ANSI", as shown in Figure 7.

Figure 7 txt file written using Python code

Related links 3 "ANSI" takes its name from the American National Standards Institute. On Windows, "ANSI" encoding means the system's default code page; on a Simplified Chinese operating system, that code page is GBK.

2.2 Read files using the same encoding

The code to read the file is shown in Figure 8.

Figure 8 Read the file using the same encoding
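A round trip with matching encodings might look like this. It is a sketch under the same assumptions as the rest of this article's setup: the sample text is taken to be 你好世界, and a temporary directory holds data.txt so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

# Assumed sample text: the four Chinese characters for "Hello World".
text = "你好世界"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "data.txt"

    # Write with GBK ...
    with open(path, "w", encoding="gbk") as f:
        f.write(text)

    # ... and read back with the same encoding.
    with open(path, "r", encoding="gbk") as f:
        read_back = f.read()

print(read_back == text)  # True
```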

If the encoding used for reading differs from the one used for writing, the data cannot be read correctly, as shown in Figure 9. Depending on the bytes involved, the result is either garbled text or a UnicodeDecodeError.

Figure 9 Reading files using different encoding methods
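The mismatch can be reproduced with a short sketch. It assumes the sample text is 你好世界 and uses a temporary directory for data.txt; with this particular text, the GBK bytes are not valid UTF-8, so Python raises an error instead of returning garbled characters.

```python
import tempfile
from pathlib import Path

# Assumed sample text: the four Chinese characters for "Hello World".
text = "你好世界"

result = None
error = None

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "data.txt"
    path.write_text(text, encoding="gbk")

    try:
        # Wrong on purpose: the file holds GBK bytes, not UTF-8.
        with open(path, "r", encoding="utf-8") as f:
            result = f.read()
    except UnicodeDecodeError as exc:
        error = exc
        print("Cannot decode as UTF-8:", exc)
```

Catching UnicodeDecodeError like this is one way to detect a wrong guess and retry with another encoding.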

Origin blog.csdn.net/hou09tian/article/details/131585420