# -*- coding: gbk -*- in Python code

A comment such as # -*- coding: gbk -*- is used in Python to declare the character encoding of the source file. It must appear on the first or second line of the source file (on the second line when the first line is a shebang line or another comment).

effect


Python 2 parses source code as ASCII by default, and Python 3 parses it as UTF-8. If the source file contains bytes that are not valid in the default encoding (for example, Chinese characters saved as GBK), the interpreter raises a SyntaxError. Adding a comment such as # -*- coding: gbk -*- tells the interpreter that the source file is encoded in GBK, so that the Chinese characters in it are decoded correctly.
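
The effect can be reproduced without writing any file, because compile() applies the same encoding rules to raw bytes. The following is a minimal sketch for Python 3; the helper name compiles and the file name "<demo>" are made up for illustration.

```python
def compiles(source_bytes):
    """Return True if Python 3 can compile the raw source bytes."""
    try:
        compile(source_bytes, "<demo>", "exec")
        return True
    except SyntaxError:
        return False


# The same statement, saved as GBK bytes (not valid UTF-8).
body = "s = '中文'\n".encode("gbk")
declaration = b"# -*- coding: gbk -*-\n"

print(compiles(body))                # False: the bytes are decoded as UTF-8 and rejected
print(compiles(declaration + body))  # True: the declaration selects GBK

# With the declaration, the string literal is decoded correctly.
namespace = {}
exec(compile(declaration + body, "<demo>", "exec"), namespace)
print(namespace["s"])                # 中文
```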

principle


When Python parses the source code, it first reads the first few bytes of the source file to check for a BOM (Byte Order Mark). If a UTF-8 BOM is present, the file is parsed as UTF-8. Otherwise, the interpreter looks for an encoding declaration on the first two lines and parses the file with the declared encoding. If no encoding is declared, the default encoding is used: ASCII in Python 2, UTF-8 in Python 3.

When the source file contains a comment such as # -*- coding: XXX -*-, the interpreter parses the source file with the encoding named in that comment.
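
The standard library exposes this lookup as tokenize.detect_encoding(), so the order described above (BOM, then declaration, then default) can be observed directly. This is a sketch for Python 3; the helper name detect and the three sample sources are illustrative only.

```python
import codecs
import io
import tokenize


def detect(source_bytes):
    """Return the encoding Python's tokenizer would choose for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


declared = "# -*- coding: gbk -*-\ns = '中文'\n".encode("gbk")
with_bom = codecs.BOM_UTF8 + "s = '中文'\n".encode("utf-8")
plain = b"s = 'ascii only'\n"

print(detect(declared))  # gbk        (taken from the declaration)
print(detect(with_bom))  # utf-8-sig  (UTF-8 with a BOM)
print(detect(plain))     # utf-8      (Python 3 default)
```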

What is # -*- coding: utf8 -*-?


# -*- coding: utf8 -*- works the same way as # -*- coding: gbk -*-: both declare the character encoding of the source file, one as UTF-8 and the other as GBK. In Python 3, saving source files as UTF-8 (declared with # -*- coding: utf-8 -*-) is recommended, because Python 3 uses UTF-8 as its default source encoding.

What is the difference between the two?


Both GBK and UTF-8 are multi-byte encoding formats, but they are encoded differently.

GBK is a double-byte encoding for Chinese text: ASCII characters occupy one byte and each Chinese character occupies two bytes. UTF-8 is a variable-length encoding: a Chinese character usually occupies three bytes, and rarer characters outside the Basic Multilingual Plane occupy four, depending on the character's code point.

Therefore, if the code contains a large number of Chinese characters, saving it as GBK makes the file somewhat smaller (the number of lines does not change), while using UTF-8 avoids encoding problems that cause the code to fail to run, because UTF-8 is a universal encoding.
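
The size difference is easy to check by encoding a few characters both ways. This is a small sketch; the helper name byte_lengths is hypothetical, and U+20000 is used only as an example of a character outside GBK's repertoire.

```python
def byte_lengths(ch):
    """Return (gbk_length, utf8_length); gbk_length is None if GBK cannot encode ch."""
    try:
        gbk_length = len(ch.encode("gbk"))
    except UnicodeEncodeError:
        gbk_length = None
    return gbk_length, len(ch.encode("utf-8"))


print(byte_lengths("a"))           # (1, 1)    ASCII letter
print(byte_lengths("中"))          # (2, 3)    common Chinese character
print(byte_lengths("\U00020000"))  # (None, 4) rare character that GBK cannot represent
```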

alternative method


Besides adding a comment such as # -*- coding: XXX -*- at the beginning of the source file, there are the following alternatives:

    1. Specify the encoding format when using the open() function to open the file:

with open("filename", "r", encoding="gbk") as f:
    content = f.read()  # read the file content, decoding it as GBK

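    2. Convert the character encoding format of the source file to Python's default UTF-8 encoding:

# Read the GBK-encoded source and re-encode it as UTF-8 bytes.
with open("filename", encoding="gbk") as f:
    source = f.read().encode("utf-8")
# Decode back to str before executing; exec() ignores any coding declaration in a str.
exec(source.decode("utf-8"))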
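
The second alternative can also be applied once, on disk, instead of at every run. The sketch below assumes a hypothetical helper convert_gbk_to_utf8 and made-up file names; it works on bytes so that line endings are left untouched.

```python
import tempfile
from pathlib import Path


def convert_gbk_to_utf8(src, dst):
    """Re-encode a GBK text file as UTF-8, leaving line endings as they are."""
    text = Path(src).read_bytes().decode("gbk")
    Path(dst).write_bytes(text.encode("utf-8"))


# Demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "legacy_gbk.py"
    dst = Path(tmp) / "converted_utf8.py"
    src.write_bytes("print('中文')\n".encode("gbk"))
    convert_gbk_to_utf8(src, dst)
    converted = dst.read_bytes()

print(converted == "print('中文')\n".encode("utf-8"))  # True
```

If the original file carries a # -*- coding: gbk -*- line, that line has to be removed or changed to utf-8 after the conversion, otherwise the interpreter would decode the new UTF-8 bytes as GBK.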

 

Other knowledge points


    1. BOM

The BOM (Byte Order Mark) is a special character in the Unicode standard that identifies the byte order of a character stream. It is the code point U+FEFF placed at the beginning of a Unicode text file; encoded as UTF-8 it becomes the three bytes EF BB BF. In Python, if the source file starts with a UTF-8 BOM, the interpreter parses the source file as UTF-8.

    2. Code conversion

In Python, encoding conversion can be done using the str.encode() and bytes.decode() methods. For example, convert a string to a GBK-encoded byte string:

s = "中文"
b = s.encode("gbk")

Convert a GBK-encoded byte string to a Unicode string:

b = b"\xd6\xd0\xce\xc4"
s = b.decode("gbk")

    3. Unicode

Unicode is a character set that aims to cover the characters, symbols and emoji of all writing systems, and it assigns each character a unique code point. In Python 3, strings (str) are sequences of Unicode code points by default.
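
The BOM and code point notions above can be inspected from the interpreter. This is a brief sketch using only the standard library; the variable names are illustrative.

```python
import codecs

# U+FEFF encoded as UTF-8 is exactly the UTF-8 BOM.
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)  # True

# The "utf-8-sig" codec writes a BOM when encoding and strips it when decoding.
data = "中文".encode("utf-8-sig")
print(data.startswith(codecs.BOM_UTF8))             # True
print(data.decode("utf-8-sig") == "中文")            # True
print(data.decode("utf-8") == "\ufeff中文")          # True: plain UTF-8 keeps the BOM character

# Every character in a str is a Unicode code point.
print(hex(ord("中")))                                # 0x4e2d
print(chr(0x6587))                                   # 文
```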

Summary


In Python, a comment such as # -*- coding: XXX -*- can be used to declare the character encoding of the source file, so that Chinese characters in the source file are not parsed incorrectly. Besides adding such a comment at the beginning of the source file, you can also specify the encoding when opening a file with the open() function, or convert the source file to Python's default UTF-8 encoding. When dealing with encoding issues, it also helps to understand related topics such as the BOM, encoding conversion, and Unicode.

Note that the encoding declaration must appear on the first or second line of the source file, and nothing other than whitespace may come before the comment symbol # on that line. When choosing an encoding, choose according to the actual situation, so that the declared encoding matches the encoding the file is actually saved in and conversion problems are avoided.

In short, encoding problems are hard to avoid in Python development. Only by understanding the relevant concepts and adopting appropriate solutions can a project be developed and deployed smoothly.
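
These placement rules can be checked with tokenize.detect_encoding() from the standard library. The sketch below targets Python 3; the helper name detected_encoding and the sample sources are made up for illustration.

```python
import io
import tokenize


def detected_encoding(source_bytes):
    """Return the source encoding Python 3 would use for these bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


first_line = b"# -*- coding: gbk -*-\nx = 1\n"
after_shebang = b"#!/usr/bin/env python3\n# -*- coding: gbk -*-\nx = 1\n"
indented = b"   # -*- coding: gbk -*-\nx = 1\n"
third_line = b"# one\n# two\n# -*- coding: gbk -*-\nx = 1\n"
after_code = b"x = 1  # -*- coding: gbk -*-\n"

print(detected_encoding(first_line))     # gbk
print(detected_encoding(after_shebang))  # gbk
print(detected_encoding(indented))       # gbk   (leading whitespace is allowed)
print(detected_encoding(third_line))     # utf-8 (too late, declaration ignored)
print(detected_encoding(after_code))     # utf-8 (code before the #, declaration ignored)
```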

Origin blog.csdn.net/weixin_40025666/article/details/131305129