The role of the first two lines of a Python script: #!/usr/bin/python and # -*- coding: utf-8 -*- (repost)

#!/usr/bin/python

indicates that the script is written in Python.

It tells the system to use the python program (the interpreter) located under /usr/bin to interpret and run the script.
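As a rough illustration of what the system reads from that first line, here is a minimal sketch. The helper name parse_shebang and the second sample input are hypothetical, and the whitespace splitting is a simplification: real operating systems differ in how they treat arguments after the interpreter path.

```python
def parse_shebang(first_line):
    """Return the interpreter command named on a shebang line, or None.

    Simplified sketch: splits on whitespace, which is not exactly how
    every operating system handles arguments after the path.
    """
    if not first_line.startswith("#!"):
        return None
    # Everything after "#!" is the interpreter path plus optional arguments.
    return first_line[2:].split()


print(parse_shebang("#!/usr/bin/python"))       # ['/usr/bin/python']
print(parse_shebang("#!/usr/bin/env python3"))  # ['/usr/bin/env', 'python3']
print(parse_shebang("import os, sys"))          # None
```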

 

# -*- coding: utf-8 -*-

specifies that the file is encoded in UTF-8.

For details, please refer to:

PEP 0263 — Defining Python Source Code Encodings

 

The rest of this post explains in detail (mostly by translating the PEP) why this encoding declaration was introduced and how to write it:

 

The problem before file encoding declarations existed

In Python 2.1, Unicode characters could only be written using the Latin-1 based "unicode-escape" encoding. For users in countries that do not use Latin-1, entering Unicode characters this way was cumbersome and inconvenient.

The goal was:

Programmers should be able to write strings in whatever encoding suits their preferences and needs, and have that work normally.

 

The proposed solution

So someone submitted an official proposal to Python, which became PEP 0263.

The proposal is:

Allow a Python file to declare which encoding it uses, by means of a specially formatted comment at the beginning of the file.

This required a number of changes, in particular to the parser for Python files, which has to recognize such encoding declarations.

 

How exactly is the Python file encoding declared?

As mentioned above, it is declared in a comment at the beginning of the file.

How is it written, and in what format?

It is the line already shown above:

# -*- coding: utf-8 -*-

A detailed explanation of this format is:

  1. If there is no encoding declaration, Python (Python 2, which this PEP targets) processes the file as ASCII; Python 3 defaults to UTF-8 instead
    • If the encoding is not declared but the file contains non-ASCII characters, the Python parser reports an error when it parses the file.
  2. Must be placed on the first or second line of the python file
  3. There are three supported formats:
    1. With an equals sign:
      # coding=<encoding name>
    2. The most common, with a colon (which most editors recognize correctly):
      #!/usr/bin/python
      # -*- coding: <encoding name> -*-
    3. vim:
      #!/usr/bin/python
      # vim: set fileencoding=<encoding name> :
  4. More precisely:
    • The declaration line must match this regular expression:
      "coding[:=]\s*([-\w.]+)"
    • Anyone familiar with regular expressions can therefore write other declarations that also match. Taking utf-8 as an example (each must still appear inside a comment):
      coding:         utf-8
      coding=utf-8
      coding=                  utf-8
      encoding:utf-8
      crifanEncoding=utf-8
  5. Special handling for UTF-8 with a BOM ('\xef\xbb\xbf'), which is common on Windows:
    1. If the file is encoded as UTF-8 with a BOM, that is, its first three bytes are '\xef\xbb\xbf', then:
      1. Even without an encoding declaration, the file is automatically treated as UTF-8
      2. If an encoding is declared, it must be UTF-8, consistent with the file's actual encoding
        1. Otherwise an error is reported, because the declared encoding and the actual encoding disagree
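The rules in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the interpreter's actual implementation: the function name declared_encoding is hypothetical, the regular expression is the one quoted in the text, and the "ascii" default follows the Python 2 behavior the list describes.

```python
import re

# Regular expression quoted in the text above.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")
UTF8_BOM = b"\xef\xbb\xbf"


def declared_encoding(source_bytes, default="ascii"):
    """Return the encoding a source file declares, following the rules above."""
    has_bom = source_bytes.startswith(UTF8_BOM)
    if has_bom:
        source_bytes = source_bytes[len(UTF8_BOM):]
    # Only the first two lines may carry the declaration.
    for line in source_bytes.splitlines()[:2]:
        text = line.decode("ascii", errors="replace")
        if not text.lstrip().startswith("#"):
            continue  # the declaration must sit in a comment
        match = CODING_RE.search(text)
        if match:
            name = match.group(1)
            normalized = name.lower().replace("_", "-")
            if has_bom and normalized not in ("utf-8", "utf8"):
                raise SyntaxError("BOM implies UTF-8 but declaration says " + name)
            return name
    # A BOM alone is enough to treat the file as UTF-8.
    return "utf-8" if has_bom else default


print(declared_encoding(b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\n"))  # latin-1
print(declared_encoding(b"# vim: set fileencoding=utf-8 :\n"))               # utf-8
print(declared_encoding(b"\xef\xbb\xbfimport os\n"))                         # utf-8
```

Note that the sketch returns the name exactly as written; a real interpreter would additionally check that the name refers to a known codec.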

 

Various examples of file encoding declarations

The following examples, valid and invalid, illustrate the rules above:

Valid Python file encoding declarations

  1. With an interpreter line and an Emacs-style encoding declaration (in a comment)
    1. Example 1:
      #!/usr/bin/python
      # -*- coding: latin-1 -*-
      import os, sys
      ...
    2. Example 2:
      #!/usr/bin/python
      # -*- coding: iso-8859-15 -*-
      import os, sys
      ...
    3. Example 3:
      #!/usr/bin/python
      # -*- coding: ascii -*-
      import os, sys
      ...
  2. Without an interpreter line, using a plain-text comment:
    # This Python file uses the following encoding: utf-8
    import os, sys
    ...
  3. Text editors may have other ways of defining the file encoding, for example:
    #!/usr/local/bin/python
    # coding: latin-1
    import os, sys
    ...
    • Here the -*- markers are omitted; coding followed by the encoding name is enough
  4. Without an encoding declaration, the file is treated as ASCII by default (the Python 2 behavior):
    #!/usr/local/bin/python
    import os, sys
    ...
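To check declarations like the valid examples above, one option is the standard library's tokenize.detect_encoding, which reads at most the first two lines. This is a sketch run under Python 3, so two differences from the text are expected: the library reports its own normalized name for latin-1, and the default without a declaration is UTF-8 rather than ASCII. The helper name detect is hypothetical.

```python
import io
import tokenize


def detect(source_bytes):
    """Ask the standard library which encoding it detects for these bytes."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Emacs-style declaration on the second line.
print(detect(b"#!/usr/bin/python\n# -*- coding: latin-1 -*-\nimport os, sys\n"))  # iso-8859-1

# Plain-text comment on the first line.
print(detect(b"# This Python file uses the following encoding: utf-8\nimport os, sys\n"))  # utf-8

# No declaration at all: Python 3 falls back to UTF-8.
print(detect(b"#!/usr/local/bin/python\nimport os, sys\n"))  # utf-8
```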

Examples of invalid Python file encoding declarations

  1. Missing the coding: prefix
    #!/usr/local/bin/python
    # latin-1
    import os, sys
    ...
  2. The encoding declaration is not on the first or second line:
    #!/usr/local/bin/python
    #
    # -*- coding: latin-1 -*-
    import os, sys
    ...
  3. An unsupported encoding name in the declaration:
    #!/usr/local/bin/python
    # -*- coding: utf-42 -*-
    import os, sys
    ...
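The three invalid cases above behave differently in practice, which a short experiment with the standard library's tokenize.detect_encoding can show. This is a sketch run under Python 3 (where the fallback is UTF-8, not ASCII); the helper name try_detect is hypothetical. In the first two cases the comment is simply not recognized as a declaration, so the default applies; only the unknown encoding name is rejected outright.

```python
import io
import tokenize


def try_detect(source_bytes):
    """Return the detected encoding, or None if the declaration is rejected."""
    try:
        encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    except SyntaxError:
        return None
    return encoding


# 1. Missing "coding:" prefix: not recognized, so the default is used.
print(try_detect(b"#!/usr/local/bin/python\n# latin-1\nimport os, sys\n"))  # utf-8

# 2. Declaration on the third line: ignored, so the default is used.
print(try_detect(b"#!/usr/local/bin/python\n#\n# -*- coding: latin-1 -*-\nimport os, sys\n"))  # utf-8

# 3. Unknown encoding name: rejected with a SyntaxError.
print(try_detect(b"#!/usr/local/bin/python\n# -*- coding: utf-42 -*-\nimport os, sys\n"))  # None
```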

 

Principles behind the Python file encoding declaration

1. A single Python source file uses a single encoding throughout.

->

Embedding data in several different encodings is not allowed.

Otherwise, the Python interpreter reports an encoding error when it parses the file.
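A small experiment shows why mixed encodings fail. The sketch below builds the bytes of a hypothetical source file that declares utf-8 but contains one string literal encoded as Latin-1, then tries to decode the whole file with the declared encoding. The helper name decodes_cleanly is an assumption for illustration only.

```python
def decodes_cleanly(source_bytes, encoding):
    """Return True if every byte of the source is valid in the given encoding."""
    try:
        source_bytes.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


declaration = b"# -*- coding: utf-8 -*-\n"

# Both literals encoded as UTF-8: consistent with the declaration.
consistent = declaration + b"a = '" + "é".encode("utf-8") + b"'\n"

# Second literal encoded as Latin-1: two encodings mixed in one file.
mixed = consistent + b"b = '" + "é".encode("latin-1") + b"'\n"

print(decodes_cleanly(consistent, "utf-8"))  # True
print(decodes_cleanly(mixed, "utf-8"))       # False
```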

 

I don't understand this paragraph:

Any encoding which allows processing the first two lines in the way indicated above is allowed as source code encoding, this includes ASCII compatible encodings as well as certain multi-byte encodings such as Shift_JIS. It does not include encodings which use two or more bytes for all characters like e.g. UTF-16. The reason for this is to keep the encoding detection algorithm in the tokenizer simple.

 

2. I do not fully understand this paragraph either:

Handling of escape sequences should continue to work as it does now, but with all possible source code encodings, that is standard string literals (both 8-bit and Unicode) are subject to escape sequence expansion while raw string literals only expand a very small subset of escape sequences.

 

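3. Python's tokenizer and compiler work according to the following logic:

  1. Read the file
  2. Decode it into Unicode according to the encoding the file declares
  3. Convert it into a UTF-8 string
  4. Tokenize the UTF-8 string
  5. Compile it, creating Unicode objects

Note that:

Identifiers in Python are all ASCII (this is the rule stated in PEP 0263; Python 3 later relaxed it).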
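The five steps can be imitated from ordinary Python code. This is only a sketch of the idea under Python 3, not how the interpreter is implemented internally; the function name source_to_utf8 and the sample source are hypothetical.

```python
def source_to_utf8(source_bytes, declared_encoding):
    """Steps 2 and 3: decode with the declared encoding, re-encode as UTF-8."""
    text = source_bytes.decode(declared_encoding)
    return text.encode("utf-8")


# Step 1: the raw bytes of a file saved as Latin-1 (built in memory here).
latin1_source = "# -*- coding: latin-1 -*-\ns = 'café'\n".encode("latin-1")

utf8_bytes = source_to_utf8(latin1_source, "latin-1")

# Steps 4 and 5: tokenize and compile. The text is passed as a str so that
# compile() does not apply the latin-1 declaration a second time.
code = compile(utf8_bytes.decode("utf-8"), "<example>", "exec")
namespace = {}
exec(code, namespace)
print(namespace["s"])  # café
```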

 

The rest of the content is not translated here.

By this point, the explanation should be clear enough.

 
Original link: http://www.crifan.com/python_head_meaning_for_usr_bin_python_coding_utf-8/
