ASCII

The Python interpreter decodes the content of a .py file while loading it (ASCII by default in Python 2; Python 3 defaults to UTF-8).

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet, mainly used to display modern English; 8-bit extensions of it cover other Western European languages. Standard ASCII is a 7-bit code that defines 2^7=128 characters. Each character is stored in one byte (8 bits), and a byte can hold 2^8=256 distinct values, so even the extended 8-bit variants can represent at most 256 characters.
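The limits described above can be checked with a small Python 3 sketch. The helper name is_ascii is made up for this illustration; the sketch assumes that ASCII proper covers code points 0 to 127, while one byte can hold 256 values.

```python
# ASCII proper defines code points 0..127; one byte can hold 2**8 values.
ascii_size = 2 ** 7      # number of code points defined by ASCII
byte_values = 2 ** 8     # number of distinct values that fit in one byte


def is_ascii(text):
    """Return True if every character in text has a code point below 128.

    Hypothetical helper written for this example only.
    """
    return all(ord(ch) < ascii_size for ch in text)


print(ord("A"))            # numeric code of the letter A
print(is_ascii("hello"))   # plain English text fits in ASCII
print(is_ascii("中文"))     # Chinese characters do not

try:
    "中文".encode("ascii")
except UnicodeEncodeError as exc:
    # The ascii codec refuses characters outside its range.
    print("cannot encode as ASCII:", exc.reason)
```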

About Chinese

To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains a total of 7445 characters, including 6763 Chinese characters and 682 other symbols. The inner code range of the Hanzi area is B0-F7 for the high byte and A1-FE for the low byte. That gives 72*94=6768 code positions, of which 5 (D7FA-D7FE) are vacant, leaving 6768-5=6763 Chinese characters.
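The arithmetic for the Hanzi area can be worked through in Python 3. This is a minimal sketch using the byte ranges quoted above; the constant names are chosen for this example, and the gb2312 codec is the one shipped with the standard library.

```python
# Byte ranges of the GB2312 Hanzi area, as quoted in the text.
HIGH_FIRST, HIGH_LAST = 0xB0, 0xF7
LOW_FIRST, LOW_LAST = 0xA1, 0xFE

rows = HIGH_LAST - HIGH_FIRST + 1   # possible high-byte values
cols = LOW_LAST - LOW_FIRST + 1     # possible low-byte values
positions = rows * cols             # code positions in the Hanzi area
vacant = 0xD7FE - 0xD7FA + 1        # unassigned positions D7FA..D7FE
hanzi_count = positions - vacant    # Chinese characters actually assigned

print(rows, cols, positions, vacant, hanzi_count)

# The first Hanzi position, B0A1, holds the character 啊.
first = "啊".encode("gb2312")
print(first.hex())
```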

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 included 21,886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area contains 21,003 characters. GB18030, published in 2000, is the national standard that replaced GBK 1.0. It includes 27,484 Chinese characters, as well as Tibetan, Mongolian, Uyghur, and other major minority scripts. PC platforms are now required to support GB18030, but there is no such requirement for embedded products, so mobile phones and MP3 players generally support only GB2312.
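The difference in coverage can be observed with the codecs bundled with Python 3. The sketch below uses a hypothetical helper, can_encode, and picks the Traditional Chinese character 國 as an example of a character that lies outside GB2312 but inside the later standards.

```python
def can_encode(text, codec):
    """Return True if text can be encoded with the named codec.

    Hypothetical helper written for this example only.
    """
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# 国 is the Simplified form, 國 the Traditional form of the same word.
for codec in ("gb2312", "gbk", "gb18030"):
    print(codec, can_encode("国", codec), can_encode("國", codec))
```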

From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same encoding in each scheme, and each later standard supports more characters. In these encodings, English and Chinese can be handled uniformly. A Chinese character is recognized by the highest bit of its high byte, which is not 0. In programmers' terminology, GB2312, GBK, and GB18030 all belong to the double-byte character set (DBCS) family, although GB18030 also uses four-byte sequences for some characters.
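Both properties, identical bytes across the three standards and the high-bit test, can be sketched in Python 3. The function split_gbk is a made-up illustration that only handles one-byte and two-byte units; it does not attempt the four-byte sequences of GB18030.

```python
text = "A中"

# Encode the same text with each standard's codec.
encoded = {codec: text.encode(codec) for codec in ("gb2312", "gbk", "gb18030")}
for codec, data in encoded.items():
    print(codec, data.hex())


def split_gbk(data):
    """Split bytes into 1-byte ASCII units and 2-byte Chinese units.

    Hypothetical helper: a lead byte with its highest bit set is assumed
    to start a two-byte character; four-byte GB18030 forms are ignored.
    """
    units = []
    i = 0
    while i < len(data):
        width = 2 if data[i] & 0x80 else 1
        units.append(data[i:i + width])
        i += width
    return units


print(split_gbk(encoded["gbk"]))
```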

The default internal code of some Chinese versions of Windows is still GBK, which can be upgraded to GB18030 with the GB18030 upgrade package. However, the characters that GB18030 adds beyond GBK are rarely needed by most people, so we usually still use GBK to refer to the internal code of Chinese Windows.

Obviously, ASCII cannot represent all the characters and symbols in the world, so a new encoding was needed that can represent all of them, namely Unicode.

Unicode (also called Universal Code or Single Code) is a character encoding used on computers. It was created to overcome the limitations of traditional character encoding schemes. It assigns a unified, unique binary code to every character in every language, and it stipulates that characters and symbols are represented with at least 16 bits (2 bytes), that is, 2**16=65536 values.
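As a rough illustration in Python 3, ord returns the unique Unicode number of a character, and the utf-16-be codec shows the 16-bit form. The emoji is an example chosen here of a character whose number does not fit in 16 bits, so it takes more than 2 bytes.

```python
# Number of values that fit in 16 bits.
sixteen_bit_values = 2 ** 16
print(sixteen_bit_values)

# Every character has one unique number (code point).
code_point = ord("中")
print(hex(code_point))

# A character below 2**16 takes 2 bytes in UTF-16 ...
bmp = "中".encode("utf-16-be")
# ... while one above that limit takes 4 bytes.
astral = "😀".encode("utf-16-be")
print(len(bmp), len(astral))
```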

Note: 2 bytes is only the minimum mentioned here; some characters need more. UTF-8 is a compressed, optimized form of the Unicode encoding. It no longer uses at least 2 bytes for everything, but sorts characters and symbols into classes: ASCII content is stored in 1 byte, European characters in 2 bytes, and East Asian characters in 3 bytes.
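The byte counts per class can be confirmed with a short Python 3 sketch. The three sample characters are picked for this example, one from each class named above.

```python
# One sample character per class; \u00e9 is the precomposed letter é.
samples = {
    "ASCII": "A",
    "European": "\u00e9",
    "East Asian": "中",
}

# Number of bytes each sample occupies in UTF-8.
utf8_widths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

for label, width in utf8_widths.items():
    print(label, samples[label], width)
```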

So when the Python interpreter loads the code in a .py file, it decodes the content with a default encoding (ASCII in Python 2) unless the file declares a different one.
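How the interpreter picks that encoding can be explored with the standard tokenize module. This sketch assumes Python 3, where the default is UTF-8 rather than ASCII; source_encoding is a hypothetical wrapper, and the two byte strings stand in for the contents of .py files.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding Python 3 would use to decode this source.

    Hypothetical wrapper around tokenize.detect_encoding for this example.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A file that declares its encoding on the first line.
declared = "# -*- coding: gbk -*-\ns = '中'\n".encode("gbk")
# A file without a declaration, saved as UTF-8.
undeclared = "s = '中'\n".encode("utf-8")

print(source_encoding(declared))
print(source_encoding(undeclared))
```

A declaration comment such as the one in the first sample is what tells the interpreter to use something other than its default.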

 
   
