This post summarizes the concepts behind character encoding problems.
Difference between character set and encoding
A character set is a collection of characters in which each character is assigned a unique ID (its code point).
An encoding is the rule for converting code points into byte sequences. The code point of a character is not necessarily stored directly (for example, to save space), so a conversion rule is needed, and that rule is the encoding.
ASCII, GB2312, and GBK are all character sets, and the same names also refer to their corresponding encodings (the two are not strictly distinguished).
Unicode is a character set, but it can be encoded in several ways, such as UTF-8 and UTF-16.
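The distinction can be seen directly in Python. The following is a minimal sketch, not taken from the original post: it takes one character, prints its code point (the character set side), and then encodes it with three different Unicode encodings (the encoding side). The variable names are chosen for illustration only.

```python
# One character set (Unicode), several encodings of the same code point.
ch = "中"
code_point = ord(ch)  # the character's ID in the Unicode character set

utf8_bytes = ch.encode("utf-8")       # variable width, 3 bytes here
utf16_bytes = ch.encode("utf-16-be")  # 2 bytes, big-endian, no byte order mark
utf32_bytes = ch.encode("utf-32-be")  # 4 bytes, the code point stored directly

print(f"U+{code_point:04X}")  # U+4E2D
print(utf8_bytes)             # b'\xe4\xb8\xad'
print(len(utf16_bytes))       # 2
print(len(utf32_bytes))       # 4
```

The code point never changes; only the byte sequence does.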
Various character sets and encodings
ASCII: Covers the English letters, digits, and some symbols, numbered 0~127; each character takes 1 byte.
GB2312: An extension of ASCII for Chinese. Byte values up to 127 keep their original meaning, and two consecutive bytes greater than 127 together represent one Chinese character, which adds about 7,000 Chinese characters. It also re-encodes the characters that already exist in ASCII as two-byte (full-width) forms.
GBK: An extension of GB2312. It stipulates that as long as the high byte is greater than 127, that byte marks the beginning of a Chinese character, which adds more than 20,000 Chinese characters. Later, several thousand ethnic minority characters were added, extending it to GB18030.
ANSI: Different countries developed different standards, including the American ASCII, Chinese GBK, Japanese Shift_JIS, and Korean EUC-KR. These national encoding standards are collectively called ANSI encodings. Different ANSI encodings are incompatible with each other. On a Simplified Chinese system, ANSI encoding is equivalent to GBK.
Unicode: A character set intended to include all the characters and symbols in the world. Its code space consists of 17 planes (0x0000~0x10FFFF). The most commonly used one, plane 0 (the BMP, Basic Multilingual Plane), contains 65,536 code points, each of which fits in 2 bytes.
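The compatibility rules above can be checked with Python's built-in codecs. This is a small sketch under the assumption that the standard "gbk" and "shift_jis" codecs stand in for the Chinese and Japanese ANSI encodings; the sample text and variable names are illustrative.

```python
text = "A中"

gbk_bytes = text.encode("gbk")
# 'A' stays a single ASCII byte; '中' becomes two bytes, both greater than 127.
print(list(gbk_bytes))  # [65, 214, 208]

# Decoding with the matching encoding round-trips the text.
assert gbk_bytes.decode("gbk") == text

# Decoding the same bytes with a different ANSI encoding does not:
# the bytes are interpreted as other characters.
wrong = gbk_bytes.decode("shift_jis", errors="replace")
print(wrong == text)  # False
```

The bytes themselves carry no label saying which ANSI encoding produced them, which is why mixing systems leads to garbled text.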
UTF-8: An encoding of Unicode that encodes each code point as 1~4 bytes.
For a single-byte character, the first bit of the byte is 0, which is identical to ASCII. For an n-byte character (n>1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0, and the first two bits of each following byte are set to 10. The remaining positions are filled with the bits of the character's Unicode code point, starting from the low-order end, and any unused high-order positions are filled with 0.
UTF-8 encodes English characters as 1 byte and Chinese characters generally as 3 bytes, which saves space for English-heavy text compared with storing every code point at a fixed width.
U+0000 ~ U+007F: 0XXXXXXX
U+0080 ~ U+07FF: 110XXXXX 10XXXXXX
U+0800 ~ U+FFFF: 1110XXXX 10XXXXXX 10XXXXXX
U+10000 ~ U+10FFFF: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
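To make the bit layout concrete, here is a hand-written encoder that follows the rules above. It is a sketch for illustration only (the function name utf8_encode is hypothetical); in practice str.encode('utf-8') does this work. It additionally rejects the surrogate range U+D800~U+DFFF, which UTF-8 does not allow.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the UTF-8 bit layout."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        return bytes([code_point])  # 0xxxxxxx, same as ASCII
    if code_point <= 0x7FF:
        n, lead = 2, 0b11000000     # 110xxxxx
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b11100000     # 1110xxxx
    else:
        n, lead = 4, 0b11110000     # 11110xxx

    tail = []
    for _ in range(n - 1):
        # Take the lowest 6 bits and prefix them with 10.
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # The bits that remain go into the first byte, after the length prefix.
    first = lead | code_point
    return bytes([first] + tail[::-1])


print(utf8_encode(ord("中")))                          # b'\xe4\xb8\xad'
print(utf8_encode(ord("中")) == "中".encode("utf-8"))  # True
```

For U+4E2D the three-byte pattern is used: the low 6 bits go into the last byte, the next 6 into the middle byte, and the remaining 4 into the first byte after the 1110 prefix.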
- Simplified Chinese Windows uses GBK by default, while Linux uses UTF-8 by default, so files may need to be transcoded when they are transferred between the two systems.
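Transcoding a file amounts to decoding its bytes with the source encoding and encoding the result with the target one. The helper below is a minimal sketch; the function name transcode, its default arguments, and the temporary file names are assumptions made for this example. It works on raw bytes so that line endings are left untouched.

```python
import tempfile
from pathlib import Path


def transcode(src: Path, dst: Path,
              src_encoding: str = "gbk", dst_encoding: str = "utf-8") -> None:
    """Rewrite the file at src into dst using a different text encoding."""
    text = src.read_bytes().decode(src_encoding)  # bytes -> str
    dst.write_bytes(text.encode(dst_encoding))    # str -> bytes


# Usage: create a GBK file in a temporary directory and convert it to UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    gbk_file = Path(tmp) / "gbk.txt"
    utf8_file = Path(tmp) / "utf8.txt"
    gbk_file.write_bytes("中文".encode("gbk"))
    transcode(gbk_file, utf8_file)
    converted = utf8_file.read_bytes()

print(converted)  # b'\xe4\xb8\xad\xe6\x96\x87'
```

For very large files, reading everything into memory at once may not be appropriate; a streaming variant would be needed.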
Python encoding problem
In Python, strings (str) exist in Unicode form: each character has a unique number (its code point), which can be obtained with ord(), and chr(codepoint) returns the character that corresponds to a number.
>>> ord('a')
97
>>> ord('中')
20013
>>> chr(20013)
'中'
>>> '\u4e2d'
'中'
Encoding and decoding
Python strings (str) exist in memory in Unicode form. To be transmitted over the network or stored on disk, they must be encoded into a byte stream (bytes) with encode(). Likewise, byte streams read from the network or disk can be decoded into strings with decode().
>>> 'ABC'.encode('ascii')
b'ABC'
>>> '中文'.encode('utf-8')
b'\xe4\xb8\xad\xe6\x96\x87'
>>> '中文'.encode('gb2312')
b'\xd6\xd0\xce\xc4'
>>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
'中文'
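Most encoding problems in practice come from encoding with a codec that lacks the characters, or decoding with a codec other than the one used to encode. The sketch below illustrates both cases; it is an added example, and the variable names are illustrative.

```python
text = "中文"
data = text.encode("utf-8")

# ASCII has no codes for Chinese characters, so a strict encode raises.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True

# errors="replace" substitutes '?' for each character that cannot be encoded.
print(text.encode("ascii", errors="replace"))  # b'??'

# Decoding UTF-8 bytes with the wrong codec does not give the text back.
garbled = data.decode("gbk", errors="replace")
print(garbled == text)  # False
```

Without errors="replace", a mismatched decode may also raise UnicodeDecodeError, depending on the bytes involved.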
- Encoding issues in file I/O
Python opens files with open(), which takes the file path, the mode (r/w), and the encoding (encoding) as parameters.
When reading a file, the encoding parameter specifies the encoding used to decode the file content into a string (str, Unicode); it should match the encoding the file was actually saved in.
When writing a file, encoding specifies the encoding used to encode the string (Unicode) before it is stored in the file.
If no encoding parameter is specified when opening a file, Python falls back to the platform's default encoding. On Simplified Chinese Windows this is cp936, which is the familiar GBK.
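Because the default depends on the platform, passing encoding explicitly is the safer habit. The following sketch writes and reads a temporary file with an explicit encoding and also prints the locale's preferred encoding, which is the value open() has traditionally fallen back to. The file name demo.txt is made up for the example.

```python
import locale
import os
import tempfile

# The locale's preferred encoding; the value differs between systems.
print(locale.getpreferredencoding(False))

text = "中文"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Writing: the str is encoded to UTF-8 bytes before it reaches the disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading: the bytes are decoded back to str with the same encoding.
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # Binary mode skips decoding and shows the stored bytes.
    with open(path, "rb") as f:
        raw = f.read()

print(round_trip == text)  # True
print(raw)                 # b'\xe4\xb8\xad\xe6\x96\x87'
```

Reading the same file without encoding='utf-8' on a system whose default is cp936 would decode the bytes as GBK and produce garbled text or an error.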