Character Encoding Summary

A summary of the concepts behind character-encoding problems.

Difference between character set and encoding

  • A character set is a collection of characters in which each character is assigned a unique ID (its code point).

  • Encoding is the rule for converting code points into byte sequences. A character's code point is not necessarily stored as-is (for example, to save space), so a conversion rule is needed, and that rule is the encoding.

  • ASCII, GB2312, and GBK are all character sets, and each name also denotes the corresponding encoding method (the two are not strictly distinguished).

  • Unicode is a character set that can be encoded in several ways, such as UTF-8 and UTF-16.
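
To make the distinction concrete, here is a minimal Python sketch that takes one character, reads its single code point, and then encodes it under several encodings. The sample character and the list of encodings are chosen only for illustration.

```python
# One character has exactly one code point (the character-set side).
ch = '中'
code_point = ord(ch)
print(hex(code_point))

# The same code point becomes different byte sequences
# depending on the encoding (the encoding side).
encoded = {name: ch.encode(name) for name in ('utf-8', 'utf-16-be', 'gbk')}
for name, data in encoded.items():
    print(name, data.hex(), len(data), 'bytes')
```

The code point stays the same throughout; only the stored bytes and their length change with the encoding.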

Various character sets and encodings

  • ASCII: Covers the English letters, digits, and some symbols and control characters, numbered 0~127; each character takes 1 byte.

  • GB2312: A Chinese extension of ASCII. Bytes in the range 0~127 keep their original ASCII meaning, while two consecutive bytes that are both greater than 127 represent one Chinese character, which adds about 7,000 Chinese characters. GB2312 also re-encodes the characters that already exist in ASCII as two-byte (full-width) forms.

  • GBK: An extension of GB2312. It only requires the high (first) byte to be greater than 127 to mark the start of a Chinese character, which brings the total to more than 20,000 Chinese characters. Later, thousands of ethnic-minority characters were added, extending it to GB18030.

  • ANSI: Different countries and regions developed their own standards, such as ASCII in the United States, GBK in China, Shift_JIS in Japan, and EUC-KR in Korea. These national encodings are collectively called ANSI encodings. Different ANSI encodings are incompatible with one another. On a Simplified Chinese system, the ANSI encoding is GBK.

  • Unicode: A character set that aims to cover all the writing systems and symbols in the world. Its code space spans 17 planes (U+0000~U+10FFFF); the most commonly used one, plane 0 (the BMP, Basic Multilingual Plane), contains 65,536 code points, each expressible in 2 bytes.

  • UTF-8: An encoding of Unicode that turns each code point into 1~4 bytes.
    For a single-byte character, the first bit of the byte is 0 and the remaining 7 bits hold the code point, which is identical to ASCII. For an n-byte character (n>1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0, and each following byte starts with 10. The remaining positions are filled with the bits of the character's Unicode code point, from the lowest bit upward, and any unused high positions are padded with 0.
    UTF-8 encodes English characters as 1 byte and Chinese characters generally as 3 bytes, which saves space compared with storing Unicode code points directly.

    U+0000  ~ U+007F:   0XXXXXXX
    U+0080  ~ U+07FF:   110XXXXX 10XXXXXX
    U+0800  ~ U+FFFF:   1110XXXX 10XXXXXX 10XXXXXX
    U+10000 ~ U+10FFFF: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
  • Windows uses GBK by default on Simplified Chinese systems, while Linux generally uses UTF-8 by default, so text files may need transcoding when they are moved between the two.
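
The UTF-8 byte layout described in this section can be sketched by hand as follows. This is a minimal illustration rather than how Python implements it: the function name utf8_encode is hypothetical, it accepts code points up to U+10FFFF (the top of the Unicode range), and real code should simply call str.encode('utf-8').

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes by hand (illustrative only)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    # Single byte: leading bit 0, identical to ASCII.
    if code_point <= 0x7F:
        return bytes([code_point])

    # Choose the total number of bytes n from the code point range.
    if code_point <= 0x7FF:
        n = 2
    elif code_point <= 0xFFFF:
        n = 3
    else:
        n = 4

    # Continuation bytes: prefix 10, filled 6 bits at a time from the low bits.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # Leading byte: n one-bits, then a zero, then the remaining high bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    lead = prefix | code_point
    return bytes([lead] + tail[::-1])


print(utf8_encode(ord('中')).hex())
```

For example, utf8_encode(ord('中')) produces the same three bytes as '中'.encode('utf-8').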

Python encoding problem

  • In Python, a string (str) is stored as Unicode: every character has a unique number (its code point), which ord(char) returns, while chr(codepoint) returns the character for a given number.

    >>> ord('a')
    97
    >>> ord('中')
    20013
    >>> chr(20013)
    '中'
    >>> '\u4e2d'
    '中'
  • Encoding and decoding
    Python strings (str) exist in memory as Unicode. To be transmitted over the network or stored on disk, they must be encoded into a byte stream (bytes) with encode(). Likewise, a byte stream read from the network or disk can be decoded back into a string with decode().

    >>> 'ABC'.encode('ascii')
    b'ABC'
    >>> '中文'.encode('utf-8')
    b'\xe4\xb8\xad\xe6\x96\x87'
    >>> '中文'.encode('gb2312')
    b'\xd6\xd0\xce\xc4'
    >>> b'\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
    '中文'
  • Encoding in file I/O
    Python opens files with open(), which takes the file path, the mode (r/w), and the encoding (encoding).
    When reading a file, the encoding parameter specifies the encoding used to decode the file content into a str (Unicode); it should match the encoding the file was actually saved in.
    When writing a file, encoding specifies the encoding used to encode the string (Unicode) before it is stored in the file.
    If no encoding parameter is given when opening a file, Python falls back to the platform default, which on Simplified Chinese Windows is cp936, the familiar GBK.
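
The points about open() can be tried out with a small sketch like the one below. It assumes a temporary directory and a hypothetical file name demo.txt; it writes text with an explicit encoding, reads it back with the same encoding, and then reads it with a mismatched one.

```python
import os
import tempfile

text = '中文'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')  # hypothetical file name

    # Write with an explicit encoding so the bytes on disk
    # do not depend on the platform default.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Reading with the same encoding recovers the original string.
    with open(path, 'r', encoding='utf-8') as f:
        round_trip = f.read()

    # Reading with a different encoding either fails or yields garbled text.
    try:
        with open(path, 'r', encoding='gbk') as f:
            mismatched = f.read()
    except UnicodeDecodeError:
        mismatched = None

print(round_trip == text)
print(mismatched == text)
```

Passing encoding explicitly on both the write and the read is what keeps the result independent of the system the script runs on.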
