The difference between Unicode and UTF-8

Reposted from https://www.zhihu.com/question/23374078

Saved for personal study; will be removed on request if it infringes.

------------------------------------

Simply put:

  • Unicode is the "character set"
  • UTF-8 is the "encoding rule"

Where:

  • Character set: assigns a unique ID to each "character" (formally called a code point)
  • Encoding rule: the rule for converting a code point into a byte sequence (encoding/decoding can loosely be compared to encryption/decryption, although no secrecy is involved)

 

In the broad sense, Unicode is a standard that defines both a character set and a family of encoding rules: the Unicode character set plus the UTF-8, UTF-16, UTF-32, and other encodings.
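To make the split between "character set" and "encoding rule" concrete, here is a minimal sketch using Python's built-in codecs. It takes the character 知 (U+77E5) as a sample and shows that one code point yields a different byte sequence under each encoding. The choice of big-endian variants is an assumption made only to keep the output free of a byte-order mark.

```python
# One code point, three encoding forms. The "-be" suffix pins big-endian byte
# order, so Python does not prepend a byte-order mark.
ch = "\u77e5"  # the character 知

encoded = {name: ch.encode(name).hex() for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, hex_bytes in encoded.items():
    print(f"{name:10} {hex_bytes}")
```

The code point never changes; only the bytes that represent it do.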

The Unicode character set assigns a code point to each character. For example, the code point of "知" is 30693, written U+77E5 (30693 in hexadecimal is 0x77E5).
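The mapping between a character and its code point can be checked with Python's built-in `ord` and `chr`. This is only an illustration; the variable names are arbitrary.

```python
ch = "\u77e5"                 # the character 知
code_point = ord(ch)          # character -> code point (an integer)
notation = f"U+{code_point:04X}"

print(code_point)             # decimal form of the code point
print(notation)               # the usual U+XXXX notation
print(chr(0x77E5) == ch)      # code point -> character round-trips
```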

UTF-8, as its name implies, is a variable-length encoding whose code unit is 8 bits. It encodes a code point into 1 to 4 bytes:

U+ 0000 ~ U+  007F: 0XXXXXXX
U+ 0080 ~ U+  07FF: 110XXXXX 10XXXXXX
U+ 0800 ~ U+  FFFF: 1110XXXX 10XXXXXX 10XXXXXX
U+10000 ~ U+10FFFF: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
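The table can be turned into code almost line by line. The sketch below is a hand-written encoder for a single code point, meant only to mirror the four rows; the function name `utf8_encode` is made up for this example, and real code should simply call `str.encode("utf-8")`. The rejection of surrogates (U+D800 to U+DFFF) is not in the table but is part of the UTF-8 definition.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling in the templates from the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:        # row 1: 0XXXXXXX
        return bytes([code_point])
    if code_point <= 0x7FF:       # row 2: 110XXXXX 10XXXXXX
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:      # row 3: 1110XXXX 10XXXXXX 10XXXXXX
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # row 4: 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


print(utf8_encode(0x77E5).hex())
```

Each branch shifts the code point right to pick out the bits for one byte, masks them to 6 bits where needed, and ORs in the fixed prefix from the template.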

According to the encoding rules in the table above, the code point U+77E5 of the character "知" from earlier falls in the range of the third row:

       7    7    E    5
    0111 0111 1110 0101    77E5 in binary
--------------------------
    0111   011111   100101 77E5 in binary, regrouped as 4 + 6 + 6 bits
1110XXXX 10XXXXXX 10XXXXXX template (third row of the table above)
11100111 10011111 10100101 bits substituted into the template
   E   7    9   F    A   5

This is how U+77E5 is encoded into the byte sequence E7 9F A5 under UTF-8. Decoding is the same process in reverse.
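The reverse direction can be sketched the same way. The snippet below handles only the three-byte case from the example (it does no validation of the prefixes), and compares the result with Python's built-in decoder.

```python
data = bytes.fromhex("E79FA5")

# Strip the 1110 prefix from the lead byte and the 10 prefix from each
# continuation byte, then join the remaining 4 + 6 + 6 payload bits.
b0, b1, b2 = data
code_point = ((b0 & 0b00001111) << 12) | ((b1 & 0b00111111) << 6) | (b2 & 0b00111111)

print(f"U+{code_point:04X}")
print(data.decode("utf-8") == chr(code_point))
```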

Origin blog.csdn.net/yocencyy/article/details/105935077