Reposted from https://www.zhihu.com/question/23374078
Saved for personal study; will be removed on request if it infringes.
------------------------------------
Simply put:
- Unicode is the "character set"
- UTF-8 is the "encoding rule"
Where:
- Character set: assigns a unique ID to each "character" (formally called a code point)
- Encoding rule: the rule for converting a code point into a byte sequence (encoding and decoding can loosely be thought of as analogous to encryption and decryption)
Unicode in the broad sense is a standard that defines both a character set and a family of encoding rules: the Unicode character set plus encodings such as UTF-8, UTF-16, and UTF-32.
The Unicode character set assigns a code point to each character. For example, the code point of the character 知 ("knowledge") is 30693, written U+77E5 (30693 in hexadecimal is 0x77E5).
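The distinction between a code point and its encodings can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are made up for the example, and the big-endian codec variants are chosen just to avoid a byte order mark in the output.

```python
# One character, one code point, several possible byte encodings.
ch = "知"
cp = ord(ch)                      # code point as an integer
print(cp, hex(cp), f"U+{cp:04X}")  # 30693 0x77e5 U+77E5
assert chr(0x77E5) == ch           # chr() goes from code point back to character

# The same code point under three Unicode encodings.
encodings = {name: ch.encode(name).hex() for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, hex_bytes in encodings.items():
    print(name, hex_bytes)
# utf-8 e79fa5
# utf-16-be 77e5
# utf-32-be 000077e5
```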
UTF-8, as its name suggests, is a variable-length encoding whose code unit is 8 bits. It encodes each code point into 1 to 4 bytes:
U+0000  ~ U+007F:    0XXXXXXX
U+0080  ~ U+07FF:    110XXXXX 10XXXXXX
U+0800  ~ U+FFFF:    1110XXXX 10XXXXXX 10XXXXXX
U+10000 ~ U+10FFFF:  11110XXX 10XXXXXX 10XXXXXX 10XXXXXX
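The four ranges in the table can be written as a small lookup, sketched below. The function name `utf8_length` is hypothetical, and the sketch does not special-case the surrogate range, which the table does not mention either. It is checked against Python's built-in encoder for one sample character per row.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a code point, per the table's ranges."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        return 1   # 0XXXXXXX
    if code_point <= 0x7FF:
        return 2   # 110XXXXX 10XXXXXX
    if code_point <= 0xFFFF:
        return 3   # 1110XXXX 10XXXXXX 10XXXXXX
    return 4       # 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX

# One sample character from each row, compared with the built-in encoder.
for sample in ("A", "é", "知", "😀"):
    assert utf8_length(ord(sample)) == len(sample.encode("utf-8"))
```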
According to the encoding rules in the table above, the code point U+77E5 of the character 知 ("knowledge") from the earlier example falls in the range of the third row:
   7    7    E    5
0111 0111 1110 0101            77E5 in binary
--------------------------
    0111   011111   100101     77E5 in binary, regrouped as 4 + 6 + 6 bits
1110XXXX 10XXXXXX 10XXXXXX     template (third row of the table above)
11100111 10011111 10100101     bits substituted into the template
   E   7    9   F    A   5
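This is the process of encoding U+77E5 into the byte sequence E7 9F A5 according to UTF-8. Decoding is the same process in reverse: strip the template bits and concatenate the remaining payload bits to recover the code point.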
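The bit manipulation in the worked example can be sketched in Python as follows. The function names `utf8_encode` and `utf8_decode_one` are hypothetical, and this is a minimal illustration rather than a conforming codec: it does not reject overlong forms, surrogates, or out-of-range values, all of which a real decoder has to handle.

```python
def utf8_encode(code_point: int) -> bytes:
    """Fill the template for the matching range with the code point's bits."""
    if code_point <= 0x7F:
        return bytes([code_point])
    if code_point <= 0x7FF:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def utf8_decode_one(data: bytes) -> int:
    """Decode the first code point in data by stripping the template bits."""
    first = data[0]
    if first < 0x80:                 # 0XXXXXXX
        return first
    if first >> 5 == 0b110:          # 110XXXXX
        extra, cp = 1, first & 0x1F
    elif first >> 4 == 0b1110:       # 1110XXXX
        extra, cp = 2, first & 0x0F
    elif first >> 3 == 0b11110:      # 11110XXX
        extra, cp = 3, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) < 1 + extra:
        raise ValueError("truncated sequence")
    for byte in data[1:1 + extra]:
        if byte >> 6 != 0b10:        # continuation bytes look like 10XXXXXX
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (byte & 0x3F)
    return cp


encoded = utf8_encode(0x77E5)
print(encoded.hex())                    # e79fa5
print(hex(utf8_decode_one(encoded)))    # 0x77e5
assert encoded == "知".encode("utf-8")  # agrees with the built-in encoder
```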