The relationship between Unicode and UTF-8

https://blog.csdn.net/xiaolei1021/article/details/52093706

 

Unicode, also known as the Universal Code, specifies which binary code corresponds to each symbol, but it does not specify how those binary codes are stored as bytes.

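Suppose, purely for illustration, that the Unicode code of 'a' were 0x0d12 and the Unicode code of 'b' were 0x23d4. Given the bytes 0x0d12 23d4, a reader still could not safely interpret them as 'ab': nothing in the bytes says whether they hold two 2-byte codes or one 4-byte code, and read as a single value, 0x0d1223d4 is a different code altogether.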

So a storage method for Unicode codes has to be specified. UTF-8 is the most widely used Unicode encoding on the Internet.

UTF-8 is a variable-length encoding. To save space, English letters and digits are stored in 1 byte (compatible with ASCII), and Chinese characters are generally stored in 3 bytes.
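As a quick check of the variable-length claim, the following sketch uses Python's built-in "utf-8" codec to count the bytes each character takes. The sample characters are chosen here for illustration and do not come from the original article.

```python
# Count how many bytes a few sample characters take in UTF-8.
# Escapes are used so the exact code points are unambiguous.
samples = [
    "a",           # U+0061, English letter
    "7",           # U+0037, digit
    "\u00e9",      # U+00E9, Latin letter e with acute accent
    "\u4e25",      # U+4E25, a Chinese character
    "\U0001F600",  # U+1F600, an emoji outside the 16-bit range
]

byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")

# For ASCII characters the UTF-8 bytes are identical to the ASCII bytes.
ascii_compatible = "a".encode("utf-8") == "a".encode("ascii")
```

The letter and the digit take 1 byte each, the accented letter 2, the Chinese character 3, and the emoji 4.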

The encoding rules of UTF-8 are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits hold the Unicode code of the symbol. So for English letters, the UTF-8 encoding and the ASCII code are identical.

2) For a symbol that takes n bytes (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits hold the Unicode code of the symbol.

The following table summarizes the encoding rules; the letter x marks the bits available for the Unicode code.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
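The table can be turned into code almost row by row. Below is a minimal sketch of an encoder, not a production implementation; the function name `utf8_encode` is hypothetical. The surrogate check is not in the table above, but UTF-8 does not allow code points D800-DFFF to be encoded.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the four table rows."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        # Row 1: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Row 2: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # Row 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # Row 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])
```

For any valid code point `cp`, the result should match what Python's own codec gives for `chr(cp).encode("utf-8")`.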

With this table, decoding UTF-8 is straightforward. If the first bit of a byte is 0, the byte is a complete character on its own; if the first bit is 1, the number of consecutive leading 1s tells how many bytes the current character occupies. A byte that starts with 10 is one of the following bytes of a multi-byte character, never the start of one.
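The decoding rule can be sketched as two small helpers: one reads the sequence length from the first byte, the other walks a byte string and cuts it into per-character chunks. The names `sequence_length` and `split_characters` are hypothetical. The sketch assumes well-formed input and does not validate the following bytes.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte occupies."""
    if (first_byte & 0b10000000) == 0:
        return 1  # 0xxxxxxx: a single-byte character
    if (first_byte & 0b11100000) == 0b11000000:
        return 2  # 110xxxxx: two leading 1s
    if (first_byte & 0b11110000) == 0b11100000:
        return 3  # 1110xxxx: three leading 1s
    if (first_byte & 0b11111000) == 0b11110000:
        return 4  # 11110xxx: four leading 1s
    # 10xxxxxx is a following byte; anything else is not a valid first byte.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 character")


def split_characters(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks
```

For example, `split_characters("a\u4e25b".encode("utf-8"))` yields three chunks of 1, 3, and 1 bytes.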

Next, take the Chinese character "严" (yán, "strict") as an example to demonstrate how UTF-8 encoding works.

The Unicode code of "严" is 4E25 (100111000100101 in binary). From the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of "严" needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of "严", fill in the x positions from back to front, and pad the leftover x positions with 0. The resulting UTF-8 encoding of "严" is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
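The same worked example can be traced in a few lines of Python. This sketch works on bit strings rather than integers, so that each step mirrors the description above; the variable names are illustrative only.

```python
code_point = 0x4E25  # the character from the worked example

# The three-byte format has 4 + 6 + 6 = 16 x positions. Padding the binary
# form to 16 digits is the "pad the leftover positions with 0" step.
bits = format(code_point, "016b")  # "0100111000100101"

# Fill the x positions of 1110xxxx 10xxxxxx 10xxxxxx.
byte1 = "1110" + bits[0:4]
byte2 = "10" + bits[4:10]
byte3 = "10" + bits[10:16]

encoded = bytes(int(b, 2) for b in (byte1, byte2, byte3))

print(byte1, byte2, byte3)    # 11100100 10111000 10100101
print(encoded.hex().upper())  # E4B8A5
```

Decoding `encoded` with Python's "utf-8" codec gives back the original character, which confirms the hand calculation.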
