UTF-8 is a variable-length encoding

While reading the documentation about UTF-8 variable-length encoding, I came across the following content.

In the document, a leading byte of the form 1110xxxx means the character occupies 3 bytes, and a byte of the form 10xxxxxx is a continuation byte. The three consecutive 1s in the high bits of 1110 indicate that 3 bytes, starting from the current byte, take part in representing the Unicode code point; the single leading 1 (followed by a 0) of a continuation byte marks just that one byte of the sequence. What does this mean exactly? It should be understood as follows, in particular the table below:

UTF-8 is a variable-length byte encoding. For the UTF-8 encoding of a given character: if it uses only one byte, the highest bit is 0; if it uses multiple bytes, the number of consecutive 1 bits starting from the highest bit of the first byte determines how many bytes the encoding occupies, and every remaining byte starts with 10. UTF-8 can use up to 6 bytes (in its original design; modern UTF-8, as standardized in RFC 3629, is limited to 4 bytes).
As shown in the table:
1 byte   0xxxxxxx
2 bytes  110xxxxx 10xxxxxx
3 bytes  1110xxxx 10xxxxxx 10xxxxxx
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
So the actual number of bits UTF-8 can use to represent a character code is at most 31, namely the bits marked x in the table above. Apart from the control bits (the leading 10 of each continuation byte, the length prefix of the first byte, and so on), the bits marked x correspond one-to-one with the bits of the Unicode code point, in the same bit order.
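As a concrete illustration of this bit correspondence, here is a minimal C sketch (my own example, not from the original document) that takes the 3-byte sequence E4 B8 AD, counts the leading 1 bits of the first byte to find the sequence length, and concatenates the x bits to recover the code point U+4E2D ('中'):

```c
#include <stdio.h>

int main(void)
{
    /* UTF-8 bytes of U+4E2D ('中'): 11100100 10111000 10101101 */
    const unsigned char s[] = { 0xE4, 0xB8, 0xAD };
    unsigned char lead = s[0];
    unsigned int cp;
    int len;

    if ((lead & 0x80) == 0) {            /* 0xxxxxxx: a single 7-bit byte */
        len = 1;
        cp = lead;
    } else {
        len = 0;
        while (lead & (0x80 >> len))     /* count the leading 1 bits */
            len++;
        cp = lead & (0xFF >> (len + 1)); /* keep only the x bits of the lead byte */
        for (int i = 1; i < len; i++)    /* each 10xxxxxx byte contributes 6 x bits */
            cp = (cp << 6) | (s[i] & 0x3F);
    }

    printf("%d bytes, code point U+%04X\n", len, cp);  /* prints: 3 bytes, code point U+4E2D */
    return 0;
}
```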
When actually converting a Unicode code point to UTF-8, first strip the high-order zero bits, then determine the minimum number of UTF-8 bytes required from the number of remaining significant bits.
Therefore, characters in the basic ASCII character set (Unicode is compatible with ASCII) can be represented with just one byte of UTF-8 (7 significant bits).
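Following that procedure, a rough C sketch of the conversion might look like this. It implements the original 1-to-6-byte table shown above (modern UTF-8, per RFC 3629, stops at 4 bytes), and utf8_encode is just an illustrative name, not a standard library function:

```c
#include <stdio.h>
#include <stdint.h>

/* Encode a code point into UTF-8 following the original 1-to-6-byte scheme.
 * Writes the bytes to out (which must hold at least 6) and returns how many
 * bytes were written, or 0 if the value does not fit in 31 bits. */
static int utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                 /* up to 7 bits  -> 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {         /* up to 11 bits -> 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {       /* up to 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else if (cp < 0x200000) {      /* up to 21 bits -> 4 bytes */
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    } else if (cp < 0x4000000) {     /* up to 26 bits -> 5 bytes */
        out[0] = 0xF8 | (cp >> 24);
        out[1] = 0x80 | ((cp >> 18) & 0x3F);
        out[2] = 0x80 | ((cp >> 12) & 0x3F);
        out[3] = 0x80 | ((cp >> 6) & 0x3F);
        out[4] = 0x80 | (cp & 0x3F);
        return 5;
    } else if (cp < 0x80000000) {    /* up to 31 bits -> 6 bytes */
        out[0] = 0xFC | (cp >> 30);
        out[1] = 0x80 | ((cp >> 24) & 0x3F);
        out[2] = 0x80 | ((cp >> 18) & 0x3F);
        out[3] = 0x80 | ((cp >> 12) & 0x3F);
        out[4] = 0x80 | ((cp >> 6) & 0x3F);
        out[5] = 0x80 | (cp & 0x3F);
        return 6;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[6];
    int n = utf8_encode(0x4E2D, buf);        /* U+4E2D, the character '中' */
    printf("U+4E2D ->");
    for (int i = 0; i < n; i++)
        printf(" %02X", (unsigned)buf[i]);   /* expected output: E4 B8 AD */
    printf("\n");
    return 0;
}
```

Each branch simply checks how many significant bits the code point has and picks the shortest row of the table that can hold them, which is exactly the "minimum number of bytes" rule described above.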

Origin blog.csdn.net/u013171226/article/details/131195669