Character encoding: does a Chinese character occupy three bytes in UTF-8?

This is a good question, good enough to use on a written exam. Let's start with character encoding itself.

1. The Americans were the first to encode their English characters, which gave us the earliest code, ASCII. The lower 7 bits of a byte represent the 128 ASCII characters, and the high bit is always 0;

2. Later, the Europeans found that 128 characters were nowhere near enough. For example, my noble French has letters with accent marks. How do you tell those apart? Well, let's put the high bit to work too. So European encodings generally use the full byte and can represent at most 256 characters. Europeans and Americans like to keep things simple: few characters, so few bits per character;

3. But even with so few bits, different countries and regions used different character encodings. The symbols for 0 to 127 are the same everywhere, but the range 128 to 255 is interpreted completely differently from one encoding to the next. The binary value is exactly the same, yet the character it stands for is not. For example, 135 is a completely different symbol in the French, Hebrew, and Russian encodings;
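
To make point 3 concrete, here is a minimal Python sketch that decodes byte value 135 under three legacy single-byte code pages. The choice of `cp437`, `cp862`, and `cp866` is an assumption made for illustration; the original answer does not say which encodings it had in mind.

```python
import unicodedata

# The same single byte, value 135 (0x87).
raw = bytes([135])

# Assumed code pages: DOS Western European, DOS Hebrew, DOS Russian.
code_pages = ["cp437", "cp862", "cp866"]

# Decode the identical byte under each code page.
decoded = {name: raw.decode(name) for name in code_pages}

for name in code_pages:
    ch = decoded[name]
    # Print code point and official name (ASCII-only output).
    print(name, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The three lines of output name three unrelated letters (one Latin, one Hebrew, one Cyrillic), even though the stored byte never changed.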

4. Even more troublesome: once this high-tech computer thing reached China, the Chinese found that we have more than 100,000 Chinese characters, and the 256 slots used in Europe and America were not even enough to get stuck between our teeth. So the GB2312 Chinese character encoding was invented. It typically uses 2 bytes per character and covers most of the commonly used Chinese characters. Two bytes can distinguish at most 65,536 values, and GB2312 itself defines only 6,763 Chinese characters, so it is easy to see why some characters that you can find in the Xinhua Dictionary cannot be displayed by a computer that was never taught about them.
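
As a quick check of point 4, the sketch below uses Python's built-in `gb2312` codec. The character 严 is borrowed from point 8 of this answer; the second character, U+20000, is only an assumed example of a rare character that lies outside GB2312.

```python
ch = "严"  # U+4E25, a commonly used character
gb_bytes = ch.encode("gb2312")
print(len(gb_bytes), gb_bytes.hex())  # two bytes

# A rare character (CJK Extension B), picked here purely as an illustration.
rare = "\U00020000"
try:
    rare.encode("gb2312")
    covered = True
except UnicodeEncodeError:
    # GB2312 has no slot for this character, so encoding fails.
    covered = False
print(covered)
```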

5. Everyone encoded with a different character set, so how was the world supposed to agree? A Russian sends an email to a Chinese person, the character sets on the two sides differ, and the text shows up garbled. To unify things, Unicode was invented. It takes in all the symbols in the world and gives each one a unique code. Unicode has room for more than 1 million symbols, each with its own code. With that unification, text in any language can be exchanged, and the languages of many countries can be displayed on one web page at the same time.
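
A small sketch of the idea in point 5, assuming Python 3, where every string is a sequence of Unicode code points. The sample letters are arbitrary choices for illustration.

```python
import sys

# Latin, Cyrillic, and Chinese characters living in one string.
text = "Aй严"
code_points = [f"U+{ord(c):04X}" for c in text]
print(code_points)  # ['U+0041', 'U+0439', 'U+4E25']

# Highest code point Python accepts, and the size of the code space.
total = sys.maxunicode + 1
print(hex(sys.maxunicode), total)  # 0x10ffff 1114112
```

The code space of 1,114,112 values is where the "more than 1 million" figure comes from.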

6. However, although Unicode unifies the codes of characters all over the world, it does not specify how to store them. Byte order (little-endian or big-endian) already varies between machine architectures, never mind how a computer is supposed to tell whether a run of bytes is Unicode or ASCII. If Unicode required every symbol to be stored in three or four bytes, each English letter would have to be preceded by two or three zero bytes, and a text file would grow to three or four times its size. That is a huge waste of storage. The consequence is that there are multiple ways of storing Unicode.
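
Both problems in point 6 can be seen with Python's standard codecs. This is only an illustrative sketch; the sample strings are assumptions, and UTF-32 and UTF-16 stand in for "fixed-width storage" and "byte order" respectively.

```python
text = "hello"

# Fixed four-byte storage pads every English letter with three zero bytes.
utf32 = text.encode("utf-32-be")
utf8 = text.encode("utf-8")
print(len(utf8), len(utf32))  # 5 20
print(utf32[:4].hex())        # 00000068

# The same code point stored in the two byte orders.
han = "严"  # U+4E25
big = han.encode("utf-16-be")
little = han.encode("utf-16-le")
print(big.hex(), little.hex())  # 4e25 254e
```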

7. With the rise of the Internet, web pages had to display all kinds of characters, so a unified scheme became a must. UTF-8 is one of the most important implementations of Unicode; others include UTF-16 and UTF-32. UTF-8 is a variable-length encoding, not a fixed-length one. It uses 1 to 4 bytes per symbol, and the length depends on the symbol. The design is rather clever. If the first bit of a byte is 0, the byte is a complete character by itself. If the first bit is 1, the number of consecutive leading 1s tells you how many bytes the current character occupies (the remaining bytes of that character each begin with 10).
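
The leading-bit rule from point 7 can be sketched in a few lines of Python. The function name `utf8_sequence_length` is hypothetical, and the sketch is simplified: it only counts leading 1 bits and does not reject every byte value that real UTF-8 forbids as a first byte.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length implied by the first byte of a UTF-8 character."""
    if lead & 0b1000_0000 == 0:
        return 1  # first bit is 0: a single-byte character
    ones = 0
    mask = 0b1000_0000
    while lead & mask:  # count consecutive leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte; more than four 1s is not valid UTF-8.
        raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")
    return ones


samples = ["A", "é", "严", "\U0001F600"]
lengths = [utf8_sequence_length(s.encode("utf-8")[0]) for s in samples]
print(lengths)  # [1, 2, 3, 4]
```

A decoder can therefore read one byte and know immediately how many more bytes belong to the same character.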

8. Note that a character's Unicode code and its UTF-8 storage encoding are two different things. For example, the Unicode code of the character 严 ("strict") is 4E25, while its UTF-8 encoding is E4B8A5. As explained in point 7, UTF-8 has to take care of both the code and the way it is stored: E4B8A5 is what you get when the bits of 4E25 are packed in around the marker bits that identify the storage format.
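
Here is a hedged sketch of how 4E25 turns into E4B8A5, written by hand rather than through a library call. The helper `encode_three_byte` is a hypothetical name; it covers only the three-byte range and, for brevity, does not exclude the surrogate range that real UTF-8 disallows.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Pack a code point into the 1110xxxx 10xxxxxx 10xxxxxx layout."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    first = 0b1110_0000 | (code_point >> 12)           # marker 1110 + top 4 bits
    second = 0b1000_0000 | ((code_point >> 6) & 0x3F)  # marker 10 + middle 6 bits
    third = 0b1000_0000 | (code_point & 0x3F)          # marker 10 + low 6 bits
    return bytes([first, second, third])


packed = encode_three_byte(0x4E25)
print(f"{0x4E25:016b}")  # 0100111000100101
print(packed.hex())      # e4b8a5
```

The 16 bits of 4E25 are split 4 + 6 + 6 and dropped into the x positions, which is why the stored bytes look nothing like the code itself.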

9. UTF-8 uses one to four bytes to encode each character. The 128 ASCII characters (Unicode range U+0000 to U+007F) need just one byte. Latin letters with diacritics, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Maldivian script (Unicode range U+0080 to U+07FF) need two bytes. The other characters of the Basic Multilingual Plane (BMP) use three bytes (note: CJK falls into this category). Characters in the other Unicode planes, the supplementary planes, use four bytes.
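
The ranges in point 9 translate directly into a lookup. The sketch below, with the hypothetical helper name `utf8_length`, compares that lookup against Python's own encoder at the boundary of each range. Surrogate code points are left out because they cannot be encoded.

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 bytes for a code point, following the ranges in point 9."""
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2  # accented Latin, Greek, Cyrillic, Hebrew, Arabic, ...
    if code_point <= 0xFFFF:
        return 3  # rest of the BMP, including most CJK
    return 4      # supplementary planes


# First and last code point of each range.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
results = [(cp, utf8_length(cp), len(chr(cp).encode("utf-8"))) for cp in boundaries]

for cp, expected, actual in results:
    print(f"U+{cp:04X}", expected, actual)
```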

10. Finally, to answer your question: how many bytes does a Chinese character occupy in UTF-8? Usually 3 bytes. The most common encoding pattern is 1110xxxx 10xxxxxx 10xxxxxx.
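
The word "usually" matters. As a rough illustration in Python, a common character such as 中 fits the three-byte pattern, while a rare character from the supplementary planes takes four bytes. Both sample characters are assumptions chosen for the example.

```python
common = "中"        # U+4E2D, in the Basic Multilingual Plane
rare = "\U00020000"  # U+20000, CJK Unified Ideographs Extension B

encoded = {}
for ch in (common, rare):
    data = ch.encode("utf-8")
    encoded[ord(ch)] = data
    # Show the code point, the byte count, and the raw bit pattern.
    print(f"U+{ord(ch):04X}", len(data), " ".join(f"{b:08b}" for b in data))
```

The first line of output shows the 1110xxxx 10xxxxxx 10xxxxxx pattern; the second starts with 11110xxx and runs to four bytes.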


from:
https://zhidao.baidu.com/question/1047887004693001899.html
