The relationship between ASCII, Unicode, and UTF-8

  Because computers were first developed in the United States, the earliest encoding covered only 128 characters (codes 0 to 127): uppercase and lowercase English letters, digits, and some symbols and control codes. This table is called ASCII. For example, the uppercase letter A is encoded as 65 and the lowercase letter z as 122. One byte is clearly not enough for Chinese, which needs at least two bytes, and the result must not conflict with ASCII, so China developed the GB2312 encoding for Chinese. There are hundreds of languages in the world: Japan encoded Japanese in Shift_JIS, and South Korea encoded Korean in EUC-KR. With every country following its own standard, conflicts were inevitable, and text that mixed several languages would show up as garbled characters. Unicode was created to solve this. It brings all languages into a single character set, which removes this source of garbled text.
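  The conflict between national encodings can be reproduced in a few lines. The following is a minimal sketch, assuming Python 3 and its built-in `gb2312` and `shift_jis` codecs; the sample string is a hypothetical choice, not taken from the original article.

```python
# Hypothetical sample text; any Chinese string would do.
text = "中文"

# Encode with GB2312, which uses two bytes per Chinese character.
raw = text.encode("gb2312")

# Decode the same bytes under a different national encoding.
# errors="replace" keeps the demo from raising on invalid byte sequences.
garbled = raw.decode("shift_jis", errors="replace")

# Decoding with the matching encoding recovers the original text.
restored = raw.decode("gb2312")

print(len(raw))   # number of bytes produced by GB2312
print(garbled)    # not the original text
print(restored)   # the original text again
```

  The bytes themselves never change; only the table used to interpret them does, which is why the same file can look fine on one system and garbled on another.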

  The Unicode standard continues to evolve. The most common fixed-size representation uses two bytes per character, as UTF-16 does for most characters, with four bytes for very rare characters. Modern operating systems and most programming languages support Unicode directly.

Now compare ASCII and Unicode: an ASCII character takes 1 byte, while a Unicode character usually takes 2 bytes.

  The letter A is encoded in ASCII as decimal 65, binary 01000001;

  The character 0 is encoded in ASCII as decimal 48, binary 00110000. Note that the character '0' and the integer 0 are different;

  The Chinese character 中 is outside the range of ASCII; its Unicode encoding is decimal 20013, binary 01001110 00101101.

  To express the ASCII character A in Unicode, you only need to pad it with zeros in front. The Unicode encoding of A is therefore 00000000 01000001.
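  These numbers can be checked directly. The sketch below assumes Python 3, where `ord` returns a character's Unicode code point; the helper name `to_16_bits` is made up for this illustration.

```python
def to_16_bits(ch):
    """Return the code point of ch as two space-separated 8-bit groups."""
    bits = format(ord(ch), "016b")  # zero-padded to 16 binary digits
    return bits[:8] + " " + bits[8:]

for ch in ["A", "0", "中"]:
    print(ch, ord(ch), to_16_bits(ch))
# A 65 00000000 01000001
# 0 48 00000000 00110000
# 中 20013 01001110 00101101

# The character '0' is not the integer 0.
print(ord("0") == 0)  # False
```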

  A new problem then appears: unifying everything into Unicode eliminates garbled characters, but if your text is almost entirely English, Unicode needs twice as much storage as ASCII, which is wasteful for both storage and transmission.
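  The doubling is easy to measure. This is a rough sketch, assuming Python 3 and using the `utf-16-be` codec as a stand-in for the two-byte Unicode form described above (it writes no byte order mark); the sample sentence is arbitrary.

```python
text = "Hello, world"  # 12 characters, all within ASCII

ascii_bytes = text.encode("ascii")         # 1 byte per character
two_byte_bytes = text.encode("utf-16-be")  # 2 bytes per character here

print(len(ascii_bytes), len(two_byte_bytes))  # 12 24
```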

  Therefore, in the interest of economy, the "variable-length" encoding UTF-8 was introduced. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its code point (the original design allowed up to 6 bytes, but the current standard limits it to 4): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. If the text you want to transmit consists largely of English characters, encoding it in UTF-8 saves space.
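  The variable length is visible when individual characters are encoded. A small sketch, assuming Python 3; the emoji is added here only as an example of a character outside the two-byte range and does not come from the original text.

```python
samples = ["A", "中", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(ch, len(encoded), hex_bytes)
# A 1 41
# 中 3 e4 b8 ad
# 😀 4 f0 9f 98 80
```

  Note that the UTF-8 encoding of A is the same single byte as its ASCII encoding, so pure ASCII text is already valid UTF-8.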
