The difference between ASCII, Unicode and UTF-8 encoding

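Summary:

ASCII encodes 128 characters.
GB2312 was created to encode Chinese, Japan encoded Japanese in Shift_JIS, and so on.
Unicode was created to solve the garbled text caused by conflicting national encodings, but stored at a fixed two bytes per character it wastes space.
UTF-8 encodes each Unicode character into a variable number of bytes depending on its code point: English characters take 1 byte, Chinese characters usually take 3 bytes, and rare characters take 4 bytes (the original design allowed up to 6 bytes, but the current standard stops at 4).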

Common encodings at a glance:

Encoding   Purpose                                                                Bytes per character
ASCII      English letters, digits and common symbols                             1
GB2312     Chinese national standard for Simplified Chinese, ASCII-compatible     1 (ASCII) or 2 (Chinese)
Unicode    Unified international character set covering all languages            2 as commonly stored (UTF-16), 4 for rare characters
GBK        Extension of GB2312, adds Traditional Chinese, GB2312-compatible       1 (ASCII) or 2 (Chinese)
UTF-8      Variable-length encoding of Unicode                                    1-4

In detail: at first only 128 characters (codes 0 to 127) were encoded into the computer, namely uppercase and lowercase English letters, digits, and some symbols and control characters. This code table is called ASCII. For example, the uppercase letter A has the code 65, and the lowercase letter a has the code 97.
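The code values above can be checked directly. The following is a minimal Python sketch (the variable names are only illustrative) that looks up the two codes and shows that a Chinese character does not fit in ASCII:

```python
# ord() returns the numeric code of a character.
upper_a = ord("A")
lower_a = ord("a")
print(upper_a, lower_a)  # 65 97

# An ASCII character is stored as a single byte.
ascii_bytes = "A".encode("ascii")
print(len(ascii_bytes))  # 1

# A character outside the ASCII table cannot be encoded with it.
try:
    "中".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)  # False
```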

But one byte is clearly not enough to represent Chinese. At least two bytes are needed, and they must not conflict with ASCII, so China created the GB2312 encoding for Chinese.
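As a small illustration, assuming Python's built-in "gb2312" codec, a Chinese character takes two bytes while an ASCII letter keeps its single ASCII byte:

```python
# A Chinese character occupies two bytes in GB2312.
chinese = "中".encode("gb2312")
print(chinese.hex(), len(chinese))  # d6d0 2

# ASCII letters keep their one-byte ASCII value, so the two do not conflict.
latin = "A".encode("gb2312")
print(latin, len(latin))  # b'A' 1

# Both bytes of the Chinese character are >= 0x80, outside the ASCII range.
outside_ascii = all(b >= 0x80 for b in chinese)
print(outside_ascii)  # True
```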

As you can imagine, there are hundreds of languages in the world. Japan encoded Japanese in Shift_JIS, and South Korea encoded Korean in EUC-KR. Each country has its own standard, so conflicts are inevitable. The result is that text mixing several languages is displayed as garbled characters.
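The garbling happens when bytes written in one national encoding are read with another. A minimal sketch, assuming Python's built-in "gb2312" and "shift_jis" codecs and an arbitrary sample string:

```python
original = "中文"

# Bytes as a Chinese system would store them.
raw = original.encode("gb2312")

# A Japanese system reading the same bytes interprets them differently.
# errors="replace" keeps the sketch from raising on invalid sequences.
misread = raw.decode("shift_jis", errors="replace")

# Reading with the encoding that was used for writing recovers the text.
correct = raw.decode("gb2312")

print(misread)
print(correct)
```

The bytes themselves never change; only the table used to interpret them does.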

Therefore, Unicode came into being. Unicode unifies all languages into a single character set, so the garbled-text problem goes away.

The Unicode standard is constantly evolving, but the most common representation uses two bytes per character, or four bytes for very rare characters (this form is known as UTF-16). Modern operating systems and most programming languages support Unicode directly.
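The two-byte and four-byte cases can be observed with Python's "utf-16-le" codec, which is used here as one concrete example of the two-byte representation. The sample characters are arbitrary choices:

```python
# A common Chinese character fits in one 16-bit unit: 2 bytes.
common = "中".encode("utf-16-le")

# U+20000 is a rare CJK character beyond the 16-bit range.
# It needs a pair of 16-bit units: 4 bytes.
rare = "\U00020000".encode("utf-16-le")

print(len(common), len(rare))  # 2 4
```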

A new problem then appears. Switching everything to Unicode removes the garbled-text problem, but if the text is mostly English, the two-byte Unicode representation needs twice as much space as ASCII, which is wasteful for both storage and transmission.
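The doubling is easy to measure. A minimal sketch with an arbitrary English sample string, again using "utf-16-le" as the two-byte representation:

```python
text = "Hello, world"

ascii_size = len(text.encode("ascii"))
utf16_size = len(text.encode("utf-16-le"))

print(ascii_size, utf16_size)  # 12 24
```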

Therefore, in the spirit of saving space, UTF-8 appeared, which turns Unicode into a variable-length encoding.

UTF-8 encodes each Unicode character into 1 to 4 bytes depending on its code point: common English characters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. (The original design allowed sequences of up to 6 bytes, but the current standard limits UTF-8 to 4.) If the text you are transferring contains a lot of English characters, encoding it in UTF-8 saves space.
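The byte counts can be verified per character. In this sketch the sample characters are arbitrary picks, one for each length:

```python
samples = ["A", "é", "中", "\U00020000"]

# Map each character to the number of bytes UTF-8 uses for it.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, size in utf8_lengths.items():
    print(f"U+{ord(ch):04X} -> {size} byte(s)")
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+4E2D -> 3 byte(s)
# U+20000 -> 4 byte(s)
```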

UTF-8 encoding has the added benefit that ASCII encoding can actually be seen as part of UTF-8 encoding, so a lot of legacy software that only supports ASCII encoding can continue to work under UTF-8 encoding.
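This compatibility can be checked in a few lines. A minimal sketch with an arbitrary ASCII-only sample string:

```python
text = "plain ASCII text"

# ASCII-only text produces identical bytes in both encodings.
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)  # True

# Bytes written by an ASCII-only program decode cleanly as UTF-8.
decoded = b"plain ASCII text".decode("utf-8")
print(decoded == text)  # True
```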
