221014_175513-encoding Unicode, UTF-8 and GBK

Foreword

Before introducing these encodings, let's start with ASCII. In a computer, 1 byte consists of 8 bits, and each bit has two states, 0 and 1, so 1 byte can take 2^8 = 256 distinct values. If each value corresponds to a symbol, 1 byte of data can represent up to 256 characters. ASCII is a set of codes that maps the characters used in English to these binary numbers; it uses only the low 7 bits, so it defines 128 characters. But with so many languages in the world, how could 128 characters be enough? This is why Unicode appeared.
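The arithmetic above can be checked with a short Python sketch. The variable names are only illustrative; `ord`, `format`, and `str.encode` are standard Python built-ins.

```python
# One byte has 8 bits, so it can take 2**8 distinct values.
states_per_byte = 2 ** 8

# ASCII only uses 7 bits, so it defines 2**7 characters.
ascii_size = 2 ** 7

# The letter "A" maps to the number 65, stored in one byte as 01000001.
letter_code = ord("A")
letter_bits = format(letter_code, "08b")

# Encoding English text as ASCII gives exactly one byte per character.
ascii_bytes = "Hi".encode("ascii")

print(states_per_byte, ascii_size)
print(letter_code, letter_bits)
print(list(ascii_bytes))
```

Trying `"中".encode("ascii")` raises `UnicodeEncodeError`, which is the limitation the rest of this article is about.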

1. Unicode encoding

Unicode assigns a unique number (a code point) to every character of every language in the world, but it does not stipulate how the binary code for a character is stored. The later a character appears in the code point order, the more bytes it needs and the more space it occupies. Unifying everything under Unicode solves the problem of garbled characters, but if a text is entirely English (ASCII characters need the fewest bytes), storing every code point at a fixed width of 2 or 4 bytes takes two or more times the space of ASCII. This produces a lot of redundant data during storage and network transmission and wastes space. The UTF encodings were created to address this problem.
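As a small illustration of Unicode as a numbering of characters, independent of how they are stored, here is a Python sketch. The sample characters are my own choice, not from any standard list.

```python
# Each character has one Unicode code point, regardless of how it is stored.
samples = ["A", "ß", "中", "😀"]

code_points = [ord(ch) for ch in samples]
labels = [f"U+{cp:04X}" for cp in code_points]

for ch, label in zip(samples, labels):
    print(ch, label)

# chr() goes the other way: from a code point back to the character.
han = chr(0x4E2D)
print(han)
```

Note how the numbers grow: the ASCII letter has a small code point, while the emoji needs more than 16 bits to write down.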

2. UTF-8 encoding

Unicode Transformation Format, abbreviated UTF, defines how Unicode characters are converted to bytes, with the aim of saving space during storage and network transmission.

There are three common UTF encodings:

UTF-32: uses 4 bytes for every character. The width is fixed, but for most text it is the most wasteful of the three.

UTF-16: uses 2 or 4 bytes per character; 2 bytes are used where possible, otherwise 4. More space-efficient than UTF-32, but not perfect.

UTF-8: uses 1, 2, 3, or 4 bytes per character. One byte is tried first, and another byte is added whenever that is not enough, up to 4 bytes. English letters take one byte, most other European letters take two, most East Asian characters take three, and other special characters take four. For text that is mostly ASCII, it saves the most space of the three.
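The byte counts for the three encodings can be compared directly in Python. This is a minimal sketch: the helper `byte_lengths` and the sample characters are illustrative, and the `-le` codec variants are used so that no byte order mark is added to the counts.

```python
samples = ["A", "ß", "中", "😀"]

def byte_lengths(ch):
    """Return the size of one character in (UTF-8, UTF-16, UTF-32) bytes."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # little-endian variant, no BOM
        len(ch.encode("utf-32-le")),  # little-endian variant, no BOM
    )

table = {ch: byte_lengths(ch) for ch in samples}

for ch, sizes in table.items():
    print(ch, sizes)
```

The ASCII letter costs 1 byte in UTF-8 but 4 in UTF-32, while the Chinese character costs 3 bytes in UTF-8 and only 2 in UTF-16, so which encoding is smallest depends on the text.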

Note that programs commonly handle text in memory as Unicode characters and convert it to UTF-8 only when it needs to be saved to disk or transmitted.
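In Python, for example, a `str` is a sequence of Unicode characters, and it becomes bytes only when it is encoded. A minimal sketch of that round trip (the sample text is arbitrary):

```python
# In-memory text: a sequence of Unicode characters.
text = "编码 encoding"

# Convert to UTF-8 bytes, as you would before writing to disk or a socket.
data = text.encode("utf-8")

# Convert back to text after reading the bytes.
restored = data.decode("utf-8")

print(len(text), "characters")
print(len(data), "bytes")
print(restored == text)
```

The character count and the byte count differ because each of the two Chinese characters takes three bytes in UTF-8.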

3. GBK encoding

GBK uses a double-byte encoding scheme for Chinese characters while staying compatible with single-byte ASCII. Its double-byte range is 0x8140 to 0xFEFE, excluding code points whose second byte is 0x7F, for a total of 23,940 code points. It contains 21,886 Chinese characters and graphic symbols: 21,003 Chinese characters (including radicals and components) and 883 graphic symbols. GBK supports all the Chinese, Japanese, and Korean (CJK) Han characters in the international standard ISO/IEC 10646-1 and the national standard GB 13000-1, and it includes all the Chinese characters in the BIG5 encoding, so Traditional Chinese characters are supported. GBK is also the default encoding of Simplified Chinese Windows systems (code page 936), reportedly because Microsoft software was required to use GBK by default when entering the Chinese market.
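The figure of 23,940 code points can be reproduced from the byte ranges, assuming the first byte runs from 0x81 to 0xFE and the second from 0x40 to 0xFE without 0x7F. The sketch below also uses Python's built-in `gbk` codec to compare GBK with UTF-8; the variable names are illustrative.

```python
# First byte: 0x81..0xFE; second byte: 0x40..0xFE except 0x7F.
lead_bytes = range(0x81, 0xFF)
trail_bytes = [b for b in range(0x40, 0xFF) if b != 0x7F]
code_points = len(lead_bytes) * len(trail_bytes)
print(code_points)

# The same character has different bytes in GBK and UTF-8.
gbk_bytes = "中".encode("gbk")
utf8_bytes = "中".encode("utf-8")
print(gbk_bytes.hex(), utf8_bytes.hex())

# ASCII characters stay single-byte in GBK.
ascii_in_gbk = "A".encode("gbk")

# Decoding UTF-8 bytes with the wrong codec produces garbled characters.
garbled = utf8_bytes.decode("gbk", errors="replace")
print(garbled)
```

The last step shows where garbled text usually comes from: the bytes are fine, but they are read with a different encoding from the one that wrote them.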

Origin blog.csdn.net/liluo_2951121599/article/details/127325262