Understanding character sets: ASCII, GBK, Unicode, UTF-8

1. Standard ASCII character set

ASCII (American Standard Code for Information Interchange) covers English letters, digits, punctuation, and control characters.

Standard ASCII uses 1 byte to store a character. The first (highest) bit is always 0, which leaves 7 usable bits, so it can represent a total of 128 characters. That is completely sufficient for American English.
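As a minimal sketch of the point above, the following Python snippet prints each byte of an ASCII string in binary. The sample string and variable names are only illustrative.

```python
# Every standard ASCII code fits in 7 bits, so the high bit of each byte is 0.
text = "Hello, 123!"
encoded = text.encode("ascii")  # one byte per character

for ch, byte in zip(text, encoded):
    # format(..., "08b") pads to 8 binary digits; the first digit is always 0
    print(repr(ch), format(byte, "08b"))

all_high_bits_zero = all(byte < 128 for byte in encoded)
print(len(text), len(encoded), all_high_bits_zero)
```

The character count and the byte count match because each ASCII character occupies exactly one byte.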

2. GBK (Chinese Internal Code Extension Specification, a Chinese national standard)

GBK is a coded character set for Chinese that contains more than 20,000 Chinese characters plus other characters. A Chinese character in GBK is encoded and stored as two bytes.

Note: GBK is compatible with the ASCII character set.

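GBK stipulates that the first bit of the first byte of a Chinese character must be 1. This is how a decoder tells a two-byte Chinese character apart from a one-byte ASCII character, whose first bit is 0. For example, 中 is stored as the two bytes D6 D0 (binary 11010110 11010000).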

3. Unicode character set (also called the universal code)

Unicode is a character set maintained by an international organization (the Unicode Consortium) that aims to accommodate the characters and symbols of all the world's writing systems.

UTF-32 is an encoding scheme of Unicode that uses a fixed 4 bytes for every character. It has plenty of capacity, but the disadvantage is that it wastes storage space and lowers transmission efficiency.

UTF-8 is another encoding scheme of the Unicode character set. It is variable-length, with four possible lengths per character: 1 byte, 2 bytes, 3 bytes, or 4 bytes.

English letters, digits, and other ASCII characters occupy only 1 byte (compatible with standard ASCII encoding), and commonly used Chinese characters occupy 3 bytes.
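To make the size difference concrete, this small Python sketch compares the byte lengths of a few sample characters in UTF-8 and UTF-32. The sample characters are chosen here only for illustration.

```python
# An ASCII letter, a digit, an accented Latin letter, a Chinese character, an emoji.
samples = ["a", "7", "\u00e9", "中", "\U0001F600"]

utf8_lengths = {}
utf32_lengths = {}
for ch in samples:
    utf8_lengths[ch] = len(ch.encode("utf-8"))
    # "utf-32-be" is used so that no 4-byte byte-order mark is prepended
    utf32_lengths[ch] = len(ch.encode("utf-32-be"))
    print(repr(ch), utf8_lengths[ch], utf32_lengths[ch])
```

UTF-32 always reports 4 bytes, while UTF-8 ranges from 1 byte for ASCII characters up to 4 bytes for characters such as emoji.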

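UTF-8 encoding method (binary), where each x is a bit taken from the character's Unicode code point:

Code point range       Byte 1     Byte 2     Byte 3     Byte 4
U+0000  - U+007F       0xxxxxxx
U+0080  - U+07FF       110xxxxx   10xxxxxx
U+0800  - U+FFFF       1110xxxx   10xxxxxx   10xxxxxx
U+10000 - U+10FFFF     11110xxx   10xxxxxx   10xxxxxx   10xxxxxx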
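The UTF-8 bit layout (1 byte: 0xxxxxxx; 2 bytes: 110xxxxx 10xxxxxx; 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx; 4 bytes: 11110xxx followed by three 10xxxxxx bytes) can be sketched by hand in Python. The function name `utf8_encode` is hypothetical, and the sketch skips validation (surrogates, values above U+10FFFF) that a real encoder performs.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes (no validation)."""
    if code_point < 0x80:        # 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# 中 is U+4E2D, binary 0100 1110 0010 1101, which falls in the 3-byte range.
manual = utf8_encode(ord("中"))
print(manual.hex(), [format(b, "08b") for b in manual])
```

Comparing the result with Python's built-in `"中".encode("utf-8")` is a quick way to check the hand-written version.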

Note: Developers should use UTF-8 encoding in their projects!

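Summary:
1. ASCII: 1 byte per character, first bit 0; only English letters, digits, and symbols.
2. GBK: a Chinese character takes 2 bytes, with the first bit of the first byte set to 1; English letters and digits take 1 byte.
3. UTF-8: variable length from 1 to 4 bytes; English letters and digits take 1 byte, and Chinese characters take 3 bytes.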

Points to note:
1. The character set used for decoding must be the same one that was used for encoding; otherwise garbled characters will appear.
2. English letters and digits generally do not become garbled, because most character sets are compatible with ASCII encoding.
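Both points can be reproduced with a minimal Python sketch that encodes a mixed string as UTF-8 and then decodes it with the wrong character set. The sample string is illustrative, and `errors="replace"` is used only so that undecodable bytes do not raise an exception.

```python
original = "abc123中文"
utf8_bytes = original.encode("utf-8")

# Decoding with a different character set than the one used for encoding.
wrong = utf8_bytes.decode("gbk", errors="replace")
# Decoding with the matching character set.
right = utf8_bytes.decode("utf-8")

print(wrong)  # the Chinese part is garbled, the ASCII part is intact
print(right)
```

The ASCII prefix survives the mismatch because its bytes mean the same thing in both encodings, while the Chinese part does not.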

4. ISO-8859-1

The range of characters that ISO-8859-1 can represent is very narrow, and it cannot represent Chinese characters. However, because it is a single-byte encoding, which matches the most basic unit of representation in a computer, it is still used in many situations, and many protocols use it by default.

The word 中文 has no ISO-8859-1 encoding. Taking GB2312 as an example, it is two characters of two bytes each: 'd6d0 cec4'. When those bytes are treated as ISO-8859-1, they are split into 4 single-byte characters: 'd6 d0 ce c4' (in fact, storage is handled byte by byte anyway). In UTF-8 the same word is 6 bytes: 'e4 b8 ad e6 96 87'. Clearly, ISO-8859-1 only carries such text as raw bytes that were produced by another encoding, and that other encoding is needed again to recover the original characters.
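The byte values above can be verified with a short Python sketch. It also shows the common round trip in which GB2312 bytes are carried through an ISO-8859-1 string and then recovered. The variable names are illustrative.

```python
text = "中文"

gb_bytes = text.encode("gb2312")   # 2 characters x 2 bytes
utf8_bytes = text.encode("utf-8")  # 2 characters x 3 bytes
print(gb_bytes.hex(), utf8_bytes.hex())

# Interpreting the GB2312 bytes as ISO-8859-1 yields one character per byte.
as_latin1 = gb_bytes.decode("iso-8859-1")
print(len(as_latin1))

# The bytes are preserved, so re-encoding and decoding as GB2312 restores the text.
restored = as_latin1.encode("iso-8859-1").decode("gb2312")
print(restored)

# Encoding Chinese text directly as ISO-8859-1 is not possible.
try:
    text.encode("iso-8859-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False
print(direct_ok)
```

The round trip only works because the intermediate string is never modified and the decoder knows the bytes were originally GB2312.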

Origin blog.csdn.net/KevinChen2019/article/details/127678239