ASCII, GB2312, GBK, GB18030, Unicode, UTF-8 character set encoding

ASCII character set encoding

ASCII is a 7-bit encoding with the range 0x00-0x7F. The ASCII character set includes English letters, Arabic numerals, punctuation marks and similar characters. It has 33 control characters: 0x00-0x1F and 0x7F.
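As a quick check of the numbers above, the following sketch counts the ASCII code points, taking the control characters to be 0x00-0x1F plus 0x7F (DEL). The variable names are only illustrative.

```python
# All 7-bit code points: 0x00-0x7F.
all_codes = range(0x80)

# Control characters: 0x00-0x1F plus 0x7F (DEL).
control_codes = [c for c in all_codes if c < 0x20 or c == 0x7F]
printable_codes = [c for c in all_codes if c not in control_codes]

print(len(control_codes))        # 33
print(len(printable_codes))      # 95
print((0x7F).bit_length())       # 7, the largest code fits in 7 bits
```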

A system that supports only ASCII ignores the highest bit of each byte and treats only the lower 7 bits as valid. HZ is an encoding designed to transmit Chinese through systems that support only 7-bit ASCII. Many early mail systems supported only ASCII, so Chinese mail had to be sent using BASE64 or a similar encoding.
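A minimal sketch of why this matters, assuming Python's built-in gb2312 codec and the standard base64 module: masking off the high bit destroys the Chinese character, while a BASE64 wrapper keeps every transmitted byte inside the 7-bit range.

```python
import base64

raw = "啊".encode("gb2312")               # two bytes, both with the high bit set
stripped = bytes(b & 0x7F for b in raw)   # what a 7-bit-only channel would keep

encoded = base64.b64encode(raw)           # ASCII-only representation of the same bytes
restored = base64.b64decode(encoded).decode("gb2312")

print(raw.hex(), stripped)                # the stripped bytes are plain ASCII, not Chinese
print(all(b < 0x80 for b in encoded), restored)
```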

 

GB2312 character set encoding

GB2312 is the designation of a Chinese national standard that defines a character set and its encoding. Its full Chinese name translates as "Chinese character coded character set for information interchange". It was issued by the State Administration of Standards of the People's Republic of China and took effect on May 1, 1981. GB is the pinyin abbreviation of "国标" (guobiao, "national standard").

The GB2312 character set includes only simplified Chinese characters plus commonly used letters and symbols, and is mainly used in mainland China and Singapore. GB2312 contains 7445 characters: 6763 simplified Chinese characters and 682 letters and symbols.

GB2312 arranges its characters into 94 areas, numbered 01 to 94; each area holds 94 positions, also numbered 01 to 94. Every GB2312 character is identified by its unique area number and position number, together called its location code. For example, the Chinese character "啊" sits in area 16, position 01.

Location distribution table of GB2312 character set:

Area code   Character count   Character type
01          94                General symbols
02          72                Sequence numbers
03          94                Latin letters
04          83                Japanese hiragana
05          86                Japanese katakana
06          48                Greek letters
07          66                Russian letters
08          63                Hanyu Pinyin
09          76                Graphic symbols
10-15       -                 Spare area
16-55       3755              Level 1 Chinese characters, in pinyin order
56-87       3008              Level 2 Chinese characters, in radical and stroke order
88-94       -                 Spare area
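The counts in the table can be cross-checked against the totals quoted earlier (682 symbols, 6763 Chinese characters, 7445 in all). The sketch below simply re-enters the table values; the dictionary names are illustrative.

```python
# Character counts per area, copied from the table (areas 01-09).
symbol_areas = {1: 94, 2: 72, 3: 94, 4: 83, 5: 86, 6: 48, 7: 66, 8: 63, 9: 76}

# Chinese character counts for the two levels.
level1, level2 = 3755, 3008

symbols = sum(symbol_areas.values())
hanzi = level1 + level2
print(symbols, hanzi, symbols + hanzi)   # 682 6763 7445

# Capacity check: each area has 94 positions.
print((55 - 16 + 1) * 94 >= level1)      # areas 16-55 can hold level 1
print((87 - 56 + 1) * 94 == level2)      # areas 56-87 are filled exactly by level 2
```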

 

GB2312 encoding

The original GB2312 encoding uses two bytes for each character. The first byte, the "high byte", is the character's area number plus 32; the second byte, the "low byte", is its position number plus 32. For example, the Chinese character "啊" sits in area 16, position 01. Its high byte is 16 + 32 = 48 (0x30), its low byte is 01 + 32 = 33 (0x21), and the combined code is 0x3021.

The 32 is added so that both bytes skip the low range 0x00-0x20, which ASCII uses for control characters and the space.

Because the original GB2312 code overlaps with ASCII, the GB2312 code in common use today adds a further 128 to each of the two bytes of the original code. For example, the Chinese character "啊" sits in area 16, position 01. Its original code is 0x3021, and its code in common use is 0xB0A1.
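The two steps above (add 32, then add 128) can be sketched as a small helper. The function name location_to_bytes is hypothetical; the result is checked against Python's built-in gb2312 codec.

```python
def location_to_bytes(area: int, position: int) -> tuple[bytes, bytes]:
    """Return (original 7-bit code, modified 8-bit code) for a location code."""
    if not (1 <= area <= 94 and 1 <= position <= 94):
        raise ValueError("area and position must both be in 1..94")
    original = bytes([area + 32, position + 32])   # 0x21-0x7E range
    modified = bytes(b + 128 for b in original)    # 0xA1-0xFE range
    return original, modified

original, modified = location_to_bytes(16, 1)
print(original.hex(), modified.hex())   # 3021 b0a1
print(modified.decode("gb2312"))        # 啊
```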

Unless otherwise stated, GB2312 often refers to this modified code. 

Each GB2312 Chinese character consists of two bytes, each in the range 0xA1 to 0xFE. Each byte therefore has 94 possible values, which matches the 94 areas and 94 positions of the location code exactly.
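Going the other way, a two-byte GB2312 code can be mapped back to its location code by subtracting 0xA0 (that is, 32 + 128) from each byte. This is a sketch; bytes_to_location is a hypothetical helper name.

```python
def bytes_to_location(pair: bytes) -> tuple[int, int]:
    """Map a two-byte GB2312 code back to (area, position)."""
    if len(pair) != 2 or not all(0xA1 <= b <= 0xFE for b in pair):
        raise ValueError("expected two bytes in the range 0xA1-0xFE")
    return pair[0] - 0xA0, pair[1] - 0xA0

print(bytes_to_location("啊".encode("gb2312")))   # (16, 1)
print(0xFE - 0xA1 + 1)                            # 94 possible values per byte
```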

EUC-CN can be understood as another name for this GB2312 encoding; the two are identical.

The location code should be regarded as the definition of the character set: it specifies which characters are included and where each one sits. GB2312 and EUC-CN are the encodings that carry this character set in a real computing environment. HZ and ISO-2022-CN are two other encodings of the same location-code character set, and both use a 7-bit code space to represent Chinese characters. The relationship between the location code and the GB2312 encoding is a bit like that between Unicode and UTF-8.
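To illustrate one character set carried by two encodings, the sketch below encodes the same character with Python's built-in gb2312 and hz codecs. It assumes both codecs are available in the interpreter, as they are in standard CPython builds.

```python
text = "啊"

euc_cn = text.encode("gb2312")   # 8-bit form: both bytes have the high bit set
hz = text.encode("hz")           # 7-bit form: same code, wrapped in ASCII escapes

# The 7-bit code inside the HZ stream is the 8-bit code with the high bit cleared.
seven_bit_code = bytes(b & 0x7F for b in euc_cn)

print(euc_cn.hex(), hz)
print(seven_bit_code in hz, all(b < 0x80 for b in hz))
```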


GBK character set encoding

GBK is a superset of GB2312 and is fully backward compatible with it. GBK also includes all the CJK Chinese characters that were in the Basic Multilingual Plane of Unicode when GBK was defined. Like GB2312, GBK supports Greek letters, Japanese kana, Russian letters and other characters, but it does not support the phonetic (non-Chinese-character) script of Korean. GBK also adds symbols that GB2312 lacks, such as Chinese character radicals and vertical punctuation marks.

The overall encoding range of GBK is: high byte 0x81-0xFE, low byte 0x40-0x7E and 0x80-0xFE. A low byte of 0x7F is excluded.
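The byte ranges above can be written as a simple predicate. This is only a range check under the stated limits (it does not tell whether a code is actually assigned), and is_gbk_double_byte is a hypothetical name. The character "镕" is used as an example of a character that GBK can encode and GB2312 cannot.

```python
def is_gbk_double_byte(high: int, low: int) -> bool:
    """Range check only: high 0x81-0xFE, low 0x40-0x7E or 0x80-0xFE."""
    return 0x81 <= high <= 0xFE and (0x40 <= low <= 0x7E or 0x80 <= low <= 0xFE)

pair = "镕".encode("gbk")
print(is_gbk_double_byte(*pair))          # True
print(is_gbk_double_byte(0x81, 0x7F))     # False, 0x7F is excluded

try:
    "镕".encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
print(in_gb2312)                          # False

# Characters shared with GB2312 keep the same bytes in GBK.
print("啊".encode("gbk") == "啊".encode("gb2312"))   # True
```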

GBK characters whose low byte falls in 0x40-0x7E need special care: that byte occupies a position also used by ASCII, which causes trouble for some systems.

Some systems treat characters in 0x40-0x7E (such as "|") as special symbols and, when searching for them, do not check whether the byte is actually the low byte of a GBK character, which leads to wrong matches. This problem does not exist in an environment that supports only GB2312. Note that in a GBK environment a byte below 0x80 is not necessarily an ASCII symbol. It is also best to choose special symbols from the ASCII characters below 0x40: they can be located quickly, with no risk of matching the second half of a Chinese character. The same problem exists in Big5 encoding.

CP936 differs slightly from GBK. In most cases, CP936 can be used as an alias for GBK.
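The delimiter problem described above can be reproduced with Python's built-in gbk codec. The sketch looks up some GBK character whose low byte equals "|" (0x7C) rather than hard-coding one; the helper name is hypothetical.

```python
def first_gbk_char_with_low_byte(low: int) -> str:
    """Return the first GBK character (by high byte) whose low byte is `low`."""
    for high in range(0x81, 0xFF):
        try:
            ch = bytes([high, low]).decode("gbk")
        except UnicodeDecodeError:
            continue
        if len(ch) == 1:
            return ch
    raise LookupError("no GBK character found with that low byte")

ch = first_gbk_char_with_low_byte(ord("|"))
record = f"{ch}|x".encode("gbk")               # one delimiter, two fields

naive_fields = record.split(b"|")              # byte-level split cuts the character
safe_fields = record.decode("gbk").split("|")  # decode first, then split

print(len(naive_fields), len(safe_fields))     # 3 2
```

Decoding before splitting is the safer design here, because the split then operates on characters rather than on raw bytes.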


GB18030 character set encoding

GB18030 is backward compatible with GBK and GB2312. Compatibility here means not only that the same characters are included, but also that each of those characters keeps the same encoding. GB18030 contains all the characters in Unicode 3.1, including the scripts of China's ethnic minorities and the Korean characters that GBK does not support. In effect, it covers the scripts of most of the world's languages.

GBK and GB2312 are both fixed-width double-byte encodings. If the single-byte range that is compatible with ASCII is counted, they can also be viewed as variable-length encodings that mix single-byte and double-byte forms. GB18030 is a variable-length encoding with three forms: single-byte, double-byte and four-byte.

The single-byte range of GB18030 is 0x00-0x7F, which is identical to ASCII. The double-byte range is the same as GBK: high byte 0x81-0xFE, low byte 0x40-0x7E and 0x80-0xFE. In the four-byte form, the first and third bytes are in 0x81-0xFE and the second and fourth bytes are in 0x30-0x39.
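Because the second byte of a four-byte sequence (0x30-0x39) never overlaps the low byte of a double-byte sequence, the length of each character can be determined from its first two bytes. The following is a minimal sketch of that rule using the ranges above; the function names are hypothetical and the code does not validate the third and fourth bytes.

```python
def gb18030_seq_len(data: bytes, i: int = 0) -> int:
    """Length (1, 2 or 4) of the GB18030 sequence starting at data[i]."""
    first = data[i]
    if first <= 0x7F:
        return 1                                  # single byte, same as ASCII
    if not 0x81 <= first <= 0xFE or i + 1 >= len(data):
        raise ValueError("invalid or truncated lead byte")
    second = data[i + 1]
    if 0x30 <= second <= 0x39:
        return 4                                  # four-byte form
    if 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE:
        return 2                                  # double-byte form, as in GBK
    raise ValueError("invalid second byte")

def split_gb18030(data: bytes) -> list[bytes]:
    """Split a GB18030 byte string into per-character byte sequences."""
    parts, i = [], 0
    while i < len(data):
        n = gb18030_seq_len(data, i)
        parts.append(data[i:i + n])
        i += n
    return parts

parts = split_gb18030("A啊한".encode("gb18030"))
print([len(p) for p in parts])   # [1, 2, 4]
```

The Korean syllable "한" is not available in GBK, so it takes the four-byte form here.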

The CP936 code page in Windows uses 0x80 for the euro sign, whereas GB18030 leaves 0x80 unused and encodes the euro sign at a different position. This can be seen as a small flaw in GB18030's backward compatibility. Alternatively, 0x80 can be seen as a CP936 extension to GBK, with GB18030 being compatible only with GBK itself.


Unicode character set encoding

Having a different code page for each language makes software that must support several languages more complex. So a world standard called Unicode was developed. Unicode assigns a unique value to each character, regardless of platform, software or language. In other words, it lists all the characters used in the world and gives each one a unique value.

The original goal of Unicode was a 16-bit encoding that could map more than 65,000 characters. That turned out not to be enough: it cannot cover all historical scripts, and it does not solve the implementation headaches of transmitting text, especially in web-based applications. Existing software also needs a lot of work to handle 16-bit data.

Therefore Unicode, besides reserving some basic characters, defines three encoding forms: UTF-8, UTF-16 and UTF-32. As the name suggests, UTF-8 encodes characters as sequences of 8-bit units, using one or more bytes per character. Its biggest advantage is that it keeps the ASCII encoding as a subset. For example, "A" is encoded as 0x41 in both UTF-8 and ASCII.

UTF-16 and UTF-32 are the 16-bit and 32-bit encoding forms of Unicode, respectively. Because of that original 16-bit goal, the word "Unicode" is often used loosely to mean UTF-16. When discussing Unicode, it is important to be clear about which encoding form is meant.
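To make the three encoding forms concrete, this sketch encodes the same two characters with Python's built-in codecs. The big-endian variants (utf-16-be, utf-32-be) are chosen here only so that no byte order mark is added to the output.

```python
for ch in ("A", "啊"):
    print(
        f"U+{ord(ch):04X}",
        ch.encode("utf-8").hex(),
        ch.encode("utf-16-be").hex(),
        ch.encode("utf-32-be").hex(),
    )
# U+0041 41 0041 00000041
# U+554A e5958a 554a 0000554a

# ASCII text is byte-for-byte identical in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))   # True
```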


UTF-8 character set encoding

UTF-8 stands for Unicode Transformation Format, 8-bit. It permits a BOM but usually does not carry one. It is a multi-byte encoding designed for international text: it uses 8 bits (one byte) for English characters and 24 bits (three bytes) for most Chinese characters, and in general one to four bytes per character. UTF-8 can represent the characters used in every country, which makes it a highly portable international encoding. UTF-8 encoded text can be displayed by any browser that supports the UTF-8 character set, regardless of country. For example, a UTF-8 page in Chinese displays correctly in an English-language version of IE, and the user does not need to download IE's Chinese language support package.
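The byte counts mentioned above follow from how UTF-8 packs the bits of a code point. Below is a minimal sketch of that bit layout, checked against Python's built-in codec. The function name is hypothetical, and for brevity it does not reject surrogate code points.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point to UTF-8 by hand (surrogates not rejected)."""
    if cp < 0x80:                      # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:                  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

for ch in ("A", "啊"):
    encoded = utf8_encode_code_point(ord(ch))
    print(ch, len(encoded), encoded == ch.encode("utf-8"))
# A 1 True
# 啊 3 True
```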

GBK encodes each Chinese character in two bytes, while ASCII characters such as English letters remain single bytes. To tell the two apart, the highest bit of the first byte of a double-byte character is set to 1. GBK is a national encoding that covers the Chinese characters in common use. It is less portable than UTF-8, but Chinese text stored as UTF-8 takes up more space in a database than the same text stored as GBK.
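The size difference can be measured directly. This sketch compares the encoded length of a short mixed string under Python's built-in gbk and utf-8 codecs; the sample string is arbitrary.

```python
text = "GBK编码"   # three ASCII letters followed by two Chinese characters

gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

print(len(gbk_bytes))    # 7 = 3 * 1 + 2 * 2
print(len(utf8_bytes))   # 9 = 3 * 1 + 2 * 3

# The first byte of each double-byte GBK character has its high bit set.
print(all(b & 0x80 for b in (gbk_bytes[3], gbk_bytes[5])))   # True
```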


GBK, GB2312 and similar encodings must be converted to and from UTF-8 by way of Unicode:

GBK, GB2312 -> Unicode -> UTF-8

UTF-8 -> Unicode -> GBK, GB2312
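In Python this two-step route is explicit, since a str object holds Unicode code points: decoding produces Unicode and encoding produces the target bytes. The helper name transcode is hypothetical.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between encodings by passing through Unicode (str)."""
    return data.decode(source).encode(target)

gbk_bytes = "啊".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
round_trip = transcode(utf8_bytes, "utf-8", "gbk")

print(gbk_bytes.hex(), utf8_bytes.hex(), round_trip.hex())   # b0a1 e5958a b0a1
```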
 

Origin blog.csdn.net/my_angle2016/article/details/115249693