Design and evolution of character encoding (ASCII, Unicode, UTF-8, GB2312...)

Character encoding looks like a small problem and is often overlooked by developers, but it can easily lead to baffling bugs. This article summarizes some common knowledge about character encoding; I hope it is helpful.

1. Starting with ASCII

Any discussion of character encoding has to start with a brief history of ASCII. When computers were first invented, they were used to solve numerical computing problems. Later, people discovered that computers could do much more, such as text processing. But because computers only understand numbers, people had to tell the computer which number represents which character; for example, 65 represents the letter 'A', 66 represents the letter 'B', and so on. This character-number correspondence must be consistent across computers, otherwise the same sequence of numbers would display as different characters on different machines. Therefore, the American National Standards Institute (ANSI) developed a standard that specifies the set of commonly used characters and the number corresponding to each character. This is the ASCII character set, also known as the ASCII code.

Computers at that time generally used the 8-bit byte as the smallest unit of storage and processing. Moreover, few characters were in use: the 26 English letters in upper and lower case, the digits, and other common symbols added up to fewer than 100. ASCII could therefore be stored and processed efficiently using only 7 bits, with the most significant bit reserved for parity checking in some communication systems.

Note that a byte is the smallest unit a system can handle, and is not necessarily 8 bits; it is merely the de facto standard of modern computers to use 8 bits per byte. To avoid ambiguity, many technical specifications prefer the term octet over byte to emphasize an 8-bit unit. For ease of understanding, I will keep using the familiar term "byte".

ASCII character set

The ASCII character set consists of 95 printable characters (0x20-0x7E) and 33 control characters (0x00-0x1F, plus 0x7F). Printable characters are shown on output devices such as screens or paper, while control characters send special instructions to the computer. For example, 0x07 makes the computer beep; 0x00 is commonly used to mark the end of a string; and 0x0D and 0x0A instruct a printer's head to return to the beginning of the line (carriage return) and advance to the next line (line feed).

The encoding and decoding system of that era was very simple: a straightforward table lookup. To encode a character sequence into a binary stream and write it to a storage device, one simply looks up the byte corresponding to each character in the ASCII character set and writes that byte directly to the device. Decoding a binary stream works the same way in reverse.
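To make the lookup concrete, here is a minimal Python sketch (the snippet is my own illustration, not from the original article); in ASCII, encoding and decoding really are just number-character conversions:

```python
# Encode: look up each character's ASCII number and emit it as one byte.
text = "AB"
encoded = bytes(ord(ch) for ch in text)    # b'AB' -> 0x41 0x42 (65, 66)
assert encoded == text.encode("ascii")

# Decode: map each byte back to its character.
decoded = "".join(chr(b) for b in encoded)
assert decoded == text
```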

2. The proliferation of OEM character sets

As computing spread, people gradually discovered that the meager 128 characters of the ASCII character set could no longer meet their needs. A byte can represent 256 numbers, and ASCII only uses 0x00-0x7F, occupying the first 128; the remaining 128 numbers (0x80-0xFF) were sitting idle, so many people had the idea of putting them to use. The problem is that many people had this idea at the same time, and everyone had their own opinion about which characters those 128 numbers should correspond to. This led to a large variety of OEM character sets appearing on machines sold around the world.

The following table shows one of the OEM character sets shipped with the IBM PC. Its first 128 characters are basically the same as the ASCII character set (only "basically", because the first 32 control characters are in some cases interpreted by the IBM PC as printable characters), and the upper 128 positions are filled with accented characters used in European countries and characters used for drawing boxes and lines.

IBM-PC OEM character set

In fact, most OEM character sets are compatible with the ASCII character set: everyone's interpretation of the range 0x00-0x7F is basically the same, but interpretations of the upper half, 0x80-0xFF, differ. Sometimes the same character even corresponds to different bytes in different OEM character sets.

Different OEM character sets made it impossible to exchange documents across machines. For example, employee A sends a résumé to employee B, but employee B sees r?sum?: the é character corresponds to byte 0x82 in the OEM character set on A's machine, while B's machine uses a different OEM character set in which the byte 0x82 decodes to ?.
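You can reproduce this kind of mismatch in Python with two real OEM code pages; the pairing of cp437 (the IBM PC code page) and cp866 (a Russian OEM code page) is my own illustrative choice:

```python
# The same byte decodes to different characters under different OEM code pages.
raw = b"\x82"
print(raw.decode("cp437"))   # 'é' (IBM PC, Western Europe)
print(raw.decode("cp866"))   # 'В' (Cyrillic capital Ve)
```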

3. Multi-byte character sets (MBCS) and Chinese character sets

The character sets mentioned above are all single-byte encodings: one byte translates to one character. This may be fine for countries using Latin scripts, since extending into the 8th bit yields 256 characters, which is enough. But for Asian countries, 256 characters are nowhere near enough. Therefore, in order to use computers while maintaining compatibility with the ASCII character set, people in these countries invented multi-byte encodings; the corresponding character sets are called multi-byte character sets. For example, China used the Double Byte Character Set (DBCS).

For a single-byte character set, the code page needs only one code table, which records the characters represented by the 256 numbers. The program completes encoding and decoding with a simple table lookup.

A code page is the concrete realization of a character set's encoding; you can think of it as a "character-byte" mapping table, with translation done by table lookup. There is a more detailed description below.

For a multi-byte character set, the code page usually contains many code tables. So how does the program know which code table to use when decoding a binary stream? The answer is that it selects a code table based on the first byte.

Take GB2312, the most common Chinese character set: it covers all simplified Chinese characters plus some others. GBK (the K stands for "extension") adds traditional Chinese characters and other non-simplified characters on top of GB2312. (GB18030 is not a double-byte character set; we will return to it when discussing Unicode.) Characters in these two character sets are represented with 1-2 bytes. Windows uses code page 936 to encode and decode the GBK character set. When parsing the byte stream, if a byte's highest bit is 0, the first code table of code page 936 is used to decode it, exactly as in a single-byte character set.

GBK character set

When the high bit of the byte is 1, or more precisely, when the first byte lies in the range 0x81-0xFE, the code table is chosen according to that first byte. For example, when the first byte is 0x81, the following code table in code page 936 applies:

Code table in code page 936 for lead byte 0x81

According to this code table of code page 936, when the program encounters the consecutive bytes 0x81 0x40, it decodes them as the character "丂".
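Here is a quick Python sketch of this lead-byte logic, with the built-in gbk codec standing in for code page 936:

```python
# A byte below 0x80 is a complete single-byte (ASCII) character...
print(b"\x41".decode("gbk"))      # 'A'
# ...while a lead byte in 0x81-0xFE consumes the following byte as well.
print(b"\x81\x40".decode("gbk"))  # '丂'
```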

4. ANSI standards, national standards, ISO standards

The emergence of all these ASCII-derived character sets made document exchange very difficult, so various organizations started standardization efforts. The American organization ANSI developed the ANSI standard character encodings (note that what we usually call "ANSI encoding" today refers to the platform's default encoding, for example ISO-8859-1 on an English operating system and GBK on a Chinese one); the ISO developed various ISO standard character encodings; and individual countries devised national standard character sets, such as China's GBK, GB2312, and GB18030.

When an operating system ships, these standard character sets, plus some platform-specific ones, are usually pre-installed, so documents written in a standard character set are fairly portable. For example, a document written with the GB2312 character set can be displayed correctly on any machine in mainland China. We can also read documents from multiple countries and in different languages on one machine, provided the machine has the character sets those documents use installed.

5. The emergence of the Unicode character set

Although different character sets let us view documents in different languages on one machine, one problem remained unsolved: displaying all characters in a single document. Solving it requires a huge character set that all of humanity agrees on, and that is the Unicode character set.

The Unicode character set covers all characters currently used by humans; each character is uniformly numbered and assigned a unique character code (code point). The Unicode character set divides all characters into 17 planes according to frequency of use, with each plane containing 2^16 = 65536 code points.

Unicode character set planes

Plane 0, the BMP (Basic Multilingual Plane), covers essentially all characters in use in the world today. The other planes are used either for ancient scripts or are reserved for expansion. The Unicode characters we normally use are generally located in the BMP. A large amount of code space in the Unicode character set is still unused.
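In Python you can read a character's plane straight off its code point; shifting the code point right by 16 bits yields the plane number (a small illustration of my own):

```python
for ch in ("A", "中", "😀"):
    cp = ord(ch)                          # the character's Unicode code point
    print(f"U+{cp:04X} is in plane {cp >> 16}")
# 'A' (U+0041) and '中' (U+4E2D) are in plane 0, the BMP;
# the emoji (U+1F600) is in plane 1.
```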

6. Changes in the encoding model

Before Unicode, every character set was bound to a specific encoding scheme that tied characters directly to the final byte stream. For example, ASCII stipulates that its character set is encoded with 7 bits; the GB2312 and GBK character sets use at most 2 bytes to encode all characters and specify the byte order. Such an encoding system usually works by simple table lookup: through the code page, characters map directly to the byte stream on the storage device. For example:

Example of an encoding scheme bound to its character set

The disadvantage of this approach is that characters and byte streams are too tightly coupled, which limits the character set's ability to grow. Suppose Martians came to live on Earth one day: adding Martian script to an existing character set would be difficult or impossible, and would easily break the existing encoding rules.

Unicode was designed with this in mind: it separates the character set from the character encoding scheme.

Example of separation of encoding system and character set

In other words, although each character has a unique number in the Unicode character set (its character code, also called its Unicode code point), it is the specific character encoding that determines the final byte stream. For example, for the Unicode character "A", UTF-8 encoding produces the byte stream 0x41, while UTF-16 (big-endian) produces 0x00 0x41.
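One code point, several byte streams; a minimal Python demonstration:

```python
ch = "A"
print(hex(ord(ch)))            # 0x41 -- the code point is fixed...
print(ch.encode("utf-8"))      # b'A'      (0x41)
print(ch.encode("utf-16-be"))  # b'\x00A'  (0x00 0x41)
# ...but the byte stream depends on the chosen character encoding.
```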

7. Common Unicode encodings

UCS-2/UTF-16

If we were to design an encoding scheme for the BMP characters of the Unicode character set, how would we do it? Since the BMP contains 2^16 = 65536 code points, two bytes are enough to represent every character.

For example, the Unicode code point of "中" is 0x4E2D (01001110 00101101), so we can encode it as 01001110 00101101 (big-endian) or 00101101 01001110 (little-endian).

UCS-2 and UTF-16 both use 2 bytes to represent BMP characters, and their encodings of these characters are identical. The difference is that UCS-2 was originally designed with only the BMP in mind, so it uses a fixed 2-byte length and cannot represent characters in the other Unicode planes. UTF-16 removes this restriction and supports the full Unicode character set by using a variable-length encoding: at least 2 bytes, and 4 bytes (a surrogate pair) for characters outside the BMP. I won't go further into that here; if you are interested, see Wikipedia: UTF-16/UCS-2.
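A brief Python illustration of the fixed-versus-variable length point (the emoji is my example, not the article's):

```python
print("中".encode("utf-16-be"))  # b'N-'             -- 2 bytes: inside the BMP
print("😀".encode("utf-16-be"))  # b'\xd8=\xde\x00'  -- 4 bytes: a surrogate pair
```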

Windows has used UTF-16 since the NT era. Many popular programming platforms, such as .NET, Java, Qt, and Cocoa on the Mac, use UTF-16 as their basic character encoding; for example, a string in code corresponds to a UTF-16-encoded byte stream in memory.

UTF-8

UTF-8 is probably the most widely used Unicode encoding scheme. UCS-2/UTF-16 uses two bytes even for ASCII characters, which is relatively inefficient for storage and processing; worse, the high byte of a UTF-16-encoded ASCII character is always 0x00, which many C library functions treat as the end of the string, making text impossible to parse correctly. As a result, UTF-16 met with resistance in many Western countries when it was first introduced, which greatly hindered the adoption of Unicode. Later, some clever people invented UTF-8 to solve this problem.

The UTF-8 encoding scheme uses 1-4 bytes to encode characters, and the method is actually very simple.

UTF-8 variable length encoding

Note: in the figure above, x represents the low 8 bits of the code point and y represents the high 8 bits.

ASCII characters are encoded with a single byte, exactly the same as ASCII encoding, so all documents originally encoded in ASCII can be treated directly as UTF-8. Other characters use 2-4 bytes, where the number of leading 1 bits in the first byte tells you the total number of bytes in the sequence, and the top 2 bits of every remaining byte are always 10. For example, a first byte of 1110yyyy has three leading 1s, indicating a 3-byte sequence; it must be combined with the following 2 bytes, each starting with 10, to decode the character. For more on UTF-8, see Wikipedia: UTF-8.
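As a sanity check, here is a small Python sketch that encodes a 3-byte BMP code point by hand, following the 1110xxxx 10xxxxxx 10xxxxxx layout described above, and compares the result with the built-in codec:

```python
cp = ord("中")                        # 0x4E2D needs 3 bytes in UTF-8
manual = bytes([
    0b11100000 | (cp >> 12),          # lead byte: three 1s, then the top 4 bits
    0b10000000 | ((cp >> 6) & 0x3F),  # continuation byte: 10 + middle 6 bits
    0b10000000 | (cp & 0x3F),         # continuation byte: 10 + low 6 bits
])
assert manual == "中".encode("utf-8")  # b'\xe4\xb8\xad'
```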

GB18030

Any encoding that can map Unicode characters to a byte stream is a Unicode encoding. China's GB18030 encoding covers all Unicode characters, so it can be regarded as a Unicode encoding. Unlike UTF-8 or UTF-16, however, its byte stream is not derived from the code point by a fixed arithmetic rule; it can only be encoded by table lookup. For more on GB18030, see: GB18030.
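Python ships a gb18030 codec, so this claim is easy to check (the emoji example is again my own):

```python
print("中".encode("gb18030"))  # b'\xd6\xd0' -- the same 2 bytes as GBK
print("😀".encode("gb18030"))  # a 4-byte sequence: even non-BMP characters fit
assert "😀".encode("gb18030").decode("gb18030") == "😀"
```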

8. Frequently asked questions about Unicode

Is Unicode two bytes?

Unicode merely defines a huge, universal character set and assigns a unique number to every character; the byte stream actually stored depends on the character encoding scheme used. The recommended Unicode encodings are UTF-16 and UTF-8.

What does "UTF-8 with signature" mean?

"With signature" means the byte stream starts with a BOM (byte order mark). Much software tries to "intelligently" detect the character encoding of a byte stream; for efficiency, this detection usually samples the first few bytes to see whether they match the patterns of common encodings. Since UTF-8 and ASCII are identical for pure English text, they cannot be told apart this way; adding a BOM at the start of the byte stream tells the software that a Unicode encoding is in use, and makes detection very reliable. Note, however, that not all software handles the BOM correctly. PHP, for example, does not detect the BOM and parses it as ordinary bytes, so a PHP file saved as UTF-8 with a BOM may cause problems.
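Python's utf-8-sig codec makes the BOM visible; a minimal sketch:

```python
data = "hi".encode("utf-8-sig")
print(data)                      # b'\xef\xbb\xbfhi' -- the 3-byte BOM, then the text
print(data.decode("utf-8-sig"))  # 'hi' -- the signature-aware codec strips the BOM
print(data.decode("utf-8"))      # '\ufeffhi' -- a plain UTF-8 decoder keeps it
```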

What is the difference between Unicode encoding and earlier character set encodings?

In the early days, character encoding, character set, and code page all meant essentially the same thing; the GB2312 character set, GB2312 encoding, and code page 936 are in fact one and the same. Unicode is different: the Unicode character set only defines the collection of characters and their unique numbers, while "Unicode encoding" is an umbrella term for concrete encoding schemes such as UTF-8 and UCS-2/UTF-16, not a specific scheme itself. So when a character encoding is called for, you may write gb2312, codepage936, utf-8, or utf-16, but please do not write unicode (I have seen people put charset=unicode in a page's meta tag, which means nothing).

The problem of garbled text

Garbled text is character output that cannot be interpreted as any language, and it usually contains lots of ? characters. It is a problem every computer user runs into sooner or later, and the cause is always the same: the byte stream was decoded with the wrong character encoding. So whenever you think about any problem involving text display, stay alert at all times to one question: which character encoding is in use here? Only then can you correctly analyze and fix garbled text.
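The classic failure is easy to reproduce in Python: encode with one character encoding, decode with another (the sample string is my own):

```python
raw = "中文".encode("utf-8")               # b'\xe4\xb8\xad\xe6\x96\x87'
print(raw.decode("gbk", errors="replace"))  # gibberish: wrong encoding used to decode
print(raw.decode("utf-8"))                  # '中文' -- the right encoding recovers it
```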

Take the most common case, a garbled web page. If you are a website developer and run into this, check the following:

  • Whether the Content-Type response header returned by the server specifies a character encoding
  • Whether the page uses a META HTTP-EQUIV tag to specify the character encoding
  • Whether the character encoding the page file was saved with matches the character encoding the page declares

How a web page specifies its encoding

Note that if the wrong character encoding is used while parsing the web page, scripts and style sheets may break as well.

Not long ago I saw feedback on a technical forum that a WinForm program got garbled text when reading HTML content from the clipboard with the Clipboard class's GetData method. My guess is that this, too, comes from WinForm using the wrong character encoding when fetching the HTML text. The Windows clipboard stores HTML content as UTF-8, meaning the text you pass in is encoded and decoded as UTF-8. As long as both programs go through the Windows clipboard API, copy and paste should not produce garbled text, unless one side decodes the clipboard data with the wrong character encoding. (I ran a simple WinForm clipboard experiment and found that GetData uses the system default encoding rather than UTF-8.)

About the ? characters in garbled text: when a program parses a byte stream with a particular character encoding and encounters bytes it cannot decode, it substitutes ? for them. Once the text you end up with contains such characters and you no longer have the original byte stream, the original information is permanently lost: no character encoding can recover the correct text from those characters.
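A short Python sketch of this information loss:

```python
bad = b"\xff\xfe\x41"                        # not valid UTF-8
text = bad.decode("utf-8", errors="replace")
print(text)                                  # two U+FFFD replacement chars, then 'A'
print(text.encode("utf-8"))                  # b'\xef\xbf\xbd\xef\xbf\xbdA' -- the
                                             # original bytes 0xff 0xfe are gone for good
```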

9. Glossary of key terms

Character Set: literally, a collection of characters. For example, the ASCII character set defines 128 characters, while GB2312 defines 7445. Strictly speaking, a character set in a computer system is an ordered collection of numbered characters (the numbers need not be consecutive).

Character code (Code Point): the numeric identifier of each character in a character set. The ASCII character set uses the 128 consecutive numbers 0-127 to represent its 128 characters. The GBK character set numbers each character with a zone-position code: a 94 x 94 matrix is defined, whose rows are called "zones" and whose columns are called "positions", and all the national-standard Chinese characters are placed into this matrix, so that each character can be identified by a unique zone-position code. For example, "中" sits at position 48 of zone 54, so its character code is 5448. Unicode divides its characters into 17 planes numbered 0-16, each holding 2^16 = 65536 code points, so the total Unicode code space is 17 x 65536 = 1114112.

Unicode code point planes

Encoding is the process of converting characters into a byte stream.

Decoding is the process of parsing a byte stream back into characters.

Character Encoding: a concrete scheme for mapping the character codes of a character set to a byte stream. For example, ASCII character encoding stipulates that each character is encoded in the low 7 bits of a single byte: 'A' is number 65, i.e. 0x41, so it is written to the storage device as b'01000001'. GBK encoding adds an offset of 0xA0 (160) to both the zone number and the position number of a character's zone-position code (the offset exists mainly for compatibility with ASCII). For the character "中" above, the zone-position code is 5448, or 0x36 0x30 in hexadecimal; adding 0xA0 to each part gives 0xD6D0, which is the GBK encoding of "中".
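That arithmetic checks out in Python (zone and position taken from the paragraph above):

```python
zone, position = 54, 48                          # zone-position code of '中'
encoded = bytes([zone + 0xA0, position + 0xA0])  # add the 0xA0 offset to each part
print(encoded)                                   # b'\xd6\xd0'
assert encoded.decode("gbk") == "中"
```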

Code Page: a concrete form of a character encoding. In the early days there were relatively few characters, so a table was typically used to map characters directly to byte streams, with encoding and decoding done by lookup. Modern operating systems still follow this approach: Windows implements the GBK character set's encoding with code page 936, while the Mac uses the EUC-CN code page; the names differ, but a given Chinese character's encoding is the same.

Endianness ("big-endian" versus "little-endian") takes its name from Gulliver's Travels: an egg is bigger at one end than the other, and the people of Lilliput quarreled over which end should be cracked open. Similarly, in the computer world, when a multi-byte word (a data type represented by multiple bytes) is transmitted, it matters whether the high-order byte goes first (big-endian) or the low-order byte goes first (little-endian); that, as I understand it, is how endianness entered computing. Whether writing a file or sending data over a network, the operation writes to a stream device from low address to high address (which matches human habit). For a multi-byte word, writing the high-order byte first is called big-endian mode, and the opposite is little-endian mode. In other words, big-endian writes bytes in order of decreasing significance as the stream address increases, while little-endian writes them in order of increasing significance. Network protocols generally transmit in big-endian mode, while Windows uses UTF-16 in little-endian mode.
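A final Python sketch of the two byte orders, reusing the code point of "中":

```python
cp = 0x4E2D                      # code point of '中'
print(cp.to_bytes(2, "big"))     # b'N-' -- 0x4E 0x2D, high byte first
print(cp.to_bytes(2, "little"))  # b'-N' -- 0x2D 0x4E, low byte first
print("中".encode("utf-16-le"))  # matches the little-endian byte order
```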

—— Kevin Yang
