Encodings in Network Transmission: A Look at UTF-8

Why is encoding an important topic? Because text is still the main way we interact with computers. I suspect most programmers have been tormented by character and encoding problems at some point. Characters typed on a keyboard reach an editor, the editor stores them on disk, and every programming language has its own way of representing and processing them. Character sets and character encodings are everywhere in computing, so a thorough review is worthwhile. In this article I share my knowledge and understanding of characters and character encodings, and I hope you find it useful.

Character Sets and Character Encodings

Character set: a collection of characters. The best known is the Unicode character set, which covers most of the world's characters and assigns each one a number that identifies the character inside the computer.

Character set encoding: usually just called an encoding. Because characters have to be stored and transmitted, an encoding specifies how many bytes a character occupies and which values are written to disk or memory. The encodings defined for the Unicode character set include UTF-8, UTF-16, and UTF-32.
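As a small illustration of the two definitions (a sketch in Python, which the original article does not use), the character number and the encoded bytes can be inspected separately. Here `ord` returns the Unicode number and `str.encode` applies one specific encoding; the variable names are only for illustration.

```python
# The character number (code point) is independent of any encoding.
ch = "\u4e2d"  # the Chinese character 中
code_point = ord(ch)

# The same character number yields different byte sequences per encoding.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
utf32 = ch.encode("utf-32-be")  # big-endian, no byte order mark

print(hex(code_point))  # 0x4e2d
print(utf8.hex())       # e4b8ad
print(utf16.hex())      # 4e2d
print(utf32.hex())      # 00004e2d
```

One number, three different stored values: that is the distinction between a character set and its encodings.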

Besides Unicode, the character sets we most often meet include ASCII and GB2312. GB2312 contains only common Chinese characters, digits, and some other symbols, so rarer Chinese characters cannot be represented in a GB2312 environment; those characters were added in the later GBK. Other countries and regions, such as Japan, South Korea, and Europe, also have their own character sets. What they have in common is that they extend ASCII and remain compatible with it. Figure 1 illustrates the relationship between them:
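The GB2312/GBK relationship can be checked with a short Python sketch. It assumes the standard `gb2312` and `gbk` codecs that ship with CPython and uses 镕 (U+9555), a character known to be missing from GB2312; `can_encode` is a hypothetical helper written for this example.

```python
# 镕 (U+9555) is a well-known character absent from GB2312 but present in GBK.
rare = "\u9555"

def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

in_gb2312 = can_encode(rare, "gb2312")
in_gbk = can_encode(rare, "gbk")
print(in_gb2312, in_gbk)

# ASCII compatibility: an ASCII letter keeps the same single byte everywhere.
ascii_compatible = (
    "a".encode("ascii")
    == "a".encode("gb2312")
    == "a".encode("gbk")
    == "a".encode("utf-8")
)
print(ascii_compatible)
```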

[Figure 1]

At this point some readers may ask: ASCII is obviously an encoding, so why is it called a character set here? The reason is that early in the history of computing there was no strict distinction between a character set and a character set encoding. ASCII both numbers its characters and defines how they are stored, so it can be called a character set and an encoding at the same time. The original ASCII range is 0 to 127, so a character number fits directly in one byte, and the number and its stored code correspond one to one. Similarly, every GB2312 character number fits in the range 0 to 65535 and is stored in two bytes, again in one-to-one correspondence. With a single possible encoding form, there was no need to separate the character set from its encoding. Because historical models such as ASCII and GB2312 never drew that line, many people started to say "Unicode encoding" when Unicode appeared. That is not quite accurate. The accurate term is the Unicode character set.

Unicode defines most of the world's characters and numbers each one, so a specific character can be located by its number. The Unicode character set is huge: it currently holds more than 100,000 characters, with numbers that range from 0 up to 0x10FFFF, so the question of how many bytes to use per character cannot be settled as simply as in ASCII or GB2312. English character numbers within 0 to 127 fit in one byte. The numbers of Chinese characters, such as 6751 (26449), 4E2D (20013), 5C11 (23569), and 5E74 (24180) for the four characters of my handle 村中少年, need at least two bytes. For storage efficiency, the designers created the variable-length UTF-8 encoding; for simplicity of design, there are fixed-length encodings such as UTF-32. So the Unicode character set has several encodings, and the same character number yields different values under different encodings. That is why character set and character encoding must be kept apart in Unicode: Unicode is the character set (we speak of Unicode numbers), and UTF-8 is one of the encodings for that character set.
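The numbers quoted above, and the storage cost of a variable-length versus a fixed-length encoding, can be verified with a few lines of Python. This is only a sketch; `name` is an illustrative variable holding the four characters as escapes.

```python
name = "\u6751\u4e2d\u5c11\u5e74"  # 村中少年

for ch in name:
    n = ord(ch)  # the Unicode number of the character
    utf8_len = len(ch.encode("utf-8"))
    utf32_len = len(ch.encode("utf-32-be"))
    print(f"{n:04X} ({n}): UTF-8 {utf8_len} bytes, UTF-32 {utf32_len} bytes")

# An ASCII letter needs one byte in UTF-8 but still four in UTF-32.
print(len("a".encode("utf-8")), len("a".encode("utf-32-be")))  # 1 4
```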

UTF-8 Encoding

For readers who are not familiar with UTF-8, here is a brief account of its encoding rules, that is, how a Unicode number is converted to UTF-8; without it the rest would remain only half understood. UTF-8 is also the most widely used encoding today, so it is worth learning. Figure 2 explains the rules:

[Figure 2]

The left side gives the range of the Unicode number and the right side gives the corresponding UTF-8 format. The black bits are fixed; the red bits are filled with the value of the specific Unicode number. The general rule is to pick the UTF-8 length and format from the range the Unicode number falls in, convert the number to binary, copy its bits from right to left into the red positions (also from right to left), and fill any remaining red positions with 0.

Two examples illustrate the table above:

The Unicode value of the character a is 0x0061, which is 1100001 in binary. It falls in the first row of the table and needs one byte. Filling the bits in from right to left gives 01100001, that is, 0x61.
The Unicode value of the Chinese character 中 is 0x4E2D, which is 100111000101101 in binary. It falls in the third row of the table and needs three bytes. Filling the bits in from right to left gives 111001001011100010101101, that is, 0xE4B8AD.
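The filling procedure can be sketched in Python. This assumes Figure 2 shows the standard UTF-8 layout (one to four bytes, with the fixed prefixes noted in the comments); `utf8_encode` is a hypothetical helper written for illustration, and in practice `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode number into UTF-8 by filling the fixed bit patterns."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode number")
    if code_point <= 0x7F:       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:      # 110xxxxx 10xxxxxx
        lead, extra = 0b11000000, 1
    elif code_point <= 0xFFFF:   # 1110xxxx 10xxxxxx 10xxxxxx
        lead, extra = 0b11100000, 2
    else:                        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        lead, extra = 0b11110000, 3

    tail = []
    for _ in range(extra):
        # Take the lowest six bits, right to left, for each continuation byte.
        tail.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # The bits that remain go into the leading byte after its fixed prefix.
    return bytes([lead | code_point] + tail[::-1])

print(utf8_encode(0x61).hex())    # 61
print(utf8_encode(0x4E2D).hex())  # e4b8ad
print(utf8_encode(0x6751).hex())  # e69d91
```

The first two results match the worked examples for a and 中; the third is the character 村.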

You can practice on the characters 村中少年 and check whether your results match what Notepad++ stores, as seen in the HxD views in this article. I also found that the Chinese-to-UTF-8 conversion tool provided by Webmaster's Home does not produce a real UTF-8 encoded value.

About Display and Storage

Obviously, if you could keep every GBK and Unicode character number in your head, you would not need an input method at all: you could type everything with the Alt-key method from the small game in this article. Sadly, very few people can do that. An input method exists because character numbers are hard to remember while pinyin is easy. It acts as an intermediate layer that converts pinyin into a specific character number. Showing the number itself would not be friendly either, so the computer displays the graphic that corresponds to the number, usually a dot matrix (bitmap) or vector outline of the character, which is what a font in Word provides. To display a character, the system looks up the glyph by number and draws it. When we enter Unicode number 26449 in an English Linux terminal, the system finds that 26449 is the character 村 and draws it in the terminal from that character's dot-matrix data.

Much of the work in the history of computing has gone into human-computer interaction, and it has produced many intermediate layers of this kind. DNS is one: IP addresses are hard to remember and domain names are easy, so DNS converts between them. Most readers will know this from daily work and study.

For example, if you type the character 村 in an English Linux terminal and press Enter, a message like the following may appear:

-bash: $'\346\235\221': command not found

The reason is explained as follows:

1. \346\235\221 is octal notation; in hexadecimal it is E69D91, which is the UTF-8 encoding of the character 村. Note that the Unicode number of 村 is 0x6751. This shows the difference between an encoding and a character number.

2. When the character is entered through an input method, or with Alt plus its Unicode number, the terminal first displays the character (that is, the glyph for that number), here 村. Pressing Enter, however, sends it to the system as a command. Transmission and storage require encoded characters, so at that point the terminal converts the character to its encoded value. My terminal defaults to UTF-8, so 村 (0x6751) is encoded as E69D91, that is, \346\235\221. Since the system has no command named 村, the error above is printed.
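The octal-to-hexadecimal conversion in point 1 can be checked in Python, whose bytes literals accept the same octal escapes that bash printed. This is a small sketch; `raw` and `ch` are illustrative names.

```python
# The three bytes from the bash message, written as octal escapes.
raw = b"\346\235\221"

print(raw.hex())              # e69d91
print([oct(b) for b in raw])  # ['0o346', '0o235', '0o221']

# Decoding the bytes as UTF-8 recovers the character, whose number is 0x6751.
ch = raw.decode("utf-8")
print(hex(ord(ch)))           # 0x6751
```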

This example shows that storage and display are separate in a computer. What is displayed can be thought of as the character identified by its number, while for storage I can choose whichever encoding I need.

In the example above the terminal uses UTF-8. Likewise, in Notepad++ and similar editors I can save the four characters 村中少年 either as UTF-8 or as GBK. Viewing the two saved files with HxD gives the following hexadecimal results:

GBK

[image: HxD view of the GBK file]

UTF-8

[image: HxD view of the UTF-8 file]

Although the stored values differ, both files display the characters 村中少年. When Notepad++ opens a file for display, it performs the reverse process (decoding), from encoded values back to character numbers. Whatever encoding a character is stored in, what it represents at display time is the character itself, and that does not change with the encoded value. The display stage therefore needs a fixed value to stand for the character, namely its number in the character set. When a character is stored on disk, held in memory, or transmitted, we have to decide how many bytes represent it; space and length limits come into play, and that is the domain of encoding. Different encodings may store the same character in different lengths. For example, a Chinese character usually takes three bytes in UTF-8 but two bytes in GB2312.
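The same comparison can be reproduced without a hex editor. The Python sketch below encodes the four characters both ways and decodes them again; it does not write a file, so editor-specific details such as a byte order mark are not modeled.

```python
name = "\u6751\u4e2d\u5c11\u5e74"  # 村中少年

gbk_bytes = name.encode("gbk")
utf8_bytes = name.encode("utf-8")

print(gbk_bytes.hex())   # b4e5d6d0c9d9c4ea
print(utf8_bytes.hex())  # e69d91e4b8ade5b091e5b9b4

# Decoding with the matching encoding gives back the same character numbers.
assert gbk_bytes.decode("gbk") == utf8_bytes.decode("utf-8") == name
```

Eight bytes in GBK and twelve in UTF-8, yet both decode to the same four characters.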

Therefore, the following conclusions can be drawn:

In the display stage, the operating system uses the character's number in the character set to identify the character uniquely. In the storage and transmission stage, the character number is converted into an encoded value according to the configured encoding.

A Small Game

There used to be such a small game: type your own Chinese name in Word without using the input method.

If you don't know how to do it, congratulations: you will have learned a new skill by the end of this article. If you have done it before, you may be interested in the principle behind it.

Take 村中少年 ("village youth"), my handle on my CSDN blog, as an example.

1. The numbers of the four characters 村中少年 in the GBK and GB2312 character sets are B4E5 (46309), D6D0 (54992), C9D9 (51673), and C4EA (50410). The values in parentheses are decimal and the values outside are hexadecimal. For common Chinese characters the GBK and GB2312 numbers are the same; GB2312 can be regarded as a subset of GBK.

2. The numbers corresponding to these four characters in the Unicode character set are 6751 (26449), 4E2D (20013), 5C11 (23569), and 5E74 (24180).

3. On Chinese Windows, enter chcp in a DOS window, as shown in Figure 3:

[Figure 3]

In Figure 3, the code page is an alias for the character set, and 936 refers to GBK. So my Chinese Windows uses the GBK character set, which can equally be called the GBK encoding; the difference between the two terms is discussed in the section Character Sets and Character Encodings.

4. On Chinese Windows, hold down the Alt key and type on the numeric keypad the decimal values from step 1: 46309, 54992, 51673, and 50410 in sequence. The four characters 村中少年 then appear in a DOS window, a Word document, Notepad++, or other terminals and editors. One limitation: the digits must be typed on the numeric keypad, so the method does not work on laptops without one.

5. If your system uses a character set other than GBK, for example Unicode, which English Linux systems usually use, the combination is the Alt key plus the Unicode numbers from step 2: 26449, 20013, 23569, and 24180 in sequence. This also prints 村中少年. Interested readers can try it for themselves.
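To play the game with your own name, you need the decimal values for steps 1 and 2. The following Python sketch derives them for 村中少年; replace `name` with other characters to get their values. It assumes every character has a two-byte GBK code, as these four do.

```python
name = "\u6751\u4e2d\u5c11\u5e74"  # 村中少年

# Step 1: the two-byte GBK value of each character, read as one big-endian integer.
gbk_codes = [int.from_bytes(ch.encode("gbk"), "big") for ch in name]
# Step 2: the Unicode number of each character.
unicode_codes = [ord(ch) for ch in name]

print(gbk_codes)      # [46309, 54992, 51673, 50410]
print(unicode_codes)  # [26449, 20013, 23569, 24180]
```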

This is an original article by 村中少年 on CSDN and may not be reproduced without permission.

Origin blog.csdn.net/javajiawei/article/details/131202078