Introduction to programming: character encoding and garbled text

  - "Why is a request received by the server, or a text file we open, sometimes garbled?"

  - "Because the encoding is wrong."

  - "What is encoding, really? Why does text come out garbled when the encoding is wrong? How does a piece of text sent over the network finally get displayed to the user? What encoding does a Java String use by default?"

  - "......"

  I believe many of you have run into garbled text and fixed it, but in my experience as an interviewer, very few people know why it happens as well as what happens. So I am sharing the explanation here, in case you are asked about it in an interview :-).

  Computers have been around for a long time, and "don't reinvent the wheel" is a well-worn saying. We are used to writing Print("A") and seeing the character A appear on the screen, as if that were only natural. Fewer and fewer people think about what support this requires and what actually happens along the way. So let's work out how we would implement a "Notepad" program that displays characters, in an era when the wheels were not yet so complete.

  1. Storage of text

  .txt files are very common. On the Windows desktop, right-click to create a new "Text Document", type A into it, and save it: the desktop now holds a text document containing A. Double-click to open it and we see the saved A.

  

  School courses tell us that everything in a computer is stored as binary data, 0s and 1s, so an "A" as such cannot be stored. What, then, is actually on the disk? Let's open the text document with a different kind of tool, a hexadecimal editor; HxD is used here to display the file's raw bytes.

  

  The file actually contains a single byte, 0x41, and the corresponding text is A. Now we type a byte 0x42 after the byte 0x41, and the corresponding text shows B.

  

  After saving, open this two-byte file in Notepad and it displays AB. In other words, we entered the character B by typing the byte 0x42.
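
  The experiment above can be reproduced without a hex editor. The following is a minimal sketch, assuming Python 3; the temporary file name is hypothetical and stands in for the text document on the desktop. It writes the two raw bytes the way the hex editor did, then reads them back once as text (what Notepad does) and once as raw bytes (what HxD shows).

```python
import os
import tempfile

# Hypothetical path standing in for the text document on the desktop.
path = os.path.join(tempfile.mkdtemp(), "A.txt")

# Hex-editor view of writing: raw bytes, with no notion of characters.
with open(path, "wb") as f:
    f.write(bytes([0x41, 0x42]))

# Notepad view of reading: decode the bytes into characters.
with open(path, "r", encoding="ascii") as f:
    text = f.read()

# Hex-editor view of reading: the same bytes, shown as hexadecimal.
with open(path, "rb") as f:
    raw = f.read()
hex_view = " ".join(f"{b:02x}" for b in raw)

print(text)      # AB
print(hex_view)  # 41 42
```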

  

  

  2. The character set

  The experiment above shows that when I type the character A in Notepad and save, what is actually stored in the file on disk is the number 0x41 (binary 0100 0001), and that if I append 0x42 directly in the hex editor and then open the file in Notepad, it displays B. Notepad must therefore perform a conversion. The rule might be: when the character A is typed, store it as 0x41; conversely, when reading, display A for 0x41 and B for 0x42. In other words, storing is an encoding step and displaying is a decoding step. There are 26 letters, or 52 counting upper and lower case, plus digits, arithmetic operators, punctuation and other symbols, so this correspondence has to be defined in full. Once the commonly used characters have all been assigned, the result is a table: the well-known ASCII character set.
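
  The conversion rule described above can be sketched in a few lines of Python 3. The variable names are illustrative only; `ord`, `chr`, `str.encode` and `bytes.decode` are standard built-ins. The loop prints a small slice of the ASCII table as code, binary and character.

```python
# Encoding direction: character -> number -> stored byte.
code_of_a = ord("A")
stored = "A".encode("ascii")

# Decoding direction: stored byte -> number -> character.
shown = bytes([0x42]).decode("ascii")

# A small slice of the table: code, binary form, character.
for code in range(0x41, 0x47):
    print(f"0x{code:02X}  {code:08b}  {chr(code)}")

# The whole 7-bit ASCII table as a mapping from code to character.
ascii_table = {code: chr(code) for code in range(128)}
```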

  

  This table defines the correspondence between characters and the binary data stored in the computer. A Notepad program is therefore, in essence, a program that converts between binary data and visible characters: typed characters are stored as binary, and binary that is read back is converted to characters for display. The seemingly simple Notepad turns out to be a little more complicated than one might imagine. But a character set is abstract: defining the code for a character is not, by itself, enough to put an A on the screen. Next, we need to consider what concrete work is required to display the character A.

  3. Font library (font)

  The A displayed on the screen is really a picture, and displaying A means drawing a shape that looks like A on the screen. There are many ways to write an A, and all of them are still A. So we need to define exactly what to draw when we want to display the character A, and of course define B, C, D and so on in the same way.

  I could hard-code these definitions in my "Notepad" program, but then they would be private: a file saved by my program might not look the same in other text editors, because other programs might draw the A differently. A better idea is to make the definitions public and specify a standard format that everyone can parse, collecting the character shapes into a file so that display is consistent everywhere. This is called a font file.

  In practice, both character set definitions and font file formats are public standards, so that programs on a system map the same file content to the same characters, and can use the same font if they want a consistent look.

  A character shape definition (a glyph) clearly contains at least two elements: the graphic of the character and the code that indexes it. When a program needs to draw the character A, it can use the code 0x41 to look up the corresponding glyph in the font file and then call a drawing API to "draw" the A. The drawing API can be thought of as a set of fairly low-level drawing routines, along the lines of "set the pixel in row 1, column 1 to black", which ultimately drive the display hardware.

  A pixel (bitmap) font is one intuitive way to define these shapes.
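
  The lookup-then-draw step can be illustrated with a toy example. This is only a sketch under stated assumptions: `FONT` is a hypothetical two-glyph 5x7 bitmap font invented for illustration (not any real font format), `set_pixel` stands in for a low-level drawing API, and the "screen" is a grid of characters.

```python
# Hypothetical 5x7 bitmap font: character code -> rows of pixels ('#' = black).
FONT = {
    0x41: ["..#..", ".#.#.", "#...#", "#...#", "#####", "#...#", "#...#"],
    0x42: ["####.", "#...#", "#...#", "####.", "#...#", "#...#", "####."],
}

def set_pixel(screen, row, col):
    """Stand-in for a low-level drawing call: turn one pixel black."""
    screen[row][col] = "#"

def draw_char(screen, code, top, left):
    """Look up the glyph by its code and draw it pixel by pixel."""
    glyph = FONT[code]
    for r, line in enumerate(glyph):
        for c, cell in enumerate(line):
            if cell == "#":
                set_pixel(screen, top + r, left + c)

def render(data):
    """Draw each stored byte as a glyph, left to right, 6 columns per glyph."""
    screen = [[" "] * (6 * len(data)) for _ in range(7)]
    for i, code in enumerate(data):
        draw_char(screen, code, 0, 6 * i)
    return ["".join(row).rstrip() for row in screen]

lines = render(bytes([0x41, 0x42]))
print("\n".join(lines))
```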

  

  The glyph is drawn onto the corresponding screen pixels according to its definition, forming the text. The problem, of course, is that scaling makes it blurry or blocky. One fix is to add font size to the definition and provide a separate pixel matrix for each size to improve the result. A more advanced approach also comes to mind: describe the character shapes mathematically, giving a so-called vector font. The advantage is that it can be scaled without limit; the disadvantage is that the drawing logic is more complicated and may consume more resources.

  4. How garbled characters arise

  With the background above, it is easy to see how garbled characters come about. In essence, a program such as "Notepad" fails to convert the binary content of the file into the correct characters to draw.

  Things would be much simpler if ASCII were the only character set in the world. But the world has many languages and scripts: besides the English letters there are Chinese, Greek, Japanese and more, and these characters also need to be stored and displayed by computers. In China, for example, Chinese must be displayed. So many character sets came into being, usually defined per country or region, each looking after its own needs. A definition like the following might then exist.
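
  The difference between the two approaches can be sketched as follows. Both the bitmap `GLYPH_A` and the stroke list `OUTLINE_A` are hypothetical data made up for illustration. The bitmap is enlarged by repeating pixels (nearest-neighbour scaling, one simple scaling method among several), so edges become blocky, while the vector description is enlarged by multiplying coordinates and stays exact at any size.

```python
# Hypothetical 5x7 bitmap glyph for "A".
GLYPH_A = ["..#..", ".#.#.", "#...#", "#...#", "#####", "#...#", "#...#"]

def scale_bitmap(glyph, factor):
    """Nearest-neighbour scaling: repeat every pixel `factor` times each way."""
    out = []
    for line in glyph:
        wide = "".join(ch * factor for ch in line)
        out.extend([wide] * factor)
    return out

# Hypothetical vector glyph: three strokes with end points in a unit square.
OUTLINE_A = [
    ((0.0, 0.0), (0.5, 1.0)),    # left leg
    ((0.5, 1.0), (1.0, 0.0)),    # right leg
    ((0.25, 0.5), (0.75, 0.5)),  # crossbar
]

def scale_outline(outline, size):
    """Vector scaling: multiply every coordinate by the target size."""
    return [((x1 * size, y1 * size), (x2 * size, y2 * size))
            for (x1, y1), (x2, y2) in outline]

big_bitmap = scale_bitmap(GLYPH_A, 2)
big_outline = scale_outline(OUTLINE_A, 100)
print("\n".join(big_bitmap))
print(big_outline)
```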

  For example, one such character encoding (GBK) specifies:

  The two bytes 0xD6 0xD0 correspond to the character "中"

  Suppose we use a powerful "Chinese Notepad" that supports Chinese editing, type "中" and save. The stored content is 0xD6 0xD0. Now we read and display the file with the "simple Notepad" from earlier, which we assume supports only the ASCII character set. It processes the file byte by byte: it reads the first byte, 0xD6, looks it up in the ASCII table and displays the result, then does the same for 0xD0. The result is the so-called "garbled text". The bytes may map to the wrong characters, to invisible characters, or to nothing at all; what gets displayed in that last case depends on how the program chooses to handle it.
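
  The scenario above can be sketched in Python 3, whose standard library includes a `gbk` codec. `simple_notepad` is a hypothetical stand-in for the ASCII-only program; showing "?" for bytes with no ASCII entry is just one possible choice. Decoding the same bytes as Latin-1 (a different single-byte character set, used here only as an example of a "wrong" table) yields wrong but visible characters.

```python
# What the "Chinese Notepad" writes to disk.
stored = "中".encode("gbk")

def simple_notepad(data):
    """ASCII-only decoding, byte by byte; bytes above 0x7F have no entry."""
    return "".join(chr(b) if b < 0x80 else "?" for b in data)

ascii_view = simple_notepad(stored)      # no such character: placeholder shown
latin1_view = stored.decode("latin-1")   # wrong table: wrong characters shown
gbk_view = stored.decode("gbk")          # matching table: the original text

print(stored.hex(), ascii_view, latin1_view, gbk_view)
```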

  

  Thinking question: why can the real Windows Notepad handle Chinese, and how does it know, when it opens 0xD6 0xD0, that the file was saved in GBK? :-)

  Of course, as the saying goes, what has long been divided must eventually unite. In time Unicode emerged: a single scheme that aims to encode the symbols of all the world's languages, which avoids the situation where each country's or region's encoding goes its own way.

  5. Summary
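
  As a small illustration, Unicode gives every character a single number (its code point), and encodings such as UTF-8 and UTF-16 then turn that number into bytes. The sketch below, assuming Python 3, prints the code point and two byte encodings for "A" and "中"; the variable names are illustrative.

```python
# One numbering for every character: the Unicode code point.
code_points = {ch: ord(ch) for ch in "A中"}

# Code points are turned into bytes by an encoding such as UTF-8 or UTF-16.
utf8_bytes = {ch: ch.encode("utf-8") for ch in "A中"}
utf16_bytes = {ch: ch.encode("utf-16-be") for ch in "A中"}

for ch in "A中":
    print(f"U+{code_points[ch]:04X}", utf8_bytes[ch].hex(), utf16_bytes[ch].hex())

# Mixed-language text survives a round trip through one encoding.
mixed = "A中Ω"
round_trip = mixed.encode("utf-8").decode("utf-8")
```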

  •   Computer files are stored, and data is transmitted over the network, as binary data streams.
  •   Garbled characters appear when the character encoding used to store the input (encoding) differs from the one used to read and display it (decoding).
  •   To avoid garbled characters, the whole character -> binary -> character conversion must use the same character encoding.

  Thinking question: how are the Strings used everywhere in Java stored in memory? What encoding do they use?

  For a related explanation, see "Java String Encoding for Interviews" by Uncle Pot on 博客园.
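
  The three points above can be condensed into one small sketch, assuming Python 3. `transmit` is a hypothetical helper that models "store or send, then read": it encodes with one encoding and decodes with another, substituting a replacement character for bytes that cannot be decoded.

```python
def transmit(text, write_encoding, read_encoding):
    """Model character -> binary -> character with two possibly different encodings."""
    data = text.encode(write_encoding)                    # character -> binary
    return data.decode(read_encoding, errors="replace")   # binary -> character

same = transmit("中", "utf-8", "utf-8")     # encodings agree
mismatch = transmit("中", "utf-8", "gbk")   # encodings disagree

print(repr(same), repr(mismatch))
```

  When the two encodings agree, the text comes back unchanged; when they differ, the bytes are identical but the characters shown are not.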


Origin blog.csdn.net/qq_25148525/article/details/124472400