The relationship between character encodings and character sets in computers

What is a character set
Before introducing character sets, let us first understand why they exist. What we see on a computer screen is readable text, but what is actually stored on the computer's storage medium is a binary bit stream. The conversion between the two needs a unified standard; otherwise, a document copied to the boss's computer from our USB drive would come out garbled, and a file a friend sends us over QQ would be garbled when opened locally. (PS: the English term for such garbled text is "mojibake".) So, to standardize this conversion, various character set standards appeared. Simply put, a character set specifies how a given character is stored as a binary number (encoding) and which character a given string of binary values represents (decoding).
So why are there so many character set standards? This question is actually easy to answer. Ask yourself: why can't the power plug we bring from home be used in the UK? Why do monitors have so many different interfaces at once: DVI, VGA, HDMI, and DP? Many norms and standards did not anticipate, when they were first formulated, that they would need to be universally applicable, or the organizations behind them deliberately set out to differ from existing standards in their own interests. The result is a pile of standards that serve the same purpose but are incompatible with one another.
Having said all that, let's look at a practical example. Below are the hexadecimal encodings of the Chinese character 屌 (U+5C4C) under several encodings. Feeling a little distressed yet?

Character set    Hexadecimal encoding
UTF-8            0xE5B18C
UTF-16           0x5C4C
GBK              0x8CC5
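
If you want to reproduce the table above yourself, here is a minimal sketch in Python (assuming a Python 3 interpreter with its standard codecs; 'utf-16-be' is used so that no byte-order mark is prepended):

    # Encode the character U+5C4C under three encodings and print the hex bytes.
    ch = "\u5c4c"  # the character shown in the table above
    for codec in ("utf-8", "utf-16-be", "gbk"):
        print(codec, ch.encode(codec).hex())
    # Expected output: utf-8 -> e5b18c, utf-16-be -> 5c4c, gbk -> the value in the table row.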

What is a character encoding?
A character set is just the name of a set of rules. In real-life terms, a character set is like the name of a language: English, Chinese, Japanese. How to use English to express your meaning is what English vocabulary and grammar must describe in detail. For a character set, three key elements are needed to encode and decode a character correctly: the character repertoire, the coded character set, and the character encoding form. The character repertoire is, in effect, a database of all readable or displayable characters; it determines the range of characters the entire character set can represent. The coded character set assigns each character a code point, which indicates that character's position in the repertoire. The character encoding is the conversion relationship between the coded character set and the value actually stored. In general, the code point value is stored directly as the encoded value. For example, in ASCII, A occupies position 65 in the table, so the encoded value of A is 0100 0001, the binary form of decimal 65.
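As a minimal Python sketch of these three pieces (assuming a Python 3 interpreter), using the letter A:

    # The coded character set assigns 'A' the code point 65 (0x41).
    print(ord("A"))             # 65
    print(bin(ord("A")))        # 0b1000001 -> 0100 0001 when padded to 8 bits
    # The character encoding maps that code point to the byte actually stored.
    print("A".encode("ascii"))  # b'A', i.e. the single byte 0x41
    print(chr(65))              # 'A' -- decoding goes from the code point back to the character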
Seeing this, many readers may have the same question I did: the character repertoire and the coded character set both seem indispensable, and since every character in the repertoire already has its own ordinal number, why not simply store that number? Why is a separate character encoding needed to convert the number into yet another storage format? The reason is actually easy to understand: the purpose of a unified repertoire is to cover every character in the world, but in practice the proportion of characters actually used is tiny compared with the whole repertoire. For example, programs in the Chinese-speaking world hardly ever need Japanese characters, and in some English-speaking countries even a simple ASCII repertoire meets basic needs. If every character were stored as its ordinal number in the repertoire, each character would need 3 bytes (taking the Unicode repertoire as the example), which for English-speaking countries that previously used single-byte ASCII is obviously an extra cost (the storage volume triples). Put more bluntly: the same hard disk that could hold 1,500 articles in ASCII could hold only 500 under 3-byte Unicode ordinal storage. Hence variable-length encodings such as UTF-8 appeared: characters that needed only one byte in ASCII still occupy only one byte in UTF-8, while complex characters such as Chinese and Japanese require 2 to 3 bytes.
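The storage trade-off is easy to see in Python. The paragraph above imagines a fixed 3-byte-per-character scheme; Python has no such codec, so this sketch uses the fixed-width UTF-32 (4 bytes per character) as the nearest built-in stand-in, compared against variable-length UTF-8:

    english = "hello"
    chinese = "\u4e25" * 3          # three copies of the character U+4E25
    for text in (english, chinese):
        utf8_len = len(text.encode("utf-8"))
        utf32_len = len(text.encode("utf-32-be"))
        print(repr(text), "utf-8:", utf8_len, "bytes | utf-32:", utf32_len, "bytes")
    # "hello": 5 bytes in UTF-8 vs 20 in fixed-width UTF-32;
    # the Chinese string: 9 bytes in UTF-8 (3 per character) vs 12 in UTF-32.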

Character encoding method
ASCII code
We know that all information inside a computer is ultimately a binary value. Each binary digit (bit) has two states, 0 and 1, and eight bits grouped together form a byte, which can represent 256 states. In other words, one byte can represent 256 different states, each corresponding to one symbol, i.e. 256 symbols, ranging from 00000000 to 11111111.
In the 1960s, the United States formulated a set of character codes to uniformly regulate the relationship between English characters and binary digits. This is called the ASCII code and is still in use today.
The ASCII code specifies a total of 128 characters. For example, the space character (SPACE) is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control characters that cannot be printed) occupy only the last 7 bits of a byte, with the first bit uniformly set to 0.
Non-ASCII code
For encoding English, 128 symbols are enough, but for other languages they are not. In French, for example, a letter carrying a diacritical mark cannot be represented in ASCII. As a result, some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, the code of é in French is 130 (binary 10000010). In this way, the coding systems used in these European countries can represent up to 256 symbols.
However, a new problem arises here. Different countries use different letters, so even though they all use 256-symbol encodings, the same value stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In any case, in all these encodings the symbols 0-127 are identical; only the range 128-255 differs.
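As a hedged illustration, the legacy DOS code pages that ship with Python (cp437 for Western European, cp862 for Hebrew, cp866 for Russian) can stand in for the national encodings the paragraph alludes to:

    raw = bytes([130])          # the single byte 0x82, i.e. decimal 130
    print(raw.decode("cp437"))  # é  (Western European DOS code page)
    print(raw.decode("cp862"))  # ג  (Hebrew DOS code page, the letter Gimel)
    print(raw.decode("cp866"))  # В  (Russian DOS code page)
    # Bytes 0-127 decode identically under all three, matching ASCII.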
The scripts of Asian countries use even more symbols; there are as many as 100,000 Chinese characters. One byte, which can represent only 256 symbols, is certainly not enough, so multiple bytes must be used to express a single symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.
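A quick check of the two-byte claim, assuming Python 3's built-in gb2312 codec (the character chosen here is just an arbitrary common one):

    ch = "\u4e2d"                        # a common Chinese character, U+4E2D
    encoded = ch.encode("gb2312")
    print(encoded.hex(), len(encoded))   # two bytes, e.g. d6d0 2
    # GB2312 spends two bytes per Chinese character, hence at most 256 x 256 positions.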
The issue of Chinese encoding deserves a dedicated article and is not covered in this note. It is only pointed out here that although multiple bytes are used to represent a symbol, the GB-type Chinese character encodings have nothing to do with the Unicode and UTF-8 discussed below.
Unicode
As mentioned in the previous section, there are many encodings in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; decoding it with the wrong encoding produces mojibake. Why do emails so often appear garbled? Because the sender and the recipient use different encodings.
It is easy to imagine that if there were a single code that included every symbol in the world, with each symbol assigned a unique value, the mojibake problem would disappear. This is Unicode: as its name implies, an encoding for all symbols.
Unicode is of course a large collection; its current code space can hold more than one million symbols, and each symbol's code is different. For example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严 ("strict"). For the full correspondence table, you can consult unicode.org or a dedicated Chinese character correspondence table.
It should be noted that Unicode is only a symbol set: it specifies only the code of each symbol, not how that code should be stored.
For example, the Unicode code point of the Chinese character 严 is the hexadecimal number 4E25, which in binary is 15 bits long (100111000100101); in other words, representing this symbol requires at least 2 bytes. Other, larger symbols may require 3 or 4 bytes, or even more.
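Checking these numbers in Python (a small sketch, nothing more):

    cp = ord("\u4e25")               # code point of the character discussed above
    print(hex(cp))                   # 0x4e25
    print(bin(cp), cp.bit_length())  # 0b100111000100101, 15 bits -> needs at least 2 bytes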
Two serious questions arise here. First, how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that English letters need only one byte each; if Unicode decreed that every symbol be represented by three or four bytes, then every English letter would necessarily be preceded by two to three bytes of zeros, a huge waste of storage. Text files would become two or three times larger, which is unacceptable.
The result was: 1) multiple storage formats for Unicode emerged, meaning there are many different binary formats that can be used to represent Unicode; 2) Unicode could not gain wide adoption for a long time, until the emergence of the Internet.
UTF-8
The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters are represented by two or four bytes) and UTF-32 (characters are represented by four bytes), but they are rarely used on the Internet. To repeat: the relationship here is that UTF-8 is one implementation of Unicode.
One of the biggest features of UTF-8 is that it is a variable-length encoding method. It can use 1 to 4 bytes to represent a symbol, and the byte length varies according to different symbols.
UTF-8's encoding rules are very simple; there are only two:
1) For single-byte symbols, the first bit of the byte is set to 0, and the following 7 bits are the Unicode code of this symbol. Therefore, for English letters, UTF-8 encoding and ASCII code are the same.
2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of each following byte are set to 10. The remaining bits, not mentioned here, are filled entirely with the symbol's Unicode code.
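As a sketch of rule 2) for the three-byte case (code points U+0800 through U+FFFF, where n = 3 gives the template 1110xxxx 10xxxxxx 10xxxxxx), the following Python function fills the template by hand and checks the result against the built-in codec. It covers only the three-byte branch, not a full UTF-8 encoder:

    def utf8_three_bytes(cp: int) -> bytes:
        """Encode a code point in U+0800..U+FFFF with the
        1110xxxx 10xxxxxx 10xxxxxx template from rule 2)."""
        assert 0x0800 <= cp <= 0xFFFF
        b1 = 0b11100000 | (cp >> 12)           # top 4 bits of the code point
        b2 = 0b10000000 | ((cp >> 6) & 0x3F)   # middle 6 bits
        b3 = 0b10000000 | (cp & 0x3F)          # low 6 bits
        return bytes([b1, b2, b3])

    cp = 0x4E25                                # the code point used as an example earlier
    print(utf8_three_bytes(cp).hex())          # e4b8a5
    print(chr(cp).encode("utf-8").hex())       # e4b8a5 -- matches the built-in encoder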

1. Character set: a character set is just the name of a set of rules. Character set = character repertoire + coded character set + character encoding form.
2. Character repertoire (font table): the repertoire is, in effect, a database of all readable or displayable characters. It determines the range of characters that the entire character set can represent.
3. Coded character set (often abbreviated to "character set", e.g. Unicode, ASCII): in a coded character set, each character is represented by a code point, i.e. the character's position in the repertoire. This value is the character's ordinal within the coded character set (e.g. Unicode, ASCII).
4. Character encoding: character encoding is the conversion relationship between the coded character set and the value actually stored. Characters are converted into binary values and stored in the computer according to the character encoding scheme. Character encoding is therefore a mapping rule defined on the character set (character --> value actually stored in the computer).

Note: for the coded character set Unicode there are multiple character encodings, such as UTF-8, UTF-16, and UTF-32. The coded character set ASCII is itself also its own character encoding. For the coded character set GB2312, EUC-CN is only one of its character encodings.

Origin blog.csdn.net/qq_41526316/article/details/87922342