Character sets explained (one article is all you need)

I recently studied character set encoding and found that many blog posts are unfriendly to beginners: the articles are convoluted, and you have to read many of them and consult extra material before the concepts click. This article aims to be the one article you need. It gives beginners a solid understanding of character sets, builds a theoretical foundation for future development work, and offers ideas for tracking down encoding bugs.

One. What is a character set?

1. Doubts

I believe many programmers who haven't studied character encodings have run into garbled text and been left with questions such as:
Why does garbled text appear?
Which character encoding is being used here?
When I create a text file, is it GBK-encoded or UTF-8-encoded?
(By default, text files are created as ANSI, the system default encoding; Chinese Windows systems default to GBK.)
Should I use GBK or UTF-8?
What is the difference between Unicode and UTF-8?
After reading this article, these questions should be easy to answer.

2. Basic concepts

byte

This is the most basic concept. A byte is a unit of storage capacity. Computers recognize only binary digits, 0 and 1; one such digit is a bit. For convenience, we define 8 bits as one byte.

For example: the 8-bit binary number 00001111 occupies one byte of storage.

character

A character is not the same as a byte. Any letter, word, or symbol is a character, but the number of bytes it occupies varies: different encodings give the same character different sizes in memory.

For example: the punctuation mark + is one character, and the Chinese word 我们 is two characters. A Chinese character occupies 2 bytes in GBK encoding and 3 bytes in UTF-8 encoding.
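You can verify these sizes yourself. Below is a minimal sketch in Java (the class name is mine, and it assumes your JDK ships the GBK charset):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharBytes {
    public static void main(String[] args) {
        String han = "我"; // one Chinese character
        // The same character takes a different number of bytes per encoding:
        System.out.println("GBK:   " + han.getBytes(Charset.forName("GBK")).length + " bytes"); // 2 bytes
        System.out.println("UTF-8: " + han.getBytes(StandardCharsets.UTF_8).length + " bytes"); // 3 bytes
        // The punctuation mark + is one character and one byte in both encodings.
        System.out.println("'+':   " + "+".getBytes(StandardCharsets.UTF_8).length + " byte");  // 1 byte
    }
}
```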

As computing developed, programmers wanted computers to display characters, but computers recognize only the binary digits 0 and 1. Hence encoding standards were born.

Two. Encoding standards

What we call a character set is actually a sub-concept within an encoding standard. To display characters, standards organizations defined encoding standards that map different binary numbers to different characters, so that the computer can display the character corresponding to a given binary number. We usually call these "XX encoding" or "the XX character set".

For example: with the GBK encoding standard, the computer can convert between Chinese characters and binary numbers; using GBK encoding lets the computer display Chinese characters.

Next, I will introduce three sub-concepts of an encoding standard.

1. Font table

A single encoding standard does not necessarily contain every character in the world; each standard has its own usage scenarios. The font table stores all the characters the encoding standard can display. The computer looks characters up in the font table by their binary numbers and shows them to the user, so the font table is essentially a database of characters.

For example: the font table of the GBK standard stores almost all Chinese characters, so GBK can display Chinese; but French and Russian letters are not in its font table, so GBK cannot display French, Russian, or other characters it does not include.

2. Coded character set (character set)

This is the character set we usually talk about. Every character in the font table has a corresponding binary address, and the coded character set is the collection of those addresses.

For example: in the ASCII coded character set, the letter A has sequence number (address) 65, which is 01000001 in binary. We can say the coded character set stores these binary numbers: 01000001 is an element of the coded character set, and it is also the address of the letter A in the font table. From this address we can display the letter A.

Conclusion: the coded character set and the font table correspond one to one and convert back and forth; this correspondence is the key to how a computer recognizes characters.
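This address-to-character correspondence is easy to see in Java, where a char widens to its number in the coded character set (a minimal sketch; the class name is mine):

```java
public class CharAddress {
    public static void main(String[] args) {
        char c = 'A';
        int address = c; // implicit widening: the character's number in the coded character set
        System.out.println(address);                         // 65
        System.out.println(Integer.toBinaryString(address)); // 1000001
        System.out.println((char) 65);                       // A: from address back to character
    }
}
```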

3. Character encoding (encoding method)

Once we have a font table and a coded character set, we could use the binary address directly to get a character.

However, using the raw binary address of every character to represent text is very wasteful. The Unicode standard covers millions of characters; distinguishing that many characters needs at least 3 bytes, and to leave room for future expansion Unicode reserves extra unused space, so an address can take up to 4 bytes.

Therefore, to distinguish every character, even one like 00000000 00000000 00000000 00001111, whose useful content fits in 1 byte, we would have to allocate 4 bytes. A file that could be stored in 1 GB would then need 4 GB, which is extremely wasteful.
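The waste is easy to measure in Java by comparing a fixed 4-byte-per-character encoding with UTF-8 (a minimal sketch; the class name is mine, UTF-32BE stands in here for "raw 4-byte addresses", and it assumes your JDK ships the UTF-32BE charset):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FixedWidthWaste {
    public static void main(String[] args) {
        String a = "A";
        // Fixed-width 4-byte addresses (UTF-32BE) vs. variable-length UTF-8:
        System.out.println("UTF-32BE: " + a.getBytes(Charset.forName("UTF-32BE")).length + " bytes"); // 4
        System.out.println("UTF-8:    " + a.getBytes(StandardCharsets.UTF_8).length + " byte");       // 1
    }
}
```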

Programmers therefore devised space-saving algorithms, and each such algorithm is called an encoding method (below, for ease of understanding, "encoding method" is used to mean character encoding). One encoding standard can have several encoding methods, and different methods suit different scenarios.

For example: UTF-8 is an encoding method and Unicode is an encoding standard. Unicode also has two other encoding methods, UTF-16 and UTF-32. Different encoding methods save different amounts of space.

Summary: a short binary number is converted by an encoding method into the proper address in the coded character set; that address locates a character in the font table; and finally the character is displayed to the user.

At this point, you should have a clear idea of what a character set is.

Three. A brief introduction to common encoding standards

1. ASCII

ASCII, the earliest encoding standard, contains 128 characters in total, from 00000000 to 01111111: upper- and lower-case letters, Arabic numerals, and simple symbols. As you can see, ASCII needs only 1 byte of storage, with the highest bit always 0. Its full name is the American Standard Code for Information Interchange. It has no separate encoding method; a character is represented directly by the binary number of its address, which you could, if pressed, call the ASCII encoding method.

2. GBK

GBK's full name is the "Chinese Characters Internal Code Extension Specification". It supports all the CJK (Chinese, Japanese, and Korean) ideographs of the international standard ISO/IEC 10646-1 and the national standard GB 13000-1. GBK is backward compatible with ASCII: English characters keep their 1-byte encoding, while a Chinese character occupies 2 bytes. GBK has no separate encoding method; we simply speak of GBK encoding. It is commonly used in China where text is mostly Chinese.

3. ISO-8859-1

Beyond the ASCII characters, ISO-8859-1 covers the letters and symbols of Western European languages (other parts of the ISO-8859 family cover Greek, Thai, Arabic, and Hebrew). Because ISO-8859-1 assigns a character to every one of the 256 single-byte values, a byte stream in any other encoding can pass through a system that treats it as ISO-8859-1 without any bytes being dropped. In other words, it is safe to treat any byte stream as ISO-8859-1 text. This is a very important property, and the MySQL database's traditional default encoding of Latin1 takes advantage of it. ASCII is a 7-bit container; ISO-8859-1 is an 8-bit container.
As you can see, an ISO-8859-1 character occupies only 1 byte, and MySQL has traditionally defaulted to ISO-8859-1. The Tomcat server also sometimes defaults to ISO-8859-1, but ISO-8859-1 does not support Chinese; this is sometimes the reason a browser displays garbled text.
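The "no byte is ever dropped" property can be demonstrated in Java: bytes of any encoding survive a round trip through ISO-8859-1 unchanged (a minimal sketch; the class name is mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) {
        byte[] utf8 = "汉".getBytes(StandardCharsets.UTF_8); // 3 UTF-8 bytes
        // Treat the UTF-8 bytes as ISO-8859-1 text: the result looks like nonsense...
        String mojibake = new String(utf8, StandardCharsets.ISO_8859_1);
        // ...but every byte value maps to exactly one char, so encoding back is lossless.
        byte[] roundTripped = mojibake.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(utf8, roundTripped)); // true: no byte was dropped
    }
}
```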

4. Unicode

From the encoding standards above, you can see that they are mutually incompatible, each expressing only the characters it needs. The International Organization for Standardization (ISO) therefore decided to develop one universal encoding standard: Unicode.
Unicode includes all the characters in the world. A Unicode address can take up to 4 bytes, which means that to distinguish every character, every character's address would need 4 bytes. That wastes a lot of storage, so programmers designed several character encoding methods for it, such as UTF-8, UTF-16, and UTF-32.
The one programmers use most widely is UTF-8, a variable-length character encoding. Note: UTF-8 is not an encoding standard but an encoding method. Here are UTF-8's encoding rules.

Encoding rule table
Unicode code point range (hex)  | UTF-8 binary layout
0000 0000 - 0000 007F           | 0xxxxxxx
0000 0080 - 0000 07FF           | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF           | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF           | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

As the table shows, for characters that need only 1 byte, UTF-8 uses the ASCII encoding directly, with the highest bit set to 0.

For example: the letter A (01000001) is represented simply as 01000001; a single-byte character is represented directly by its address.

For a character of n bytes (n > 1), that is, a character larger than one byte, the first byte starts with n ones, bit n+1 is 0, and each of the following bytes starts with 10. All the remaining bits, not mentioned so far, are filled with the character's Unicode code.

For example: the Chinese character 严 has Unicode code 4E25. Converted to binary, that is 00000000 00000000 01001110 00100101, with 15 significant bits. From the table above, its UTF-8 encoding occupies 3 bytes, so the first byte starts with three 1s, the 4th bit (bit n+1) is 0, and each of the next two bytes starts with 10: 1110xxxx 10xxxxxx 10xxxxxx. Filling in the code's bits gives 11100100 10111000 10100101, a total of 24 bits, or 3 bytes.

As you can see, after UTF-8 encoding English still occupies only 1 byte, while a Chinese character occupies 3 bytes.
Although UTF-8 does not take less space than GBK for Chinese, it works for the whole world. Which encoding to use depends on the specific environment.
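The worked example above (严, code 4E25) can be checked in Java by applying the bit layout by hand and comparing with the library's UTF-8 encoder (a minimal sketch; the class name is mine):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ByHand {
    public static void main(String[] args) {
        int cp = 0x4E25; // Unicode code of 严, in the 3-byte range 0800-FFFF
        byte[] manual = {
            (byte) (0b1110_0000 | (cp >> 12)),         // 1110xxxx: top 4 bits of the code
            (byte) (0b1000_0000 | ((cp >> 6) & 0x3F)), // 10xxxxxx: middle 6 bits
            (byte) (0b1000_0000 | (cp & 0x3F))         // 10xxxxxx: low 6 bits
        };
        byte[] library = "严".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library)); // true
        for (byte b : manual) {
            System.out.printf("%02X ", b); // E4 B8 A5
        }
    }
}
```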

Four. Encoding and decoding

decoding

Converting a string of binary numbers into characters using an encoding method is called decoding. Like unlocking a password, a programmer can attempt the decoding with any encoding method, but usually only one of them unlocks the correct text; a wrong encoding method produces other, nonsensical characters instead. That is what we usually call garbled text!

encoding

A decoded string of characters can likewise be converted back into a string of binary numbers using any encoding; this process is encoding, and you can think of it as the encryption step. Whichever encoding method is used, the end result is a binary number the computer can recognize. But if the standard's font table does not contain the target character, no corresponding binary number exists in the character set, and the result is irreversibly garbled! For example, the ISO-8859-1 font table contains no Chinese, so even if Chinese characters are encoded with ISO-8859-1 and then decoded with ISO-8859-1, the correct Chinese characters cannot be displayed.

From the above, you can see that garbled text results from using inconsistent encoding methods when encoding and decoding, or from characters that are missing from the font table at encoding time.

Five. Code demonstration

Taking Java as an example, we first encode a Chinese string into a byte array using the UTF-8 encoding method, then print the byte array.

String chinese = "汉";
// Encode using UTF-8.
byte[] bs = chinese.getBytes("UTF-8");
for (byte b : bs) {
    System.out.print(b + " ");
}

result:

-26 -79 -119

A single Chinese character became 3 bytes, confirming that a Chinese character occupies 3 bytes in UTF-8 encoding.

Next, let's decode the byte array using the UTF-8 encoding method.

// Decode using UTF-8.
String utf8 = new String(bs,"UTF-8");
System.out.println(utf8);

result:

汉

The Chinese character is displayed correctly after decoding.
But what if we decode with GBK instead?

// Decode using GBK.
String gbk = new String(bs, "GBK");
System.out.println(gbk);

result:

?

This shows that using the wrong decoding method produces garbled text.

But what if Chinese characters are encoded with ISO-8859-1 and then decoded with it?

String chinese = "我是帅哥";
// Encode using ISO-8859-1.
byte[] bs = chinese.getBytes("ISO-8859-1");
for (byte b : bs) {
    System.out.print(b + " ");
}
// Decode using ISO-8859-1.
String iso = new String(bs, "ISO-8859-1");
System.out.println("\n" + iso);

result:

63 63 63 63 
????

No matter which Chinese character it is, it becomes 63 (the code of '?') after ISO-8859-1 encoding, because the character is not in the ISO-8859-1 character set; the computer can no longer tell which character it was.
Even decoding with ISO-8859-1 afterwards cannot bring the Chinese characters back.
After reading this article, I recommend running your own experiments to consolidate what you've learned. I hope it helps; more posts on new topics will follow.

Did this article help you? Give it a like~

Please indicate the source for reprinting: https://blog.csdn.net/qq_42068856/article/details/83792174
