Coded character set vs. character set encoding: can't tell them apart? Will you after reading this article?

The previous article, "Why is String designed to be final, and how to design an immutable class?", left an encoding-related question: in theory, a character in Java (for example, a Chinese character) occupies two bytes, yet under UTF-8, new String("字").getBytes().length returns 3, i.e. 3 bytes. Do you know why? And how many bytes does a char actually occupy in Java?

Before answering this question, let's cover a few basics.

What is a character set? What is an encoding?

A character is the general term for text and symbols, including letters, graphic symbols, mathematical symbols, and so on. A collection of abstract characters is a character set (charset).

The word "abstract" is used because the characters mentioned here are characters that do not have any concrete form. For example, the character "Han" is seen in the article. This is actually a specific form of expression of this character, and it is its image form. When people read the word "Han", they use What is more is another specific form of expression---sound. But in any case, these two forms of expression both refer to the "Chinese" character. There may be countless forms of expression for the same character (dot matrix method, vector method, audio, etc.), and the same A character is included in the character set, which will make the set too large. Therefore, the characters in the abstract character set refer to the only existing abstract characters, ignoring its specific form of expression. After each character in an abstract character set is assigned an integer number, the character set has an order and becomes a coded character set. At the same time, this number can uniquely determine which character it refers to. For the same character, the integer numbers specified by different character set encoding systems are also different. For example, the character "儿" is numbered 0x513F in Unicode, which means it is the 0X513F character in the coded character set of Unicode. In another coded character set, this word is 0xA449.

A coded character set, then, is a set of characters each of which has been assigned an integer number. However, the number assigned to a character in a coded character set is not necessarily the value used when that character is stored in a computer. Which binary value ends up being stored is determined by the character set encoding.

The character set encoding determines how a character's integer number is mapped to a binary value. In almost all encodings of English characters, the numbers of the English letters are identical to the binary form stored in the computer. In some encoding schemes, however, such as UTF-8 (an encoding of the Unicode character set), a large portion of characters have their numbers transformed before being stored. For example, the Unicode number of "汉" is 0x6C49, but after encoding it in UTF-8 the stored bytes are 0xE6 0xB1 0x89 (3 bytes).
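As a quick illustration, here is a minimal Java sketch (assuming nothing beyond the standard library; the character and values are the ones just mentioned) that prints both the Unicode number of "汉" and its UTF-8 byte sequence:

```java
import java.nio.charset.StandardCharsets;

public class CodePointVsBytes {
    public static void main(String[] args) {
        String s = "汉";

        // The integer number assigned by the coded character set (Unicode): U+6C49
        System.out.printf("code point: U+%04X%n", s.codePointAt(0));

        // The value actually stored or transmitted depends on the character set encoding
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b);   // prints: E6 B1 89
        }
        System.out.println("(" + utf8.length + " bytes in UTF-8)");
    }
}
```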

Each character in a coded character set corresponds to a unique code value. These values are called code points, and a code point can be regarded as the character's serial number within the coded character set. The binary sequence that represents a character under a given encoding method is composed of code units.

Note: We introduce two concepts here, code point and code unit.
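In Java terms, a rough sketch of the distinction (only standard String methods are assumed): charAt() hands back a code unit, while codePointAt() hands back a code point:

```java
public class CodePointVsCodeUnit {
    public static void main(String[] args) {
        String s = "汉";

        // charAt() returns a code unit: one 16-bit char of the UTF-16 representation
        char unit = s.charAt(0);
        System.out.printf("code unit:  %04X%n", (int) unit);   // 6C49

        // codePointAt() returns a code point: the character's number in the coded character set
        int point = s.codePointAt(0);
        System.out.printf("code point: %04X%n", point);        // also 6C49, since 汉 is in the BMP
    }
}
```

For a character like "汉" the two values happen to coincide; we will see later that this is not always the case.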

Why distinguish between the two concepts of character set and encoding?

In the early days, character sets and encodings were effectively one-to-one. There were many character encoding schemes, but each character set had exactly one encoding implementation. Take GB2312 as an example: whether you said "GB2312 encoding" or "GB2312 character set", you were ultimately referring to the same thing, so nobody bothered to distinguish the terms, and you could hardly use them wrongly.

When it comes to Unicode, things change. The single Unicode character set corresponds to three encodings: UTF-8, UTF-16, and UTF-32. The concepts of character set and encoding are now completely separated and modularized; in fact, it was only in the Unicode era that the distinction became widely recognized.

1) charset is the abbreviation of character set.

2) encoding is the abbreviation of charset encoding, i.e. character set encoding.

The relationship between the two can be stated clearly:

1. An encoding depends on a character set, just as an implementation class in code depends on its interface;

2. A character set can have multiple encoding implementations, just as an interface can have multiple implementation classes (see the sketch below).
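To make the analogy concrete, here is a small sketch (assuming only the standard java.nio.charset API): the same string, i.e. the same sequence of characters from the Unicode character set, is turned into different byte sequences by different encoding "implementations":

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OneCharsetManyEncodings {
    static void dump(String s, Charset cs) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(cs.name() + ": " + hex.toString().trim());
    }

    public static void main(String[] args) {
        String s = "汉";                      // one abstract character from the Unicode character set
        dump(s, StandardCharsets.UTF_8);      // E6 B1 89   (3 bytes)
        dump(s, StandardCharsets.UTF_16BE);   // 6C 49      (2 bytes)
        dump(s, StandardCharsets.UTF_16LE);   // 49 6C      (2 bytes)
    }
}
```

Each Charset here plays the role of one "implementation class" of the same Unicode "interface".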

Why is Unicode so special?

A new character set standard usually appears for one simple reason: the old character set does not contain enough characters.

The goal of Unicode is to unify all character sets and include every character, so in principle no new character sets should ever be needed again.

But what if you think its existing encoding scheme is not good enough? Since creating yet another character set is off the table, the only place left to work on is the encoding. That is how multiple encoding implementations appeared, and how the traditional one-to-one correspondence was broken.

For historical reasons, you will also see "Unicode" and "UTF-8" mixed together in many places as if they were parallel options. In such cases, "Unicode" usually means UTF-16, or the earlier UCS-2 encoding.

We have talked about Unicode a lot by now, and for various reasons it has to be admitted that the word "Unicode" carries different meanings in different contexts. It may refer to:

1) Unicode standard

2) Unicode character set

3) Unicode's abstract numbering of characters, that is, the code points

4) A specific encoding implementation of Unicode, usually the variable-length UTF-16, or the earlier fixed-length 16-bit UCS-2.

Here we focus on the UTF-16 encoding. UTF-16 maps the code points of the Unicode character set to sequences of 16-bit integers (i.e. code units, 2 bytes each) for storage or transmission. The code point of a Unicode character requires either one or two 16-bit code units, so UTF-16 is a variable-length encoding.

UTF-16 can be regarded as a superset of UCS-2. Before supplementary-plane characters existed (the basic idea is to use two 16-bit code units, a surrogate pair, to represent one character, but only for characters beyond 65535), UTF-16 and UCS-2 meant the same thing; once supplementary-plane characters were introduced, the encoding became known as UTF-16.
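A short sketch of what that means in practice (assuming a standard JDK; the example character U+1D11E, the musical G clef, lies outside the Basic Multilingual Plane, so it needs two code units, a so-called surrogate pair):

```java
public class SurrogatePairDemo {
    public static void main(String[] args) {
        String bmp  = "汉";            // U+6C49, inside the Basic Multilingual Plane
        String supp = "\uD834\uDD1E";  // U+1D11E (musical G clef), a supplementary character

        System.out.println(bmp.length());                           // 1 code unit
        System.out.println(bmp.codePointCount(0, bmp.length()));    // 1 code point

        System.out.println(supp.length());                          // 2 code units (a surrogate pair)
        System.out.println(supp.codePointCount(0, supp.length()));  // still 1 code point
        System.out.printf("U+%X%n", supp.codePointAt(0));           // U+1D11E
    }
}
```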

Nowadays, if a piece of software claims to support UCS-2, it is really saying that it cannot handle the characters in UTF-16 that need more than 2 bytes, i.e. characters outside the Basic Multilingual Plane. For code points below 0x10000, the UTF-16 encoding is identical to the code point value.

Why do we focus on UTF-16? Because Java uses UTF-16 as its internal encoding, and that is what we usually, loosely, call "Unicode encoding".
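To tie this back to Java, a minimal sketch using only the standard library: a char is exactly one 16-bit UTF-16 code unit, so a BMP character such as "汉" fits in a single char internally, while the number of bytes it occupies externally depends on the encoding you choose:

```java
import java.nio.charset.StandardCharsets;

public class InternalVsExternal {
    public static void main(String[] args) {
        // Internally, Java strings are sequences of 16-bit UTF-16 code units
        System.out.println(Character.BYTES);       // 2: a char is 2 bytes
        System.out.printf("%04X%n", (int) '汉');   // 6C49: here the code unit equals the code point

        // Externally, the byte length depends on the chosen encoding
        System.out.println("汉".getBytes(StandardCharsets.UTF_8).length);    // 3
        System.out.println("汉".getBytes(StandardCharsets.UTF_16BE).length); // 2
    }
}
```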

I did not expect this to run so long. All I have done is introduce the difference between character sets and encodings, so it seems answering the question left by the previous article will need a second post. The summary of this article really comes down to two sentences:

The integer number assigned to each character in a coded character set is called a code point; it is that character's serial number within the coded character set. The binary sequence that represents a character under a given encoding method is composed of code units.

In the Java world, what we actually touch more often is the external encoding, that is, the character encodings a program uses when it exchanges data with the outside world, and there is plenty there that you may not know. In the next installment we will formally enter the world of Java encodings and finally answer the question from the previous article.

Origin blog.csdn.net/chuixue24/article/details/130348165