Encoding character set Unicode and character encoding UTF-8, UTF-16, UTF-32

Everyone has probably run into these terms many times, but I had never really studied them. Today I looked into it, and I think I have finally worked out how these things relate to each other.

Main reference materials: Document 1, Document 2

First, we must understand four concepts: character set, code point, coded character set, and character encoding.

1. Four concepts

1.1 Character set

A selected set of characters constitutes a character set; for example, all English letters, or all Chinese characters, can each form a character set.

1.2 Code point (code position)

Number each character in the character set; the numeric value assigned to a character is its code point.

1.3 Coded character set

Once every character in the character set has been numbered, the resulting set is called a coded character set. Common coded character sets include ASCII, Unicode, GBK, and so on.

1.4 Character encoding form

A character encoding converts the code points of a character set, according to certain rules, into sequences of binary bits that can be stored in a computer; in other words, it establishes a conversion or mapping relationship.

But isn't the coded character set above already a mapping? It is, but when characters are stored in a computer they are not necessarily stored directly as their numeric values (code points). A single character set may even correspond to several encodings (as with Unicode). Of course, some encodings do store the code point value directly as the character code (as with ASCII).
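A quick illustration of that distinction, assuming a Python 3 interpreter (where a `str` is a sequence of code points): the code point of a character is fixed by the coded character set, but the stored bytes depend on which encoding you pick.

```python
ch = "中"                            # a CJK character, code point U+4E2D
print(ord(ch))                       # 20013 == 0x4E2D: the code point itself
print(ch.encode("utf-8").hex())      # 'e4b8ad'   -> 3 bytes in UTF-8
print(ch.encode("utf-16-be").hex())  # '4e2d'     -> 2 bytes in UTF-16
print(ch.encode("utf-32-be").hex())  # '00004e2d' -> 4 bytes in UTF-32
```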

2. Specific applications

2.1 Unicode

The Unicode character set is maintained by the Unicode Consortium, a body made up of multilingual software vendors, together with the ISO-10646 working group of the International Organization for Standardization. It assigns a uniform, unique code point to every character in every language, to meet the requirements of cross-language, cross-platform text conversion and processing.

Unicode code points range from 0x0 to 0x10FFFF, a total of 1,114,112 code points, divided into 17 planes numbered 0 through 16, each containing 65,536 code points (i.e., the range covered by two bytes). Plane 0, the most commonly used, is called the Basic Multilingual Plane (BMP); the other planes are called supplementary planes.

A Unicode code point is usually written as U+xxxx (where each x is a hexadecimal digit).
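A small sketch of the notation and the plane split, assuming Python 3 (`describe` is just a hypothetical helper name): `ord` gives the code point, and shifting it right by 16 bits yields the plane number, since each plane spans 0x10000 code points.

```python
def describe(ch):
    """Format a character's code point as U+XXXX and name its plane."""
    cp = ord(ch)
    plane = cp >> 16                 # each plane spans 0x10000 code points
    kind = "BMP" if plane == 0 else f"supplementary plane {plane}"
    return f"U+{cp:04X} ({kind})"

print(describe("A"))    # U+0041 (BMP)
print(describe("中"))   # U+4E2D (BMP)
print(describe("😀"))   # U+1F600 (supplementary plane 1)
```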

With the coded character set defined, we still need to store its characters in the computer. There are three character encoding schemes, collectively called UTF (Unicode Transformation Format), for storing Unicode:

  • UTF-8 (variable length encoding)
  • UTF-16 (variable length encoding)
  • UTF-32 (equal length encoding)

2.2 UTF-32

How do we actually store Unicode in the computer?

The most straightforward idea is a fixed-length conversion: store each character in the computer directly as its code point value (just like ASCII). That is UTF-32 (representing all 1,114,112 code points requires 4 bytes per character). Its space efficiency is low, though (many code points are rarely used), so it is not widely adopted.
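A minimal Python check of the fixed-length property: every character costs 4 bytes in UTF-32, no matter how small its code point is (the `-be` suffix just pins the byte order and suppresses the byte-order mark).

```python
for ch in ("A", "中", "😀"):
    data = ch.encode("utf-32-be")    # big-endian, no byte-order mark
    print(ch, data.hex(), len(data)) # always 4 bytes per character
# A 00000041 4
# 中 00004e2d 4
# 😀 0001f600 4
```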

2.3 UTF-16

As mentioned earlier, the Basic Multilingual Plane of Unicode (the most commonly used one) contains 65,536 code points, which fit in just 2 bytes. Hence the variable-length encoding UTF-16, which represents a character with 1 or 2 16-bit code units:

  1. For code points from U+0000 to U+D7FF and from U+E000 to U+FFFF: the code point is converted directly into the equivalent 16-bit binary sequence.
  2. For code points from U+10000 to U+10FFFF: subtract 0x10000 to get a 20-bit value in the range 0x00000~0xFFFFF, then split it into a high 10 bits and a low 10 bits. The first code unit = high 10 bits + 0xD800, and the second code unit = low 10 bits + 0xDC00 (in effect, a fixed prefix is added in front of the high 10 bits and the low 10 bits); see the sketch after this list.
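Here is a hand-rolled sketch of rule 2, checked against Python's built-in codec (`to_utf16_units` is a hypothetical helper, and U+1F600 is just a sample supplementary-plane code point):

```python
def to_utf16_units(cp):
    """Encode one code point as UTF-16 code units, per the two rules above."""
    if cp <= 0xFFFF:                      # rule 1: BMP, a single code unit
        return [cp]
    v = cp - 0x10000                      # rule 2: 20-bit value 0x00000~0xFFFFF
    high, low = v >> 10, v & 0x3FF        # split into high and low 10 bits
    return [0xD800 + high, 0xDC00 + low]  # add the surrogate prefixes

units = to_utf16_units(ord("😀"))         # U+1F600
print([hex(u) for u in units])            # ['0xd83d', '0xde00']
print("😀".encode("utf-16-be").hex())     # 'd83dde00' -- matches the codec
```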

2.4 UTF-8

For European and American countries whose native language is English, most of the characters they use can be found in ASCII, so UTF-16 is bloated for them (its space utilization is low). Hence UTF-8, whose single-byte sequences are identical to ASCII.
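The ASCII-compatibility claim is easy to verify (Python 3 assumed): for pure-ASCII text, the UTF-8 bytes are byte-for-byte the same as the ASCII bytes.

```python
s = "Hello, world!"
assert s.encode("utf-8") == s.encode("ascii")  # identical byte sequences
print(s.encode("utf-8").hex())                 # 48656c6c6f2c20776f726c6421
```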

The UTF-8 code unit is one byte, and a character occupies one of four variable-length forms (a short demonstration follows the list):

  • 0xxxxxxx: a byte starting with 0 stands alone; the x bits carry the code point. One byte is one unit, exactly the same as ASCII.
  • 110xxxxx 10xxxxxx: in this format, two bytes form one unit.
  • 1110xxxx 10xxxxxx 10xxxxxx: in this format, three bytes form one unit.
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: in this format, four bytes form one unit (for code points beyond the BMP).
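A quick look at how those forms play out in practice (Python 3 assumed): the number of leading 1-bits in the first byte tells you how many bytes the sequence occupies.

```python
for ch in ("A", "é", "中", "😀"):
    data = ch.encode("utf-8")
    lead = f"{data[0]:08b}"   # bit pattern of the lead byte
    print(f"{ch}  U+{ord(ch):04X}  {data.hex()}  lead byte {lead}  {len(data)} byte(s)")
# A  U+0041   41        lead byte 01000001  1 byte(s)
# é  U+00E9   c3a9      lead byte 11000011  2 byte(s)
# 中 U+4E2D   e4b8ad    lead byte 11100100  3 byte(s)
# 😀 U+1F600  f09f9880  lead byte 11110000  4 byte(s)
```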

For people in Europe and America, UTF-8 is the obvious choice, but for Chinese text the decision is harder. Many of our characters need 3 bytes in UTF-8 (lower storage efficiency than UTF-16). On top of that, because of its variable-length representation, counting characters and indexing into a string are inefficient, so UTF-16 or UTF-32 may be used more often in memory.
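To illustrate the counting and indexing point (Python 3 assumed): with variable-length encodings the character count and the byte count diverge, so a byte offset cannot be used directly as a character index; only the fixed-length UTF-32 allows that.

```python
s = "abc中文😀"
print(len(s))                      # 6 characters (code points)
print(len(s.encode("utf-8")))      # 13 bytes: 3*1 + 2*3 + 1*4
print(len(s.encode("utf-16-le")))  # 14 bytes: 5*2 + 1*4 (surrogate pair)
print(len(s.encode("utf-32-le")))  # 24 bytes: 6*4, so index = offset // 4
```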

Source: blog.csdn.net/weixin_44338712/article/details/108881339