Character set and encoding series: Unicode character set

Basic concepts

 

1. Character set: To display text, a computer first needs an agreed collection of all the characters it supports. This collection is called a character set (Charset).

2. Code chart: Each character in the character set is assigned a number, and the resulting table of characters and numbers is called a code chart (or code table). For example, the Chinese character 霸 (bà) corresponds to 38712 (decimal), or 9738 (hexadecimal), in the code chart.

3. Encoding method: Given a code chart, we still have to decide how many bytes represent each character and, when a character takes more than one byte, in what order those bytes are read. These rules make up the character encoding method (Encoding).

For the sake of rigor I have described the character set and the code chart separately, but in practice many character sets also assign a number (a code point) to each character. As a result, people often use "character set" and "code chart" interchangeably.
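The three concepts above can be seen side by side in a few lines of Python. This is only an illustrative sketch: the character is the example from the code chart item, and the three encodings shown (UTF-16 big-endian, UTF-16 little-endian, UTF-8) are chosen here as examples and are not discussed in this article.

```python
# The example character from the code chart item, written as an escape (hex 9738).
ch = "\u9738"

# Code chart: the number assigned to the character (its code point).
code_point = ord(ch)

# Encoding method: how that number is laid out as bytes.
# The same code point gives different byte sequences under different encodings.
utf16_be = ch.encode("utf-16-be")   # 2 bytes, high byte first
utf16_le = ch.encode("utf-16-le")   # 2 bytes, low byte first
utf8 = ch.encode("utf-8")           # 3 bytes for this character

print(code_point, hex(code_point))                 # 38712 0x9738
print(utf16_be.hex(), utf16_le.hex(), utf8.hex())  # 9738 3897 e99cb8
```

The two UTF-16 results contain the same two bytes in opposite order, which is exactly the "reading order of the bytes" that an encoding method has to fix.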

Unicode

The character set now used internationally is Unicode (sometimes translated as "universal code" or "unified code"), which covers the characters of essentially all the world's writing systems. It was created to solve the problem that different regions used different character sets, which made software incompatible and text hard to share; a single common character set benefits software developers and users alike. The picture below shows the code chart of Chinese characters (CJK) in the Unicode character set. Each Chinese character has a number; for example, 4E00 is the number of 一 (the character for "one").
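As a quick check of the 4E00 example, the lookup can be done in both directions. This sketch uses Python's standard `unicodedata` module, which is an illustration choice and not something the article itself relies on.

```python
import unicodedata

# 4E00 is the example number from the text above.
code_point = 0x4E00

character = chr(code_point)          # number -> character
round_trip = ord(character)          # character -> number
name = unicodedata.name(character)   # name recorded in the Unicode database

print(f"U+{round_trip:04X}", name)   # U+4E00 CJK UNIFIED IDEOGRAPH-4E00
```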

 

 

The Java language uses the Unicode character set by default.

The Unicode character set includes all the major scripts in the world. Originally, every character was numbered with 2 bytes, giving a value range of 0x0000~0xFFFF and room for 65536 characters.

Later, 2 bytes turned out not to be enough once uncommon characters, emoji, and so on were added, so the Unicode character set kept expanding. The code space now has a maximum value of 0x10FFFF, a 21-bit number that needs 3 bytes to hold, so characters numbered above 0xFFFF no longer fit in 2 bytes. (This describes the size of the number itself, not the byte length under any particular encoding.)
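A small sketch makes the "2 bytes are not enough" point concrete by counting how many bytes are needed just to hold the number of a character. The helper name `bytes_for_value` and the sample values are made up for this illustration, and the result is the size of the raw number, not the length under UTF-8 or UTF-16.

```python
MAX_CODE_POINT = 0x10FFFF  # largest value in the Unicode code space


def bytes_for_value(code_point: int) -> int:
    """Smallest number of whole bytes that can hold the code point value."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the Unicode code space: {code_point:#x}")
    return max(1, (code_point.bit_length() + 7) // 8)


# Sample values: a Latin letter, a CJK character, the last 2-byte value,
# an emoji (U+1F600), and the maximum code point.
samples = [0x0041, 0x4E00, 0xFFFF, 0x1F600, MAX_CODE_POINT]

for cp in samples:
    print(f"U+{cp:04X}: {cp.bit_length()} bits, {bytes_for_value(cp)} byte(s)")
```

Everything up to 0xFFFF fits in 2 bytes; the emoji and the maximum value both need 3.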

Plane and Block

After Unicode was expanded, the concept of a plane (Plane) was introduced to make the code space easier to describe and allocate. The two low bytes (0x0000~0xFFFF) still give the position of the character, and the value of the third byte gives the plane number. For example, in the Unicode code 0x10FFFF, 0x10 is the plane number. Unicode therefore has 17 planes in total, numbered 0 to 16, and each plane can hold 0x10000 (65536) characters, so:

Theoretically, the number of characters that Unicode can accommodate = 0x10FFFF + 1 = number of planes × capacity per plane = 17 × 65536 = 1114112
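The plane arithmetic can be written out directly. In this sketch the names `plane_of` and `offset_in_plane` are hypothetical helpers introduced only for illustration.

```python
PLANE_SIZE = 0x10000       # 65536 positions per plane
MAX_CODE_POINT = 0x10FFFF  # largest Unicode code point


def plane_of(code_point: int) -> int:
    """Plane number: everything above the low two bytes."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the Unicode code space: {code_point:#x}")
    return code_point >> 16


def offset_in_plane(code_point: int) -> int:
    """Position inside the plane: the low two bytes."""
    return code_point & 0xFFFF


plane_count = plane_of(MAX_CODE_POINT) + 1  # planes 0..16
total = plane_count * PLANE_SIZE

print(plane_count, total, hex(total))  # 17 1114112 0x110000
```

Note that the total is 0x110000, one more than the maximum code point 0x10FFFF, because numbering starts at 0.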

Each plane is further divided into blocks. As used here, a block is a space of one byte, which can hold up to 256 characters. A plane is a space of 2 bytes (0x0000~0xFFFF): the low byte (00~FF) is the position within a block, and the high byte (00~FF) is the block number. (The named blocks in the Unicode Standard itself vary in size; the 256-character unit described here is a convenient way to read the code chart.) So:

In theory, the number of characters in a plane = 256 blocks × 256 characters per block = 65536 characters
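Putting the plane and the 256-character block together, any code value splits into three bytes: plane number, block number, and position within the block. The function names below are hypothetical, chosen just for this sketch.

```python
def split_code_point(code_point: int) -> tuple[int, int, int]:
    """Split a code point into (plane, block, position) bytes."""
    plane = code_point >> 16
    block = (code_point >> 8) & 0xFF
    position = code_point & 0xFF
    return plane, block, position


def join_code_point(plane: int, block: int, position: int) -> int:
    """Inverse of split_code_point."""
    return (plane << 16) | (block << 8) | position


for cp in (0x4E00, 0x9738, 0x10FFFF):
    plane, block, position = split_code_point(cp)
    print(f"{cp:06X} -> plane {plane:02X}, block {block:02X}, position {position:02X}")

blocks_per_plane = 0xFF + 1       # block numbers 00..FF
positions_per_block = 0xFF + 1    # positions 00..FF
print(blocks_per_plane * positions_per_block)  # 65536
```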

BMP plane

The most important and most commonly used plane of Unicode is the first plane, referred to as the BMP (Basic Multilingual Plane), numbered 0. It corresponds to the original 65536 positions, which already contain the commonly used characters of the world's languages. Among them, Chinese characters (CJK) occupy the most space. They fill the blocks shown in light red from 34 to 9F in the figure below; counted inclusively, that is 108 blocks (0x9F − 0x34 + 1), and each block can represent 256 characters. Therefore,

In theory, the BMP plane contains the number of Chinese characters = 108 × 256 = 27648
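The block count for 34 to 9F is easy to get wrong by one, so it is worth checking mechanically: counting both end blocks gives 108. The sketch below also compares that theoretical upper bound with the two CJK ranges inside those blocks. The range boundaries are an assumption taken from the Unicode block list as I recall it (Extension A at 3400~4DBF, the main CJK Unified Ideographs range at 4E00~9FFF), not from this article.

```python
BLOCK_SIZE = 256
FIRST_BLOCK, LAST_BLOCK = 0x34, 0x9F

# Inclusive count of 256-character blocks from 34 to 9F.
block_count = LAST_BLOCK - FIRST_BLOCK + 1
upper_bound = block_count * BLOCK_SIZE

# Assumed CJK ranges inside those blocks (boundaries are not from the article).
cjk_ranges = {
    "CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
}
range_total = sum(end - start + 1 for start, end in cjk_ranges.values())

print(block_count, upper_bound)   # 108 27648
print(range_total)                # 27584
```

The gap of 64 positions comes from the part of block 4D (4DC0~4DFF) that lies between the two ranges.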

Overall usage of the planes

At present, the character distribution across the 17 planes, 0 to 16, is as follows. Planes 4~13 are free, with no characters allocated yet. Plane 14 is a special-purpose plane, and planes 15~16 are private-use planes.

Besides plane 0 (the BMP), planes 1, 2, and 3 have also had many characters added. Chinese characters, which already take the largest share of plane 0, fill planes 2 and 3 as well: these two planes consist almost entirely of newly added and rare Chinese characters.
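The plane layout described above can be turned into a small lookup. The plane names in this sketch are the ones used by the Unicode Standard as I recall them; they do not appear in this article, and `describe_plane` is a hypothetical helper.

```python
# Plane names as assumed from the Unicode Standard; planes 4..13 have no entry.
PLANE_NAMES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    3: "Tertiary Ideographic Plane (TIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
    15: "Supplementary Private Use Area-A",
    16: "Supplementary Private Use Area-B",
}


def describe_plane(code_point: int) -> tuple[int, str]:
    """Return (plane number, plane name) for a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode code space: {code_point:#x}")
    plane = code_point >> 16
    return plane, PLANE_NAMES.get(plane, "unassigned")


for cp in (0x4E00, 0x1F600, 0x20000, 0x50000, 0xE0001, 0x10FFFF):
    plane, name = describe_plane(cp)
    print(f"U+{cp:04X}: plane {plane} ({name})")
```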

That covers the basics of the Unicode character set. The next article will talk about Unicode and the char type in Java.

 Unicode authoritative data portal:

        Official website user homepage: Unicode official website 

        Official website technology homepage: Unicode Standard 

        The latest version at the time of writing is 14.0, released in September 2021: Unicode 14.0.0

        The latest code chart: Unicode 14.0 Character Code Charts

Origin blog.csdn.net/liudun_cool/article/details/120739653