Understand the encoding rules of UTF-8 in one article

I previously wrote an article, "Completely Understand Computer Chinese Encoding", which covered only GB2312 and did not touch UTF-8. After some reading, I found that UTF-8 is a variable-length character encoding for Unicode, so I am writing it up here.
The current Chinese national standard for the Chinese coded character set is GB 18030-2022, "Information Technology: Chinese Coded Character Set".

First, it helps to separate two roles. GB 18030 is a character set standard: it defines which Chinese characters a computer system needs to represent. UTF-8 is an encoding method: it defines how a character's code is represented as bytes in the computer system.
Let's take the Chinese character 爸 ("dad") as an example and see how it is defined in GB 18030. Its entry in the code table reads:
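B0: first byte
D: high nibble (upper four bits) of the second byte
6: low nibble (lower four bits) of the second byte
爸: glyph
7238 (hexadecimal): GB/T 13000 code position
Put together, the GB 18030 encoding of 爸 is the two bytes B0 D6, and its GB/T 13000 code position is 7238.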
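The table entry can be checked from Java with a small sketch. The class and method names here (`CodePointLookup`, `codePointHex`, `toHex`) are made up for illustration, and the GB 18030 part assumes the running JDK ships that charset, which is why it is guarded by `Charset.isSupported`.

```java
import java.nio.charset.Charset;

public class CodePointLookup {

    // Formats the first code point of s as four or more uppercase hex digits.
    static String codePointHex(String s) {
        return String.format("%04X", s.codePointAt(0));
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u7238"; // the character 爸, written as an escape
        System.out.println("code position: " + codePointHex(s));

        // GB18030 is not one of the charsets every JVM must provide,
        // so this part only runs when the charset is available.
        if (Charset.isSupported("GB18030")) {
            byte[] gb = s.getBytes(Charset.forName("GB18030"));
            System.out.println("GB 18030 bytes: " + toHex(gb));
        }
    }
}
```

On a JDK that includes the charset, the second line should show the same B0 D6 pair that the table lists.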

GB 13000 in full: national standard GB 13000-2010, "Information Technology: Universal Multi-Octet Coded Character Set (UCS), Part 1: Architecture and Basic Multilingual Plane". It is identical (IDT) to the international standard ISO/IEC 10646:2003 of the same title. The Unicode standard is consistent with GB 13000 on the Basic Multilingual Plane, and it adopts the UTF-16 scheme as the way to implement the 15 supplementary planes, 01 through 0F, in the future. In other respects it is basically the same as GB 13000.
To make it easier to process multiple languages at the same time, the coded character set working group under the International Organization for Standardization developed a new coded character set standard, ISO/IEC 10646. It was first published in 1993, and at that time only Part 1 was issued, as ISO/IEC 10646-1:1993. The corresponding Chinese national standard is GB 13000.1-93, "Information Technology: Universal Multi-Octet Coded Character Set (UCS), Part 1: Architecture and Basic Multilingual Plane". The goal of the standard is to encode all of the world's characters in one scheme so that computers can process them uniformly.

The UTF-8 encoding rules are as follows.
UTF-8 has single-byte, two-byte, three-byte, and four-byte forms. The leading bits of each byte are fixed markers, and the x positions carry the bits of the character's code:
0xxxxxxx (7 bits)
110xxxxx 10xxxxxx (11 bits)
1110xxxx 10xxxxxx 10xxxxxx (16 bits, the form used for Chinese)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 bits)
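Which form applies follows from how many payload bits the code needs. A minimal sketch of that choice is below. The class and method names (`Utf8Length`, `byteCount`) are made up for illustration, and for brevity the sketch does not reject surrogate values (D800 to DFFF), which a strict encoder would.

```java
public class Utf8Length {

    // Returns how many bytes UTF-8 needs for one code point,
    // based on the payload capacity of each of the four forms.
    static int byteCount(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point");
        }
        if (codePoint <= 0x7F) {
            return 1; // fits in 7 bits
        }
        if (codePoint <= 0x7FF) {
            return 2; // fits in 11 bits
        }
        if (codePoint <= 0xFFFF) {
            return 3; // fits in 16 bits
        }
        return 4;     // needs up to 21 bits
    }

    public static void main(String[] args) {
        System.out.println(byteCount('A'));     // 1
        System.out.println(byteCount(0x00E9));  // 2
        System.out.println(byteCount(0x7238));  // 3
        System.out.println(byteCount(0x1F600)); // 4
    }
}
```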

Chinese characters in the Basic Multilingual Plane use the three-byte form, so the character 爸 is converted as follows:
7238 in binary: 0111 0010 0011 1000
Filled into the three-byte pattern: 11100111 10001000 10111000
Converted to hexadecimal: E7 88 B8
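The same bit filling can be written with shifts and masks. This is a sketch under the assumption that only the three-byte form matters, and the names `ThreeByteEncode` and `encode3` are made up for illustration.

```java
public class ThreeByteEncode {

    // Encodes a code point in the range 0x0800..0xFFFF using the
    // three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] encode3(int cp) {
        if (cp < 0x0800 || cp > 0xFFFF) {
            throw new IllegalArgumentException("outside the three-byte range");
        }
        return new byte[] {
            (byte) (0xE0 | (cp >> 12)),         // 1110 + top 4 bits
            (byte) (0x80 | ((cp >> 6) & 0x3F)), // 10 + middle 6 bits
            (byte) (0x80 | (cp & 0x3F))         // 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        for (byte b : encode3(0x7238)) {
            // Mask with 0xFF because Java bytes are signed.
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // the line printed is: E7 88 B8
    }
}
```

Here the top four bits of 7238 are 0111, the middle six are 001000, and the low six are 111000, which is exactly the grouping used in the manual conversion.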

Program verification:

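import java.io.UnsupportedEncodingException;

public class GBKTest {

    // Lookup table mapping a value 0-15 to its hexadecimal digit.
    private static final String[] hexDigits = {
            "0", "1", "2", "3", "4", "5", "6", "7",
            "8", "9", "a", "b", "c", "d", "e", "f" };

    public static void main(String[] args) throws UnsupportedEncodingException {
        String nh = "爸";

        // Encode the string as UTF-8 and print each byte as two hex digits.
        byte[] bs = nh.getBytes("utf-8");
        for (int i = 0; i < bs.length; i++) {
            // Java bytes are signed, so map negative values back to 0-255.
            int n = bs[i];
            if (n < 0)
                n += 256;
            int d1 = n / 16; // high nibble
            int d2 = n % 16; // low nibble
            System.out.println(hexDigits[d1] + " " + hexDigits[d2]);
        }
        // With the source file compiled as UTF-8, this prints: e 7, 8 8, b 8
    }
}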
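The check also works in the other direction: stripping the marker bits from E7 88 B8 should give back 7238. A minimal sketch follows. The names `Utf8RoundTrip` and `decode3` are made up for illustration, and the sketch handles only the three-byte form without checking for overlong encodings.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    // Decodes one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
    // back into its code point by removing the marker bits.
    static int decode3(byte b0, byte b1, byte b2) {
        if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a three-byte UTF-8 sequence");
        }
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        byte[] bs = { (byte) 0xE7, (byte) 0x88, (byte) 0xB8 };

        // Manual decoding of the three bytes.
        System.out.println(Integer.toHexString(decode3(bs[0], bs[1], bs[2]))); // 7238

        // The JDK's own decoder, for comparison.
        String s = new String(bs, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 7238
    }
}
```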

Origin blog.csdn.net/lzx5290/article/details/133572713