One article to unlock the little secrets of string encoding in Java

Introduction

In this article, you will learn about the relationship between Unicode and UTF-8, UTF-16, and UTF-32. You will also learn about modified UTF-8, and see how UTF-8 and modified UTF-8 are used in Java.

Let's take a look.

The history of Unicode

A long time ago, a high-tech product called the computer appeared in the Western world.

The first generation of computers could only perform simple arithmetic, and programs had to be fed in on hand-punched cards. As time went by, computers became smaller and more powerful, punched cards disappeared, and programs came to be written by hand in programming languages.

Everything changed, but one thing stayed the same: computers and programming languages circulated only in the West. And for daily communication in the West, 26 letters plus a limited set of punctuation marks were enough.

Early computer storage was very expensive, so a single byte, that is 8 bits, was used to store every character in use. Setting aside the first bit, that leaves 128 possible values, which was enough for the 26 lowercase and 26 uppercase letters, the digits, and some punctuation marks.

This is the original ASCII, short for the American Standard Code for Information Interchange.

After computers spread around the world, people realized that ASCII encoding was no longer enough. Chinese, for example, has more than 4,000 characters in common use. What to do?

No problem: localized extensions of ASCII, collectively called ANSI encodings, were created. If 1 byte is not enough, just use 2 bytes. Roads are made by people walking them, and encodings exist to serve people. As a result, encoding standards such as GB2312, BIG5, and JIS were born. Although these encodings are all compatible with ASCII, they are not compatible with one another.

This seriously hindered internationalization. How could the dream of one world, one home ever be realized?

So an international organization took action and defined the Unicode character set, which assigns a unique code point to every character in every language. The Unicode character set ranges from U+0000 to U+10FFFF.

So what is the relationship between Unicode and UTF-8, UTF-16, and UTF-32?

The Unicode character set ultimately has to be stored in a file or in memory. Storing every code point directly would take too much space, so how should it be stored? With a fixed number of bytes, or a variable number? Depending on the answer, we get the different encoding forms: UTF-8, UTF-16, UTF-32, and so on.

Among them, UTF-8 is a variable-length encoding scheme that uses 1 to 4 bytes per character, and UTF-16 uses 2 or 4 bytes per character. Incidentally, since JDK 9 the internal storage of String has had two forms: LATIN1 and UTF16.

UTF-32 uses a fixed 4 bytes per character. Of the three encodings, only UTF-8 is compatible with ASCII, which is one reason UTF-8 is the most widespread encoding in the world (after all, computer technology grew up in the West).
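
To make these size differences concrete, here is a minimal Java sketch (the class name is just for illustration; it assumes a standard JDK, where the UTF-32 charsets are available by name even though StandardCharsets has no constant for them):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    public static void main(String[] args) {
        // "a" is ASCII, "中" is a BMP CJK character, "😀" lies outside the BMP
        for (String s : new String[]{"a", "中", "😀"}) {
            System.out.printf("%s -> UTF-8: %d, UTF-16: %d, UTF-32: %d bytes%n",
                    s,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length,   // BE form, no BOM
                    s.getBytes(Charset.forName("UTF-32BE")).length);
        }
        // Prints: a -> UTF-8: 1, UTF-16: 2, UTF-32: 4 bytes
        //         中 -> UTF-8: 3, UTF-16: 2, UTF-32: 4 bytes
        //         😀 -> UTF-8: 4, UTF-16: 4, UTF-32: 4 bytes
    }
}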

Unicode explained

Now that we know the history of Unicode, let's look in detail at how Unicode is encoded.

The Unicode standard has gone through many releases, from version 1.0 in 1991 to the latest, version 13.0, released in March 2020.

The range of code points that Unicode can represent runs from 0 to 10FFFF in hexadecimal, written as U+0000 to U+10FFFF.

Among them, the code points U+D800 to U+DFFF are reserved for UTF-16, so the actual number of Unicode characters is 2^20 + 2^16 − 2^11 = 1,112,064.

These Unicode code points are divided into 17 planes, Plane 0 through Plane 16, each containing 65,536 (2^16) code points.

Take Plane 0 as an example: the Basic Multilingual Plane (BMP) covers code points U+0000 to U+FFFF and contains most of the commonly used characters.

We mentioned above that U+D800 to U+DFFF are reserved for UTF-16. The high surrogates U+D800–U+DBFF and the low surrogates U+DC00–U+DFFF are used in pairs of 16-bit code units to encode non-BMP characters in UTF-16. A single surrogate unit on its own is meaningless.
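
A quick Java sketch (class name illustrative) makes this pairing visible: a non-BMP character occupies two chars, and each char is one half of a surrogate pair:

public class SurrogateDemo {
    public static void main(String[] args) {
        String smiley = "😀"; // U+1F600, outside the BMP
        System.out.println(smiley.length());                             // 2 (UTF-16 code units)
        System.out.println(smiley.codePointCount(0, smiley.length()));   // 1 (actual character)
        System.out.println(Character.isHighSurrogate(smiley.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(smiley.charAt(1)));  // true
    }
}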

UTF-8

UTF-8 uses 1 to 4 bytes to represent all 1,112,064 Unicode characters, so UTF-8 is a variable-length encoding.

UTF-8 is currently the most common encoding on the Web. Let's see how UTF-8 encodes Unicode.
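
The byte layout defined by the UTF-8 standard is:

Code point range      Byte 1     Byte 2     Byte 3     Byte 4
U+0000  - U+007F      0xxxxxxx
U+0080  - U+07FF      110xxxxx   10xxxxxx
U+0800  - U+FFFF      1110xxxx   10xxxxxx   10xxxxxx
U+10000 - U+10FFFF    11110xxx   10xxxxxx   10xxxxxx   10xxxxxx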

As the first row shows, a single byte covers the 128 ASCII characters, which is why UTF-8 is compatible with ASCII.

The next 1,920 characters require two bytes to encode, covering almost all of the remaining Latin alphabets, as well as the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana, and N'Ko alphabets, plus combining diacritical marks. Characters in the rest of the BMP require three bytes; these include almost all other commonly used characters, among them most Chinese, Japanese, and Korean characters. Characters in the other Unicode planes require four bytes, including the less common CJK characters, various historical scripts, mathematical symbols, and emoji (pictographic characters).

Here is a concrete example of UTF-8 encoding.
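
Take the euro sign '€' (code point U+20AC), which falls in the three-byte range:

U+20AC   = 0010 000010 101100   (16 bits, split 4 + 6 + 6)
pattern  = 1110xxxx 10xxxxxx 10xxxxxx
filled   = 11100010 10000010 10101100
bytes    = 0xE2     0x82     0xAC

So in Java, "€".getBytes(StandardCharsets.UTF_8) yields the three bytes E2 82 AC.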

UTF-16

UTF-16 is also a variable-length encoding. UTF-16 uses one or two 16-bit code units to represent each character.

UTF-16 is mainly used internally by Microsoft Windows, Java, and JavaScript/ECMAScript.

However, UTF-16 sees little use on the Web.

Next, let's take a look at how UTF-16 is encoded.

First: characters in the ranges U+0000 to U+D7FF and U+E000 to U+FFFF are represented directly by a single 16-bit code unit, which is very intuitive.

Next: U+010000 to U+10FFFF

The characters in this range first have 0x10000 subtracted, leaving a 20-bit value in the range 0x00000–0xFFFFF.

Then the high 10 bits (0x000–0x3FF) are added to 0xD800, giving a value in 0xD800–0xDBFF that is stored as the first 16-bit unit.

The low 10 bits (0x000–0x3FF) are added to 0xDC00, giving a value in 0xDC00–0xDFFF that is stored as the second 16-bit unit.

U' = yyyyyyyyyyxxxxxxxxxx  // U - 0x10000
W1 = 110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
W2 = 110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx

This is why 0xD800–0xDFFF are reserved for UTF-16 in Unicode.

Here is a concrete example of UTF-16 encoding.
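
Take the emoji '😀' (code point U+1F600), which lies outside the BMP:

U' = 0x1F600 - 0x10000 = 0x0F600 = 0000111101 1000000000   (20 bits)
W1 = 0xD800 + 0000111101 (0x03D) = 0xD83D
W2 = 0xDC00 + 1000000000 (0x200) = 0xDE00

So 😀 is encoded in UTF-16 as the surrogate pair D83D DE00, which is exactly the Java string literal "\uD83D\uDE00".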

UTF-32

UTF-32 is a fixed-length encoding; every character is represented by one 32-bit unit.

Because one 32-bit unit can hold any code point, UTF-32 represents Unicode characters directly. The disadvantage is that UTF-32 takes up too much space, so in practice few systems use it.
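
This directness is easy to verify in Java: the UTF-32BE bytes of a character are simply its code point, zero-padded to 32 bits. A small illustrative sketch:

import java.nio.charset.Charset;

public class Utf32Demo {
    public static void main(String[] args) {
        for (byte b : "😀".getBytes(Charset.forName("UTF-32BE"))) {
            System.out.printf("%02X ", b); // prints: 00 01 F6 00, i.e. U+1F600
        }
    }
}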

Null-terminated strings and modified UTF-8

In the C language, a string ends with the null character NUL ('\0').

So in strings of this kind, a 0x00 byte cannot appear in the middle of the data. What if we really want to store U+0000?

We can use the modified UTF-8 encoding.

In modified UTF-8, the null character (U+0000) is represented by two bytes: 11000000 10000000 (0xC0 0x80).

So modified UTF-8 can represent all Unicode characters, including the null character U+0000, without ever producing a 0x00 byte.

Generally speaking, in Java, InputStreamReader and OutputStreamWriter work with standard UTF-8, but string constants in object serialization, in DataInput and DataOutput, in JNI, and in class files are all expressed in modified UTF-8.
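
We can observe this directly with DataOutputStream.writeUTF, which writes a two-byte length followed by the modified UTF-8 bytes. A minimal sketch (class name illustrative):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws Exception {
        String s = "A\u0000B";

        // Standard UTF-8: the null character is a bare 0x00 byte
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b); // 41 00 42
        }
        System.out.println();

        // Modified UTF-8 via writeUTF: two length bytes, then U+0000 as C0 80
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        for (byte b : bos.toByteArray()) {
            System.out.printf("%02X ", b); // 00 04 41 C0 80 42
        }
    }
}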

This article is also available at http://www.flydean.com/java-string-encodings/

The most accessible explanations, the most in-depth practical content, the most concise tutorials, and many tricks you didn't know are all waiting for you to discover!

Welcome to follow my official account: "programs those things". Know technology, know you better!
