Analysis of Java character encoding

This article grew out of a single question: can Java's char type store a Chinese character?

UTF-8 encoding

UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one implementation of Unicode. The defining feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, and the byte length varies with the symbol.
The UTF-8 encoding rules are simple. There are only two:

1. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For the English alphabet, therefore, UTF-8 and ASCII are identical.
2. For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode code point.
The following table summarizes the encoding rules. The letter x marks the bits available for encoding.

Unicode symbol range (hex)    UTF-8 encoding (binary)
0000 0000-0000 007F           0xxxxxxx
0000 0080-0000 07FF           110xxxxx 10xxxxxx
0000 0800-0000 FFFF           1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

With this table, reading UTF-8 is straightforward. If the first bit of a byte is 0, that byte is a complete single-byte character. If the first bit is 1, the number of consecutive leading 1s tells you how many bytes the current character occupies.
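The reading rule can be sketched in a few lines of Java. This is only an illustration: the class name Utf8LeadByte and the method sequenceLength are made up for this example, and the method looks only at the bit patterns in the table without performing the extra validity checks a real decoder needs.

```java
public class Utf8LeadByte {
    /**
     * Returns the number of bytes in the UTF-8 sequence that starts with
     * this byte, judged only by its leading bits, or -1 if the byte cannot
     * start a sequence (hypothetical helper for illustration).
     */
    static int sequenceLength(byte first) {
        int b = first & 0xFF;              // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;                         // 10xxxxxx (continuation byte) or invalid
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x49)); // 1
        System.out.println(sequenceLength((byte) 0xE4)); // 3
        System.out.println(sequenceLength((byte) 0xB8)); // -1
    }
}
```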
Below, the Chinese character 严 is used as an example to show how UTF-8 encoding is carried out.
The Unicode code point of 严 is 4E25 (100111000100101 in binary). According to the table, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of 严, fill in the x positions of the format from back to front, and pad any leftover x positions with 0. The resulting UTF-8 encoding of 严 is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
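To check the hand calculation for U+4E25, here is a minimal sketch that fills the x bits of each template with shifts and masks and compares the result with what the JDK produces. The class Utf8ManualEncode and its methods encode and hex are hypothetical names for this example. The sketch assumes a valid code point and does not reject surrogates or out-of-range values.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8ManualEncode {
    /** Encodes one code point by filling the x bits of the table's templates. */
    static byte[] encode(int cp) {
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };                 // 0xxxxxxx
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),                   // 110xxxxx
                (byte) (0x80 | (cp & 0x3F)) };               // 10xxxxxx
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),                  // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),          // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F)) };               // 10xxxxxx
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),                  // 11110xxx
                (byte) (0x80 | ((cp >> 12) & 0x3F)),         // 10xxxxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),          // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F)) };               // 10xxxxxx
        }
    }

    /** Formats bytes as upper-case hex, two digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4E25);
        byte[] library = "\u4E25".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                    // E4B8A5
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```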

UTF-16 encoding

UTF-16 is another implementation of Unicode. UTF stands for Unicode Transformation Format, that is, a format into which Unicode is transformed. Compared with UTF-8, the main benefit of UTF-16 is that most characters are stored in a fixed length of two bytes, but UTF-16 is not compatible with ASCII. UTF-16 can be stored in either big-endian or little-endian form. To make the byte order of a UTF-16 file explicit, the character U+FEFF is placed at the start of the file as a Byte Order Mark (FF FE for UTF-16LE, FE FF for UTF-16BE), which also signals that the text file is UTF-16 encoded. As a Unicode character, U+FEFF is ZERO WIDTH NO-BREAK SPACE: as the name implies, a space with no width that does not allow a line break.
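The byte order mark is easy to observe from Java. In the sketch below, the class Utf16Bom and the helper hex are hypothetical names for this example. It relies on the documented behaviour of the JDK's standard charsets: when encoding, UTF-16 writes a big-endian byte order mark, while UTF-16BE and UTF-16LE write none.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Bom {
    /** Formats bytes as space-separated, two-digit upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A"; // U+0041
        // Generic UTF-16: byte order mark FE FF, then big-endian code units.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // Explicit byte order: no byte order mark is written.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```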

Worked examples

Example 1

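import java.nio.charset.StandardCharsets;

public class Example1 {
    public static void main(String[] args) {
        String s = "I'm 李博玉";

        // Java's UTF-16 charset writes a byte order mark, then big-endian code units.
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16);
        for (byte b : bytes) {
            System.out.printf("%X ", b);
        }
        System.out.println();              // end the line of hex values
        System.out.println(bytes.length);

        // UTF-8: one byte per ASCII character, three per Chinese character here.
        bytes = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("%X ", b);
        }
        System.out.println();
        System.out.println(bytes.length);
    }
}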

What does this print?
FE FF 0 49 0 27 0 6D 0 20 67 4E 53 5A 73 89
16
49 27 6D 20 E6 9D 8E E5 8D 9A E7 8E 89
13

1. Why is the UTF-16 length 16?
For most characters, UTF-16 uses two bytes. However, UTF-16 supports both byte orders, so two extra bytes are needed at the very start to state which one is in use. FE FF indicates big-endian storage.
7 x 2 + 2 = 16
2. Why is the UTF-8 length 13?
UTF-8 is fully compatible with ASCII, so each English character takes one byte. Most Chinese characters take three bytes. Four-byte characters are uncommon.
4 + 3 x 3 = 13
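Both sums can be checked character by character. This is a small sketch, with ByteCounts and utf8Sizes as hypothetical names. It writes the string from Example 1 with Unicode escapes (U+674E, U+535A, U+7389) so that the source file stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteCounts {
    /** Returns the UTF-8 byte count of each code point in s, in order. */
    static int[] utf8Sizes(String s) {
        return s.codePoints()
                .map(cp -> new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8).length)
                .toArray();
    }

    public static void main(String[] args) {
        // The string from Example 1: four ASCII characters, three Chinese characters.
        String s = "I'm \u674E\u535A\u7389";

        int[] sizes = utf8Sizes(s);
        System.out.println(Arrays.toString(sizes));     // [1, 1, 1, 1, 3, 3, 3]
        System.out.println(Arrays.stream(sizes).sum()); // 13

        // UTF-16 with a byte order mark: 2 bytes per char, plus 2 bytes for FE FF.
        System.out.println(s.length() * 2 + 2);                         // 16
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length); // 16
    }
}
```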

Example 2

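String s1 = "李";
String s2 = "…"; // in the original, a single character that needs four bytes in UTF-16; the character itself is missing here
System.out.println(s1.length());
System.out.println(s2.length());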

What does this print?
1
2
Does this result leave you baffled? What is really going on?

1. The first thing to understand is what length() means. A String behaves as a sequence of char values (before Java 9 it was literally backed by a char array), and length() returns the number of chars in that sequence. A char is one UTF-16 code unit. 李 is a common character that UTF-16 encodes in two bytes, so it fits in one char and the length is 1. The character in s2 takes four bytes in UTF-16, so it is stored as two chars and the length is 2.
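The difference between chars and characters can be reproduced with any character outside the Basic Multilingual Plane. The sketch below uses U+20BB7, a CJK ideograph, purely as a stand-in. It is an assumption for illustration and not necessarily the character used in the original Example 2. CodePoints and fromCodePoint are hypothetical names.

```java
public class CodePoints {
    /** Builds a one-character String from a Unicode code point. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s1 = fromCodePoint(0x674E);  // U+674E, inside the BMP
        String s2 = fromCodePoint(0x20BB7); // U+20BB7, outside the BMP (stand-in)

        System.out.println(s1.length());                       // 1
        System.out.println(s2.length());                       // 2
        // Counting code points instead of chars gives one character in both cases.
        System.out.println(s1.codePointCount(0, s1.length())); // 1
        System.out.println(s2.codePointCount(0, s2.length())); // 1

        // The two chars of s2 are a surrogate pair, not characters of their own.
        System.out.println(Character.isHighSurrogate(s2.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s2.charAt(1)));  // true

        // A single char can hold U+674E, but not U+20BB7.
        char c = '\u674E';
        System.out.println(Character.charCount(c));       // 1
        System.out.println(Character.charCount(0x20BB7)); // 2
    }
}
```

So, on the opening question: a char can store a Chinese character as long as that character lies in the Basic Multilingual Plane, which covers the commonly used ones. Characters beyond it need two chars.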

Original: Big Box  java character encoding Analysis


Origin www.cnblogs.com/chinatrump/p/11597139.html