Chinese encoding problem in Java Web (Part 1)

Because Java is a cross-platform language, text is frequently converted between the encodings used on different platforms, so encoding problems come up often in Java.

Why encode?

★ The smallest unit of storage in a computer is a byte, that is, 8 bits, so a single byte can represent at most 256 distinct values (0~255).

★ People need to express far more characters and symbols than that, so one byte cannot cover them all. Some kind of "splitting" or "translation" is therefore needed before the computer can handle our languages.

To resolve this mismatch, a separate data type, char, is needed to represent characters, and converting from char to byte requires encoding.
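The char-to-byte step can be sketched in Java as follows. This is only an illustration: the class name EncodeDemo and the helper utf8Hex are made up for this example, and UTF-8 is chosen arbitrarily as the target encoding. The Chinese character is written as a Unicode escape so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // Hypothetical helper: encode text as UTF-8 and render the bytes as upper-case hex.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Java bytes are signed, so mask with 0xFF to get the 0~255 value.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("A"));       // 41
        System.out.println(utf8Hex("\u4E2D"));  // E4B8AD (the character 中)
    }
}
```

One char goes in, and one or more bytes come out, depending on the encoding chosen.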


Several common encoding formats:

1.ASCII code

ASCII defines 128 codes in total, represented by the lower 7 bits of one byte. 0~31 and 127 are control characters such as line feed, carriage return, and delete (127); 32~126 are printable characters that can be typed on a keyboard and displayed.

2.ISO-8859-1

ISO-8859-1 belongs to the ISO-8859 family of extended ASCII encodings (ISO-8859-1 to ISO-8859-15). It covers most Western European languages, so it is the most widely used of them. ISO-8859-1 is a single-byte encoding that is backward compatible with ASCII. Its range is 0x00-0xFF: 0x00-0x7F is identical to ASCII, 0x80-0x9F are control characters, and 0xA0-0xFF are text symbols. It can represent 256 characters in total.
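Because ISO-8859-1 has only 256 code points, Chinese characters cannot survive a trip through it. The sketch below illustrates this; the class Latin1Demo and the method roundTrip are names invented for this example. It relies on the documented behavior of String.getBytes, which substitutes the charset's replacement byte (a question mark for ISO-8859-1) for unmappable characters.

```java
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    // Hypothetical helper: encode to ISO-8859-1 bytes, then decode them again.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is inside the ISO-8859-1 range and survives.
        System.out.println(roundTrip("\u00E9").equals("\u00E9")); // true
        // U+4E2D (中) is outside the range and is replaced during encoding.
        System.out.println(roundTrip("\u4E2D")); // ?
    }
}
```

Once the character has been replaced, the original text cannot be recovered from the bytes.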

3.GB2312

"Chinese Coded Character Set for Information Interchange, Basic Set". It is a double-byte encoding with the high byte in the range A1~F7 and the low byte in the range A1~FE. High bytes A1~A9 are the symbol area, with 682 symbols in total; B0~F7 are the Chinese character area, containing 6763 Chinese characters.

4.GBK

"Chinese Characters Internal Code Extension Specification". The encoding range is 8140~FEFE (excluding xx7F), giving 23940 code points in total, which represent 21003 Chinese characters. It is compatible with GB2312, that is, Chinese characters encoded with GB2312 can be decoded with GBK without garbled characters.
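The compatibility claim can be checked with a small sketch. The class GbkCompatDemo and its method are hypothetical names, and the example assumes the running JDK ships the GB2312 and GBK charsets (they are part of the extended charsets in standard JDK distributions, but a trimmed runtime might lack them).

```java
import java.nio.charset.Charset;

class GbkCompatDemo {
    // Hypothetical helper: encode with GB2312, then decode the same bytes with GBK.
    static String encodeGb2312DecodeGbk(String text) {
        byte[] bytes = text.getBytes(Charset.forName("GB2312"));
        return new String(bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587"; // the two characters 中文
        // GBK is a superset of GB2312, so the text comes back unchanged.
        System.out.println(text.equals(encodeGb2312DecodeGbk(text))); // true
    }
}
```

The reverse direction is not guaranteed: a character that exists only in GBK has no GB2312 encoding.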

5.GB18030

"Information Technology Chinese Coded Character Set", a mandatory national standard in China. It is compatible with GB2312 but is not widely used in practice.

6.UNICODE character set

The Unicode (Universal Code) character set has several encoding forms, namely UTF-8, UTF-16, and UTF-32. Unicode is the basis of Java and XML. How Unicode is stored in the computer is described below.
UTF-16 represents Unicode characters with 16-bit code units, hence the name. Every character in the Basic Multilingual Plane, which includes almost all commonly used Chinese characters, takes exactly one unit (two bytes); the less common supplementary characters take two units (four bytes), known as a surrogate pair.
Because most characters occupy exactly two bytes, string operations are greatly simplified, so Java uses UTF-16 as the in-memory storage format for characters: a Java char is one 16-bit UTF-16 code unit.
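A short sketch of how this shows up in Java code. The class name Utf16Demo and its helpers are made up for this example; U+20000 is used as a sample character outside the Basic Multilingual Plane.

```java
class Utf16Demo {
    // Number of 16-bit UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting a surrogate pair as one.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String common = "\u4E2D";                              // 中, one code unit
        String rare = new String(Character.toChars(0x20000));  // U+20000, a surrogate pair
        System.out.println(charCount(common));      // 1
        System.out.println(charCount(rare));        // 2
        System.out.println(codePointCount(rare));   // 1
    }
}
```

For text made of common characters, length() and the number of characters agree, which is why the two-bytes-per-character view usually works in practice.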
UTF-16 is simple and convenient, but many characters that could be represented with one byte also take two. This doubles the storage space and increases network traffic unnecessarily. UTF-8 therefore uses a variable-length scheme: different ranges of code points have different encoded lengths, and a character takes 1 to 4 bytes (the original design allowed up to 6, but the current standard limits it to 4).
UTF-8 has the following encoding rules:
●If a byte's highest bit (the eighth bit) is 0, it is a single-byte ASCII character (00~7F), so ASCII text is already valid UTF-8;
●If a byte starts with 11, it is the first byte of a multi-byte character, and the number of consecutive leading 1s gives the total number of bytes in that character. For example, 110xxxxx is the first byte of a two-byte UTF-8 character;
●If a byte starts with 10, it is not the first byte of a character, and you need to search backward in the stream to find the first byte of the current character.
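The three rules above can be sketched as a bit-pattern check on a single byte. The class Utf8LeadByte and the method sequenceLength are hypothetical names, and the return convention (0 for a continuation byte, -1 for an unrecognized pattern) is an assumption made for this example. The method only classifies the leading bits; it does not validate a whole UTF-8 sequence.

```java
class Utf8LeadByte {
    // Returns the character length announced by this byte:
    // 1 to 4 for a first byte, 0 for a continuation byte, -1 for any other pattern.
    static int sequenceLength(byte b) {
        int v = b & 0xFF;                 // treat the signed Java byte as 0~255
        if ((v & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
        if ((v & 0xC0) == 0x80) return 0; // 10xxxxxx: not a first byte
        if ((v & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((v & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((v & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        // The character 中 (U+4E2D) is E4 B8 AD in UTF-8.
        System.out.println(sequenceLength((byte) 0xE4)); // 3
        System.out.println(sequenceLength((byte) 0xB8)); // 0
        System.out.println(sequenceLength((byte) 0x41)); // 1
    }
}
```

A decoder that lands in the middle of a stream can skip bytes for which this returns 0 until it reaches the next first byte.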



Java uses UTF-16 for in-memory encoding, but UTF-16 is not well suited to network transmission: transmission can corrupt the byte stream, and once a UTF-16 byte stream is damaged or loses a byte, it is hard to find the character boundaries again.
UTF-8 stores ASCII characters in a single byte, and because first bytes and continuation bytes are distinguishable, damage to one character does not affect the characters after it. Its encoding efficiency lies between GBK and UTF-16. UTF-8 thus strikes a balance between encoding efficiency and encoding safety, making it an ideal encoding for Chinese text.
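The efficiency comparison can be made concrete by counting bytes for a mixed Chinese and ASCII string. This is a rough sketch: the class ByteLengthDemo and the method byteLength are invented names, the sample string is arbitrary, and the GBK charset is assumed to be available in the running JDK. UTF-16BE is used rather than UTF-16 because Java's UTF-16 charset writes a 2-byte byte-order mark when encoding, which would distort the count.

```java
import java.nio.charset.Charset;

class ByteLengthDemo {
    // Hypothetical helper: number of bytes the text occupies in the named charset.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587abc"; // two Chinese characters plus three ASCII letters
        System.out.println(byteLength(text, "GBK"));      // 7  (2*2 + 3*1)
        System.out.println(byteLength(text, "UTF-8"));    // 9  (2*3 + 3*1)
        System.out.println(byteLength(text, "UTF-16BE")); // 10 (5*2)
    }
}
```

The exact ranking depends on the mix of characters: for purely Chinese text, UTF-8 needs three bytes per character and is larger than both GBK and UTF-16.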

Origin blog.csdn.net/liushulin183/article/details/50165775