[Java IO stream] Detailed explanation of character set usage

Foreword

In the previous article on byte streams, the local files we read contained only English text, no Chinese. We also mentioned that byte streams are not recommended for reading plain text files, because garbled characters may appear. Why does this happen? After working through today's content, you should have a much clearer picture.

In a computer, all data is stored in binary form. A single binary digit is called a bit, and a byte is made up of 8 bits, so one byte can represent 2 to the 8th power (256) different values. The byte is the smallest storage unit in a computer.

English text, however, needs only one byte per character. Why? To answer that we need to learn about character sets, also called code tables, such as the ASCII character set: a table of commonly used characters in which each character corresponds to an integer value. The ASCII table contains 128 entries, which include all the English letters, so one byte is enough to store an English character.

The main text encoding standards today include ASCII, GB2312, GBK, and Unicode. ASCII is the simplest Western encoding scheme; GB2312, GBK, and GB18030 are Chinese national standards for encoding Chinese characters; Unicode is the international standard covering characters worldwide.

ASCII

Computers store data in binary form: letters such as a, b, c, digits such as 1, 2, 3, and symbols such as +, -, * are all stored as binary numbers. But originally there was no uniform standard for which binary number represents which character, so a standard had to be specified; this is how the ASCII character set was born. ASCII was defined by an American standards organization and contains 128 entries.

For the full list, consult an ASCII code table. The most commonly used ranges are: 48 to 57 for the ten Arabic numerals, 65 to 90 for the 26 uppercase English letters, and 97 to 122 for the 26 lowercase English letters.

The value looked up in the character set is not stored in the computer directly; this is where encoding and decoding come in. Encoding is the process of transforming the value looked up in the character set, according to certain rules, into the binary data actually stored on the computer's hard disk. Decoding is the reverse: transforming the stored binary data, according to the same rules, back into a value in the character set. For ASCII, encoding simply pads the binary value with leading zeros to fill one byte, and decoding converts that byte directly back to decimal.

For example, the character 'a' has the value 97 in the ASCII table. Its binary form is 1100001; encoding pads it to the byte 01100001, which is what the computer actually stores.
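This rule can be checked in Java. A minimal sketch, using US_ASCII explicitly so the platform default charset does not interfere (the class name is ours, not from the article):

```java
import java.nio.charset.StandardCharsets;

public class AsciiEncodeDemo {
    public static void main(String[] args) {
        // Encode 'a' with ASCII: the stored byte equals the table value 97
        byte[] bytes = "a".getBytes(StandardCharsets.US_ASCII);
        // Pad the binary string with leading zeros to a full byte
        String bits = String.format("%8s", Integer.toBinaryString(bytes[0]))
                            .replace(' ', '0');
        System.out.println(bytes[0] + " -> " + bits); // 97 -> 01100001
    }
}
```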
The ASCII table only contains characters commonly used in Western countries and includes no Chinese characters. China released the GB2312 character set (a national standard) in 1980, which covers commonly used graphic characters and Simplified Chinese, and Taiwan released the Big5 character set for Traditional Chinese. As computing developed, the Unicode character set appeared, which assigns a unique code to the characters of every language in the world, including Chinese, Japanese, and Korean.

The Simplified Chinese editions of Windows use the GBK character set by default; different localized editions of Windows use different default character sets, which Windows collectively refers to as ANSI.

GBK

We will focus on how English and Chinese characters are stored in the computer under these character sets.

In GBK, English characters are still stored in one byte, because GBK is fully compatible with the ASCII character set: for ASCII characters, the encoding rules are identical.

In the GBK character set, a Chinese character is stored in two bytes, since one byte can hold at most 256 values, which is obviously not enough. Of the two bytes, the left 8 bits are called the high-order byte and the right 8 bits the low-order byte. The first bit of the high-order byte is always 1, so interpreted as a signed number the high-order byte is always negative. This is how Chinese characters are distinguished from English: when reading data, if the computer encounters a byte starting with 1, it knows a Chinese character follows and decodes the next two bytes together as one Chinese character.
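These rules can be observed directly in Java. A small sketch (the GBK charset ships with the standard JDK; the class name is ours):

```java
public class GbkStorageDemo {
    public static void main(String[] args) throws Exception {
        // English: 1 byte, identical to the ASCII value
        byte[] en = "a".getBytes("GBK");
        // Chinese: 2 bytes, and the high-order byte is negative
        byte[] zh = "汉".getBytes("GBK");
        System.out.println(en.length + " byte(s), value " + en[0]);
        System.out.println(zh.length + " byte(s), high-order byte " + zh[0]);
    }
}
```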

Unicode

Different countries using different character sets is obviously bad for software development. To unify the rules, the Unicode character set was created. It is maintained by the Unicode Consortium, whose members include software developers and computer manufacturers from around the world.

Likewise, Unicode is fully compatible with the ASCII character set, but it has several different encoding rules: for example UTF-16 (2 or 4 bytes per character), UTF-32 (a fixed 4 bytes per character), and UTF-8. UTF-8 is the encoding most commonly used in practice; it is a variable-width encoding that uses 1 to 4 bytes per character.
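The different widths are easy to see from Java. A sketch (UTF_16BE is used so the byte-order mark that the plain UTF-16 encoder prepends does not obscure the count; the class name is ours):

```java
import java.nio.charset.StandardCharsets;

public class UnicodeWidthDemo {
    public static void main(String[] args) {
        String zh = "汉";
        // UTF-8 is variable-width: common CJK characters take 3 bytes
        System.out.println("UTF-8    : " + zh.getBytes(StandardCharsets.UTF_8).length);
        // UTF-16 (big-endian, no BOM): 2 bytes for characters in the BMP
        System.out.println("UTF-16BE : " + zh.getBytes(StandardCharsets.UTF_16BE).length);
        // ASCII characters stay at 1 byte in UTF-8
        System.out.println("UTF-8 'a': " + "a".getBytes(StandardCharsets.UTF_8).length);
    }
}
```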

Under the UTF-8 encoding rules, characters from the ASCII character set are stored in 1 byte, while Chinese, Japanese, and Korean characters are stored in 3 bytes. The bit patterns are:

  1. 1 byte:  0xxxxxxx
  2. 2 bytes: 110xxxxx 10xxxxxx
  3. 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  4. 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, the character "汉" has the value 27721 (U+6C49) in the Unicode character set. Its binary form is 0110110001001001; filling these 16 bits into the 3-byte pattern gives its UTF-8 encoding:

11100110 10110001 10001001 (hex E6 B1 89)
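We can confirm this result in Java. A quick check (each byte is masked with 0xFF so its unsigned hex value is printed; the class name is ours):

```java
import java.nio.charset.StandardCharsets;

public class Utf8HanDemo {
    public static void main(String[] args) {
        byte[] bytes = "汉".getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            // b & 0xFF converts the signed byte to its unsigned value
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim()); // E6 B1 89
    }
}
```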

Why garbled characters appear

Garbled characters in a Java program are usually caused by one of the following problems:

  1. A multi-byte character was only partially read
  2. The encoding and decoding character sets are inconsistent

We know that a byte stream reads only one byte at a time by default, while UTF-8 stores one Chinese character in three bytes. A single read may therefore return only part of a character, which is why garbled output appears.
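Problem 1 can be reproduced with an in-memory stream. A sketch (ByteArrayInputStream stands in for a file so the example is self-contained; the class name is ours):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class PartialReadDemo {
    public static void main(String[] args) throws Exception {
        // "汉" is 3 bytes in UTF-8
        byte[] data = "汉".getBytes(StandardCharsets.UTF_8);
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            sb.append((char) b); // each byte misread as a whole character
        }
        // Three garbage characters instead of one "汉"
        System.out.println(sb.length() + " chars, equals 汉? " + sb.toString().equals("汉"));
    }
}
```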

Garbled characters also appear when the encoding and decoding character sets differ. For example, if a Chinese character is encoded with UTF-8, the computer stores it in 3 bytes; if those 3 bytes are then decoded with GBK, garbled characters appear.
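Problem 2 can be demonstrated the same way. A sketch (the bytes are encoded with UTF-8 and deliberately decoded with GBK; the class name is ours):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchDemo {
    public static void main(String[] args) {
        String original = "你好";
        // Encode with UTF-8: 3 bytes per Chinese character -> 6 bytes
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);
        // Decode those bytes with GBK: it pairs bytes differently -> mojibake
        String decoded = new String(bytes, Charset.forName("GBK"));
        System.out.println(bytes.length + " bytes -> \"" + decoded + "\"");
        System.out.println(decoded.equals(original)); // false
    }
}
```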

Therefore, to avoid garbled characters, we should not read text files with a raw byte stream, and we must make sure encoding and decoding use the same character set. Looking back at the question in the foreword, the answer should now be easy to understand.

In Java, encoding and decoding are done with the following methods:

import java.util.Arrays;

public class Test {

    public static void main(String[] args) {
        /*
         * Basic use of encoding and decoding
         */
        String s = "Java你好";
        byte[] bytes = s.getBytes();       // encode with the default charset
        System.out.println(Arrays.toString(bytes));

        String s2 = new String(bytes);     // decode with the same default charset
        System.out.println(s2);
    }
}

Here Java uses the JVM's default charset (often configured by the IDE) for encoding. We can also use the overloads of getBytes() and the String constructor to specify a charset explicitly. When a charset is specified by name, a checked UnsupportedEncodingException may be thrown; we can simply declare it. If the encoding and decoding charsets are inconsistent, garbled characters appear.

Example:

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class Test {

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode with GBK: each Chinese character becomes 2 bytes
        String s = "Java你好";
        byte[] bytes = s.getBytes("GBK");
        System.out.println(Arrays.toString(bytes));

        // Decode with the default charset (here UTF-8) -> garbled output
        String s2 = new String(bytes);
        System.out.println(s2);
    }
}

We encoded the string with GBK, but decoded it with the default UTF-8, so garbled characters appear on the console.


Origin blog.csdn.net/zhangxia_/article/details/128736699