Encoding analysis of getBytes() and new String() in Java

1. getBytes() method

String's getBytes() method: returns a byte array encoded with the platform's default charset. Because that default can differ between operating systems and JVM configurations, the same string may produce different bytes on different machines!

String.getBytes(String decode) method: returns the byte array of the string encoded with the charset named by decode.
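To make the difference between the two overloads concrete, here is a minimal sketch. The class name DefaultCharsetDemo and the helper encodeUtf8 are hypothetical names chosen for illustration, and the escape "\u4e2d" stands for "中" so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {

    // Hypothetical helper: encodes with an explicit charset,
    // so the result does not depend on the platform default.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // the character "中"

        // The charset that the no-argument getBytes() uses on this JVM.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // This length varies with the default charset
        // (for example 2 under GBK, 3 under UTF-8).
        System.out.println("Length with default charset: " + s.getBytes().length);

        // This length is 3 regardless of the platform default.
        System.out.println("Length with UTF-8: " + encodeUtf8(s).length);
    }
}
```

Passing a charset explicitly is usually the safer choice when the bytes leave the current machine, for example when they are written to a file or sent over a network.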

Next, let's test it with code:

Get the byte array of the Chinese character "中" in GBK, UTF-8 and ISO8859-1 encoding respectively, and check the length of each array to verify the difference between encodings.

public class Test {

    public static void main(String[] args) throws Exception {
        // Encode the same character with three different charsets
        byte[] b_gbk = "中".getBytes("GBK");
        byte[] b_utf8 = "中".getBytes("UTF-8");
        byte[] b_iso88591 = "中".getBytes("ISO8859-1");

        // Print how many bytes each encoding produced
        System.out.println("Length of GBK encoding: " + b_gbk.length);
        System.out.println("Length of UTF-8 encoding: " + b_utf8.length);
        System.out.println("Length of ISO8859-1 encoding: " + b_iso88591.length);
    }

}

Output:

Length of GBK encoding: 2
Length of UTF-8 encoding: 3
Length of ISO8859-1 encoding: 1
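The lengths alone do not show what the bytes are. The following sketch prints them in hexadecimal; HexDump, toHex and encodeToHex are hypothetical names used only for this illustration, and it assumes the runtime ships the GBK charset, as the test above already does.

```java
import java.nio.charset.Charset;

public class HexDump {

    // Formats each byte as two uppercase hex digits separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes s with the named charset and returns the bytes as hex text.
    static String encodeToHex(String s, String charsetName) {
        return toHex(s.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // the character "中"
        System.out.println("GBK:        " + encodeToHex(s, "GBK"));        // D6 D0
        System.out.println("UTF-8:      " + encodeToHex(s, "UTF-8"));      // E4 B8 AD
        System.out.println("ISO-8859-1: " + encodeToHex(s, "ISO-8859-1")); // 3F
    }
}
```

The single ISO8859-1 byte is 3F, the code for "?". It is a substitute for a character the charset cannot represent, not an encoding of "中".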

2. new String(byte[], decode) method

The new String(byte[], decode) constructor can restore the byte array obtained from the "中" character through getBytes back into a string.

What new String(byte[], decode) does: it decodes byte[] into a string using the charset named by decode.

Modify the test code as follows:
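Decoding only restores the text when it uses the same charset that produced the bytes. The sketch below, with the hypothetical names MismatchDemo and recode, encodes with one charset and decodes with another to show what happens on a mismatch. It assumes the GBK charset is available in the runtime.

```java
import java.nio.charset.Charset;

public class MismatchDemo {

    // Encodes s with one charset, then decodes the bytes with another.
    static String recode(String s, String encodeWith, String decodeWith) {
        byte[] bytes = s.getBytes(Charset.forName(encodeWith));
        return new String(bytes, Charset.forName(decodeWith));
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // the character "中"

        // Same charset on both sides: the text is restored.
        System.out.println(recode(s, "GBK", "GBK").equals(s)); // true

        // GBK bytes are not valid UTF-8, so the decoder substitutes
        // the replacement character U+FFFD for the malformed input.
        String wrong = recode(s, "GBK", "UTF-8");
        System.out.println(wrong.equals(s));               // false
        System.out.println(wrong.indexOf('\uFFFD') >= 0);  // true
    }
}
```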

public class Test {

    public static void main(String[] args) throws Exception {
        // Encode the character with three different charsets
        byte[] b_gbk = "中".getBytes("GBK");
        byte[] b_utf8 = "中".getBytes("UTF-8");
        byte[] b_iso88591 = "中".getBytes("ISO8859-1");

        // Decode each byte array with the charset that produced it
        String s_gbk = new String(b_gbk, "GBK");
        String s_utf8 = new String(b_utf8, "UTF-8");
        String s_iso88591 = new String(b_iso88591, "ISO8859-1");

        System.out.println("GBK encoded string: " + s_gbk);
        System.out.println("UTF-8 encoded string: " + s_utf8);
        System.out.println("ISO8859-1 encoded string: " + s_iso88591);
    }

}

Output:

GBK encoded string: 中
UTF-8 encoded string: 中
ISO8859-1 encoded string: ?

The results show that s_gbk and s_utf8 restore the "中" character, while s_iso88591 is "?" (in effect, garbled text).

Why can't the character "中" be restored after encoding it with ISO8859-1 and decoding it again?

That's because the ISO8859-1 code table does not contain the character "中", so "中".getBytes("ISO8859-1") cannot produce a correct encoded value for it and substitutes the byte for "?" instead. Once the original value is lost, new String() cannot restore it.

Therefore, when obtaining byte[] through the String.getBytes(String decode) method, make sure that every character in the String exists in the code table of decode. Only then can the resulting byte[] be restored correctly.
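One way to perform that check before encoding is CharsetEncoder.canEncode. This is a minimal sketch rather than the only approach; CanEncodeDemo and its canEncode helper are hypothetical names.

```java
import java.nio.charset.Charset;

public class CanEncodeDemo {

    // Returns true if every character of s can be represented in the named charset.
    static boolean canEncode(String s, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // the character "中"
        System.out.println("UTF-8:      " + canEncode(s, "UTF-8"));      // true
        System.out.println("ISO-8859-1: " + canEncode(s, "ISO-8859-1")); // false
    }
}
```

A new encoder is created on each call because CharsetEncoder instances are not safe for concurrent use.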

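So how can Chinese text be carried through ISO8859-1? Encode it to UTF-8 bytes first and wrap those bytes in an ISO8859-1 string. Every byte value maps to an ISO8859-1 character, so nothing is lost, and the original text can be recovered by reversing the two steps.

// Wrap the UTF-8 bytes of "中" in an ISO8859-1 string
String s_iso88591 = new String("中".getBytes("UTF-8"), "ISO8859-1");

// Restore: take the ISO8859-1 bytes back and decode them as UTF-8 to get "中"
String s_utf8 = new String(s_iso88591.getBytes("ISO8859-1"), "UTF-8");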
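The round trip can be checked end to end with a small self-contained program. Latin1Carrier, wrap and unwrap are hypothetical names for this sketch. The intermediate string is not readable Chinese, only a container with one character per UTF-8 byte.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Carrier {

    // Stores the UTF-8 bytes of s as ISO-8859-1 characters, one character per byte.
    static String wrap(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Reverses wrap(): recovers the bytes and decodes them as UTF-8.
    static String unwrap(String carrier) {
        return new String(carrier.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d"; // the character "中"

        String carrier = wrap(original);
        // Three UTF-8 bytes become three ISO-8859-1 characters.
        System.out.println(carrier.length()); // 3

        String restored = unwrap(carrier);
        System.out.println(restored.equals(original)); // true
    }
}
```

This works because ISO8859-1 assigns a character to every byte value from 0x00 to 0xFF, so decoding and re-encoding with it never changes the bytes.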
