Java Security Coding Guide: Strings and Encoding

Introduction

Strings are the most commonly used Java type in everyday coding. The world's languages differ, and even within Unicode there are several encoding forms, such as UTF-8, UTF-16, and UTF-32, which serialize the same code points in different ways.

What problems can we run into when working with character and string encodings? Let's take a look.

Use variable-length encoded incomplete characters to create strings

In Java, the underlying storage of String is a char[] encoded in UTF-16.

Note that since JDK 9, the underlying storage of String has been changed to byte[] (compact strings).

StringBuilder and StringBuffer still use char[].

So when we use InputStreamReader, OutputStreamWriter, and the String class to read, write, and construct strings, conversion between UTF-16 and other encodings is involved.

Let's take a look at the problems that may be encountered when converting from UTF-8 to UTF-16.

First look at the UTF-8 encoding:

UTF-8 uses 1 to 4 bytes to represent a character, while UTF-16 uses 2 or 4 bytes.
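To make the byte counts concrete, here is a small sketch (the class name is made up for illustration) that compares the encoded sizes of a few characters; UTF-16BE is used so that no byte-order mark is added to the count:

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    public static void main(String[] args) {
        // 'A' (U+0041), '中' (U+4E2D) and the emoji U+1F600
        String[] samples = {"A", "中", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                    + " UTF-8: " + s.getBytes(StandardCharsets.UTF_8).length + " bytes,"
                    + " UTF-16: " + s.getBytes(StandardCharsets.UTF_16BE).length + " bytes");
        }
        // Prints 1/2, 3/2 and 4/4 bytes respectively
    }
}
```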

What problems might arise in the conversion?

    public String readByteWrong(InputStream inputStream) throws IOException {
        byte[] data = new byte[1024];
        int offset = 0;
        int bytesRead = 0;
        String str = "";
        while ((bytesRead = inputStream.read(data, offset, data.length - offset)) != -1) {
            str += new String(data, offset, bytesRead, "UTF-8");
            offset += bytesRead;
            if (offset >= data.length) {
                throw new IOException("Too much input");
            }
        }
        return str;
    }

In the code above, we read bytes from the stream and convert them to a String after every read. But UTF-8 is a variable-length encoding: if a read ends in the middle of a multi-byte UTF-8 sequence, the constructed String will be wrong.
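The failure mode can be reproduced directly (the class name is illustrative): decoding only the first two bytes of a three-byte UTF-8 character produces the U+FFFD replacement character instead of the original character.

```java
import java.nio.charset.StandardCharsets;

public class SplitUtf8Demo {
    public static void main(String[] args) {
        // '中' takes 3 bytes in UTF-8: E4 B8 AD
        byte[] utf8 = "中".getBytes(StandardCharsets.UTF_8);
        // Simulate a read that stopped after the first 2 bytes
        String partial = new String(utf8, 0, 2, StandardCharsets.UTF_8);
        System.out.println(partial.equals("中"));            // false
        System.out.println(partial.indexOf('\uFFFD') >= 0);  // true: replacement character
    }
}
```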

We need to do the following:

    public String readByteCorrect(InputStream inputStream) throws IOException {
        Reader r = new InputStreamReader(inputStream, "UTF-8");
        char[] data = new char[1024];
        int offset = 0;
        int charRead = 0;
        String str = "";
        while ((charRead = r.read(data, offset, data.length - offset)) != -1) {
            str += new String(data, offset, charRead);
            offset += charRead;
            if (offset >= data.length) {
                throw new IOException("Too much input");
            }
        }
        return str;
    }

Here we use an InputStreamReader, which automatically converts the bytes it reads into chars, that is, from UTF-8 to UTF-16, buffering any incomplete byte sequence until the rest of it arrives. So the problem above does not occur.

char cannot represent all Unicode

Because char is UTF-16 encoded, characters in the ranges U+0000 to U+D7FF and U+E000 to U+FFFF can be represented directly by a single char.

But characters from U+10000 to U+10FFFF are represented by a surrogate pair: two chars, the first in the range 0xD800-0xDBFF and the second in 0xDC00-0xDFFF.

Only the combination of the two chars is meaningful; a single surrogate char on its own is not a valid character.
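For example (the emoji U+1F600 is just an arbitrary supplementary character):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, one emoji, one code point
        System.out.println(s.length());                          // 2 chars
        System.out.println(s.codePointCount(0, s.length()));     // 1 code point
        System.out.println(Character.isSurrogate(s.charAt(0)));  // true
    }
}
```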

Consider the following substring method. Its intent is to find the position of the first non-letter in the input string and return the substring starting there.

    public static String subStringWrong(String string) {
        char ch;
        int i;
        for (i = 0; i < string.length(); i += 1) {
            ch = string.charAt(i);
            if (!Character.isLetter(ch)) {
                break;
            }
        }
        return string.substring(i);
    }

In the example above, we take the chars out of the string one by one and test them. If the string contains a character in the range U+10000 to U+10FFFF, we end up testing a lone surrogate char, which is never a letter, so the method wrongly treats a letter as a non-letter and breaks out too early.

We can modify it like this:

    public static String subStringCorrect(String string) {
        int ch;
        int i;
        for (i = 0; i < string.length(); i += Character.charCount(ch)) {
            ch = string.codePointAt(i);
            if (!Character.isLetter(ch)) {
                break;
            }
        }
        return string.substring(i);
    }

We use the String codePointAt method to get the Unicode code point at the index, pass the code point to Character.isLetter, and advance the index by Character.charCount so that surrogate pairs are stepped over as a unit.
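Since Java 8, the same idea can also be written with the codePoints() stream; this variant (class name and sample input are illustrative) behaves like subStringCorrect:

```java
public class FirstNonLetter {
    public static void main(String[] args) {
        // The emoji U+1F600 is not a letter, so the substring starts there
        String s = "abc\uD83D\uDE00def";
        int i = 0;
        for (int cp : s.codePoints().toArray()) {
            if (!Character.isLetter(cp)) {
                break;
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars per code point
        }
        System.out.println(s.substring(i)); // the emoji and everything after it
    }
}
```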

Pay attention to the use of Locale

To support internationalization, Java introduced the concept of a Locale, and because of the Locale, string conversions can change in unexpected ways.

Consider the following example:

    public void toUpperCaseWrong(String input){
        if(input.toUpperCase().equals("JOKER")){
            System.out.println("match!");
        }
    }

We expect English input, but toUpperCase() without arguments uses the default Locale; if the system Locale is another language, input.toUpperCase() may return a completely different result. The classic example is the Turkish locale, where 'i' uppercases to the dotted capital 'İ' (U+0130) instead of 'I'.
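The Turkish case is easy to demonstrate (the class name is illustrative):

```java
import java.util.Locale;

public class TurkishLocaleDemo {
    public static void main(String[] args) {
        Locale turkish = new Locale("tr", "TR");
        // Turkish maps 'i' to dotted capital İ (U+0130), not to 'I'
        System.out.println("i".toUpperCase(turkish));              // İ
        System.out.println("i".toUpperCase(Locale.ENGLISH));       // I
        System.out.println("i".toUpperCase(turkish).equals("I"));  // false
    }
}
```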

Fortunately, toUpperCase provides a locale parameter, we can modify it like this:

    public void toUpperCaseRight(String input){
        if(input.toUpperCase(Locale.ENGLISH).equals("JOKER")){
            System.out.println("match!");
        }
    }

Similarly, DateFormat also has problems:

    public void getDateInstanceWrong(Date date){
        String myString = DateFormat.getDateInstance().format(date);
    }

    public void getDateInstanceRight(Date date){
        String myString = DateFormat.getDateInstance(DateFormat.MEDIUM, Locale.US).format(date);
    }

So whenever we convert or compare strings, we must consider the influence of the Locale.

Encoding format in file reading and writing

When we use InputStream and OutputStream to read and write files, we operate on raw binary data, so no encoding conversion is involved.

But if we access files through a Reader or Writer, we need to consider the file's encoding.

If the file is UTF-8 encoded and we read it as UTF-16, there will definitely be problems.
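A charset mismatch is easy to show in memory, without a file (the class name is illustrative): decoding UTF-8 bytes with the wrong charset produces mojibake.

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatch {
    public static void main(String[] args) {
        byte[] utf8Bytes = "中文".getBytes(StandardCharsets.UTF_8);
        // Decoding UTF-8 bytes as ISO-8859-1 yields garbage, not the original text
        String wrong = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        String right = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(wrong.equals("中文")); // false
        System.out.println(right.equals("中文")); // true
    }
}
```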

Consider the following example:

    public void fileOperationWrong(String inputFile,String outputFile) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(inputFile));
        PrintWriter writer = new PrintWriter(new FileWriter(outputFile));
        int line = 0;
        while (reader.ready()) {
            line++;
            writer.println(line + ": " + reader.readLine());
        }
        reader.close();
        writer.close();
    }

We want to read the source file and write each line, prefixed with its line number, into a new file. But FileReader and FileWriter use the platform default charset, so if the file's encoding does not match it, the result may be garbled.

We can modify the above code as follows:

    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(inputFile), StandardCharsets.UTF_8));
    PrintWriter writer = new PrintWriter(new OutputStreamWriter(new FileOutputStream(outputFile), StandardCharsets.UTF_8));

By forcing the encoding format to be specified, the correctness of the operation is ensured.
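On Java 7 and later, the same thing can be written more concisely with java.nio.file.Files, whose reader and writer factory methods take the Charset explicitly. This sketch uses temp files in place of the inputFile/outputFile parameters:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class NumberLinesNio {
    public static void main(String[] args) throws IOException {
        // Temp files stand in for inputFile / outputFile
        Path in = Files.createTempFile("in", ".txt");
        Path out = Files.createTempFile("out", ".txt");
        Files.write(in, "你好\nworld\n".getBytes(StandardCharsets.UTF_8));
        // Files.newBufferedReader / newBufferedWriter force the charset explicitly
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             PrintWriter writer = new PrintWriter(Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            String s;
            int line = 0;
            while ((s = reader.readLine()) != null) {
                writer.println(++line + ": " + s);
            }
        }
        List<String> lines = Files.readAllLines(out, StandardCharsets.UTF_8);
        System.out.println(lines); // [1: 你好, 2: world]
    }
}
```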

Don't encode non-character data as a string

We often need to encode binary data into a string, for example to store it in a database.

Binary data is just bytes, and from the introduction above we know that not every byte sequence forms valid characters. If bytes that cannot be expressed as characters are converted into a string, information is lost.

Look at the following example:

    public void convertBigIntegerWrong(){
        BigInteger x = new BigInteger("1234567891011");
        System.out.println(x);
        byte[] byteArray = x.toByteArray();
        String s = new String(byteArray);
        byteArray = s.getBytes();
        x = new BigInteger(byteArray);
        System.out.println(x);
    }

In the example above, we convert the BigInteger to a byte array (big-endian), decode those bytes into a String, encode the String back into bytes, and finally construct a BigInteger from them.

First look at the results:

1234567891011
80908592843917379

As you can see, the round trip did not succeed: bytes that were not valid characters in the charset were replaced or reinterpreted during the String conversion, so the value changed.
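The loss is visible even at the byte level, independent of any BigInteger arithmetic (the class name is illustrative): decoding arbitrary bytes as UTF-8 and re-encoding them does not return the original bytes, because invalid sequences are replaced.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyRoundTrip {
    public static void main(String[] args) {
        byte[] original = new BigInteger("1234567891011").toByteArray();
        // Decode arbitrary bytes as UTF-8, then encode the result back to bytes
        byte[] roundTripped = new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        // Invalid UTF-8 sequences were replaced with U+FFFD, so the arrays differ
        System.out.println(Arrays.equals(original, roundTripped)); // false
    }
}
```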

Although the String constructor can take a second parameter specifying the character encoding, the charsets every Java platform is guaranteed to support are only US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16 (see StandardCharsets), and none of them can represent arbitrary binary data. Java's UTF-16 is also big-endian by default.

How to modify the above example?

    public void convertBigIntegerRight(){
        BigInteger x = new BigInteger("1234567891011");
        String s = x.toString();  // convert to a string that can be stored safely
        byte[] byteArray = s.getBytes();
        String ns = new String(byteArray);
        x = new BigInteger(ns);
        System.out.println(x);
    }

Here we first convert the BigInteger into its decimal string representation with toString; since that string contains only ASCII digits, it survives the byte round trip in any charset.

We can also use Base64 to encode the byte array without losing any data, as shown below:

    public void convertBigIntegerWithBase64(){
        BigInteger x = new BigInteger("1234567891011");
        byte[] byteArray = x.toByteArray();
        String s = Base64.getEncoder().encodeToString(byteArray);
        byteArray = Base64.getDecoder().decode(s);
        x = new BigInteger(byteArray);
        System.out.println(x);
    }

The code of this article:

learn-java-base-9-to-20/tree/master/security

This article has been included in http://www.flydean.com/java-security-code-line-string/

The most approachable explanations, the most profound insights, the most concise tutorials, and many tips you never knew are waiting for you to discover!

Welcome to follow my official account: "programs those things" — it knows technology, and it knows you better!

Origin blog.csdn.net/superfjj/article/details/108615246