Analysis of String and hashCode

Hash collisions in the String class are common: it is easy to find different strings with the same hash code. We should steer clear of this potential performance pitfall in system design and development.

1. Sample code

Let's look at some sample code first:


import org.junit.Assert;
import org.junit.Test;

// Demonstrates hash collisions in the String class
public class StringHashCodeTest {

    @Test
    public void testMain() {
        main(null);
    }

    public static void main(String[] args) {
        testHash("Aa", "BB");     // both hash to 2112
        testHash("AaAa", "AaBB"); // both hash to 2031744
        testHash("BBAa", "AaBB"); // both hash to 2031744
    }

    // Prints both hash codes and asserts that they are equal
    private static void testHash(String s1, String s2) {
        Assert.assertNotNull(s1);
        Assert.assertNotNull(s2);
        int hash1 = s1.hashCode();
        int hash2 = s2.hashCode();
        System.out.println("s1=" + s1 + "; s1.hash=" + hash1 + "; s2=" + s2 + "; s2.hash=" + hash2);
        Assert.assertEquals(hash1, hash2);
    }
}

Run the code and every assertion passes: each pair of different strings has the same hash code, as the comments indicate. This shows how easily String hash codes collide.

Of course, this is permitted. A hashCode method that returns 1 for every object is also permitted by the contract, although it would be a poor choice. For background, see:
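To see how quickly such collisions multiply, here is a minimal sketch. The class CollisionDemo and the method collidingStrings are hypothetical names, not part of the test above. Because "Aa" and "BB" have the same length and the same hash code, any string built by concatenating n of these blocks hashes to the same value, so n blocks yield 2^n colliding strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only
class CollisionDemo {

    // Builds every string made of `blocks` two-character blocks,
    // where each block is either "Aa" or "BB"
    static List<String> collidingStrings(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> strings = collidingStrings(10);
        Set<Integer> hashes = new HashSet<>();
        for (String s : strings) {
            hashes.add(s.hashCode());
        }
        // prints: strings=1024, distinct hashes=1
        System.out.println("strings=" + strings.size() + ", distinct hashes=" + hashes.size());
    }
}
```

With 10 blocks there are already 1024 different 20-character strings sharing one hash code.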

The convention and rewriting principle of hashCode and equals method in Java

2. String#hashCode() implementation

Open the source code of the String class and read the Javadoc comment. The formula for String#hashCode() is:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Here s[i] is the char value of the i-th character, n is the number of characters, and ^ denotes exponentiation.

In JDK 11, String#hashCode() is implemented as:
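As a worked example of the formula: for "Aa", 'A' is 65 and 'a' is 97, so the hash is 65*31 + 97 = 2112. For "BB", 'B' is 66, so the hash is 66*31 + 66 = 2112. The sketch below evaluates the formula term by term and compares it with the built-in result. PolynomialHash and polynomialHash are hypothetical names introduced for this illustration; the arithmetic uses int, so it wraps on overflow just as the JDK code does.

```java
// Hypothetical helper, for illustration only
class PolynomialHash {

    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] term by term
    static int polynomialHash(String s) {
        int n = s.length();
        int sum = 0;
        for (int i = 0; i < n; i++) {
            // Compute 31^(n-1-i) with int arithmetic (wraps on overflow)
            int power = 1;
            for (int k = 0; k < n - 1 - i; k++) {
                power *= 31;
            }
            sum += s.charAt(i) * power;
        }
        return sum;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"Aa", "BB", "AaAa"}) {
            System.out.println(s + ": formula=" + polynomialHash(s) + ", hashCode=" + s.hashCode());
        }
    }
}
```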


public final class String
    implements java.io.Serializable,
      Comparable<String>, CharSequence {

    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            hash = h = isLatin1() ? StringLatin1.hashCode(value)
                                  : StringUTF16.hashCode(value);
        }
        return h;
    }
    // ...
}

A side note: the Comparable<String> interface is generic, so when implementing compareTo you can only compare against your own type (including its subclasses):

public interface Comparable<T> {

    // Compare; the return value behaves like subtraction: this - o
    public int compareTo(T o);
}

The hashCode implementation of the String class uses two package-private classes:

  • java.lang.StringLatin1
  • java.lang.StringUTF16

Which one is used depends on the return value of isLatin1().

Latin1 is an encoding in which every character fits in 1 byte, somewhat similar to ANSI encoding. Quoting Baidu Encyclopedia:

Latin1 is an alias of ISO-8859-1, and is sometimes written as Latin-1. ISO-8859-1 is a single-byte encoding that is backward compatible with ASCII. Its encoding range is 0x00-0xFF: 0x00-0x7F is identical to ASCII, 0x80-0x9F holds control characters, and 0xA0-0xFF holds text symbols.

The implementation of StringLatin1.hashCode(value) is:


final class StringLatin1 {

    public static int hashCode(byte[] value) {
        int h = 0;
        for (byte v : value) {
            h = 31 * h + (v & 0xff);
        }
        return h;
    }
    // ...
}

This implementation multiplies the accumulated value h by the prime 31 and then adds the current character, (v & 0xff). Each Latin1 character occupies exactly 1 byte, and because byte is signed in Java, masking with 0xff clears the sign-extended high bits so the value falls in the range 0 to 255.

This implementation is equivalent to the formula listed earlier.

Next, let's look at the implementation of StringUTF16.hashCode(value):
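The effect of the 0xff mask can be checked with a small sketch. Latin1MaskDemo and its two methods are hypothetical names for this illustration; hashWithMask copies the loop above, while hashWithoutMask drops the mask. The character U+00E9 is stored in Latin1 as the byte 0xE9, which Java reads as the signed value -23.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only
class Latin1MaskDemo {

    // Same loop as StringLatin1.hashCode: each byte is treated as 0..255
    static int hashWithMask(byte[] value) {
        int h = 0;
        for (byte v : value) {
            h = 31 * h + (v & 0xff);
        }
        return h;
    }

    // Deliberately wrong variant: bytes >= 0x80 are sign-extended to negative ints
    static int hashWithoutMask(byte[] value) {
        int h = 0;
        for (byte v : value) {
            h = 31 * h + v;
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // LATIN SMALL LETTER E WITH ACUTE
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("with mask:    " + hashWithMask(bytes));    // 233
        System.out.println("without mask: " + hashWithoutMask(bytes)); // -23
        System.out.println("hashCode():   " + s.hashCode());           // 233
    }
}
```

Only the masked version agrees with the char-based formula, since the char value of U+00E9 is 233.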


final class StringUTF16 {

    public static int hashCode(byte[] value) {
        // Initial value is 0
        int h = 0;
        // Each UTF-16 char takes 2 bytes; shifting right by 1 divides by 2,
        // which gives the actual number of chars
        int length = value.length >> 1;
        for (int i = 0; i < length; i++) {
            h = 31 * h + getChar(value, i);
        }
        return h;
    }
    // ...
}

3. Endianness

The implementation of the getChar method is interesting because it involves byte order.


final class StringUTF16 {

    static char getChar(byte[] val, int index) {
        assert index >= 0 && index < length(val) : "Trusted caller missed bounds check";
        index <<= 1;
        return (char)(((val[index++] & 0xff) << HI_BYTE_SHIFT) |
                      ((val[index]   & 0xff) << LO_BYTE_SHIFT));
    }
}

For details on big-endian and little-endian byte order, see Ruan Yifeng's blog post:

Understanding endianness

For a quick understanding, look at this part of the code:


final class StringUTF16 {

    // Whether the current platform/OS/CPU uses big-endian byte order
    private static native boolean isBigEndian();

    static final int HI_BYTE_SHIFT;
    static final int LO_BYTE_SHIFT;
    static {
        if (isBigEndian()) {
            HI_BYTE_SHIFT = 8;
            LO_BYTE_SHIFT = 0;
        } else {
            HI_BYTE_SHIFT = 0;
            LO_BYTE_SHIFT = 8;
        }
    }
}

Put simply, byte order defines how multiple bytes are interpreted or converted:

  • How to serialize a multi-byte structure into a byte array (byte[]), since a byte[] is strictly ordered.
  • How to interpret a byte array as a multi-byte structure, for example a 2-byte char, a 4-byte int, or an 8-byte long.

Looking back at the StringUTF16#getChar(byte[] val, int index) method above deepens this understanding and also shows what the shift operators do.
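Both directions can be tried out with a small sketch. ByteOrderDemo and toChar are hypothetical names for this illustration; toChar mirrors the shift-and-or expression in getChar but takes the byte order as a parameter instead of reading it from the platform. The standard java.nio.ByteBuffer class does the serialization.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only
class ByteOrderDemo {

    // Combines two bytes into a char; `first` is the byte at the lower index.
    // Big-endian: first byte is the high byte. Little-endian: first byte is the low byte.
    static char toChar(byte first, byte second, boolean bigEndian) {
        int firstShift = bigEndian ? 8 : 0;
        int secondShift = bigEndian ? 0 : 8;
        return (char) (((first & 0xff) << firstShift) | ((second & 0xff) << secondShift));
    }

    // Serializes a char into 2 bytes using the given byte order
    static byte[] toBytes(char c, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putChar(c).array();
    }

    public static void main(String[] args) {
        char c = '\u4e2d';
        byte[] be = toBytes(c, ByteOrder.BIG_ENDIAN);    // {0x4e, 0x2d}
        byte[] le = toBytes(c, ByteOrder.LITTLE_ENDIAN); // {0x2d, 0x4e}
        System.out.printf("big-endian:    %02x %02x%n", be[0], be[1]);
        System.out.printf("little-endian: %02x %02x%n", le[0], le[1]);
        // Reading the bytes back with the matching order restores the char
        System.out.println(toChar(be[0], be[1], true) == c);  // true
        System.out.println(toChar(le[0], le[1], false) == c); // true
        System.out.println("native order:  " + ByteOrder.nativeOrder());
    }
}
```

Reading the bytes back with the wrong order would swap the two halves, which is why getChar picks its shift amounts from the platform's byte order.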

4. Summary

Since different strings so easily share a hash code, we must avoid this potential performance pitfall in system design and development.

August 23, 2022

Origin blog.csdn.net/renfufei/article/details/126485133