Analysis of String and hashCode
Hash collisions are easy to produce with the String class, so we should keep this potential performance pitfall in mind during system development and design.
1. Sample code
Let's look at some sample code first:
import org.junit.Assert;
import org.junit.Test;
// Demonstrates hash collisions in the String class
public class StringHashCodeTest {
@Test
public void testMain() {
main(null);
}
public static void main(String[] args) {
testHash("Aa", "BB"); // 2112
// 2031744
testHash("AaAa", "AaBB");
testHash("BBAa", "AaBB");
}
private static void testHash(String s1, String s2) {
Assert.assertNotNull(s1);
Assert.assertNotNull(s2);
int hash1 = s1.hashCode();
int hash2 = s2.hashCode();
System.out.println("s1=" + s1 + "; s1.hash=" + hash1 + "; s2=" + s2 + "; s2.hash=" + hash2);
Assert.assertEquals(hash1, hash2);
}
}
Run the code: the assertions pass and the printed values match the comments, which shows that distinct strings can easily share the same hash code.
This is allowed by the hashCode contract. Even a hashCode method that returns 1 for every object is legal, although it makes hash-based collections perform poorly. For background, see:
The convention and rewriting principle of hashCode and equals method in Java
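To make the scale of the problem concrete, here is a minimal sketch that builds many distinct strings from the "Aa" and "BB" blocks used in the sample. Because the two blocks have the same hash code and the same length, every concatenation with the same number of blocks should collide. The class and method names (CollisionDemo, collidingKeys, distinctHashes) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CollisionDemo {
    // Builds all 2^blocks strings made of "Aa"/"BB" blocks.
    static List<String> collidingKeys(int blocks) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>(keys.size() * 2);
            for (String prefix : keys) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            keys = next;
        }
        return keys;
    }

    // Counts how many distinct hash codes occur among the given keys.
    static int distinctHashes(List<String> keys) {
        Set<Integer> hashes = new HashSet<>();
        for (String key : keys) {
            hashes.add(key.hashCode());
        }
        return hashes.size();
    }

    public static void main(String[] args) {
        List<String> keys = collidingKeys(10);
        // prints: keys=1024, distinct hashes=1
        System.out.println("keys=" + keys.size() + ", distinct hashes=" + distinctHashes(keys));
    }
}
```

All 1024 keys are different strings, yet a hash table sees a single hash code for them, so they all land in the same bucket.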
2. Implementation of String#hashCode()
Open the source code of the String class and read the Javadoc comment: the calculation formula of String#hashCode() is:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
Here, s[i] is the char value of the i-th character, n is the number of characters, and ^ denotes exponentiation. The sum is computed in int arithmetic, so it wraps around on overflow.
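As a worked check of the formula, the sketch below evaluates it term by term and compares the result with the values from the sample code. PolynomialHash and byFormula are illustrative names, not part of the JDK.

```java
class PolynomialHash {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] term by term.
    // int arithmetic wraps on overflow, exactly as String#hashCode() does.
    static int byFormula(String s) {
        int n = s.length();
        int sum = 0;
        for (int i = 0; i < n; i++) {
            int power = 1;
            for (int k = 0; k < n - 1 - i; k++) {
                power *= 31;
            }
            sum += s.charAt(i) * power;
        }
        return sum;
    }

    public static void main(String[] args) {
        // 'A' = 65, 'a' = 97, 'B' = 66
        System.out.println(byFormula("Aa")); // 65*31 + 97 = 2112
        System.out.println(byFormula("BB")); // 66*31 + 66 = 2112
        System.out.println(byFormula("AaBB") == "AaBB".hashCode()); // true
    }
}
```

The collision between "Aa" and "BB" follows directly: raising the first character by 1 adds 31, and lowering the second character by 31 cancels it out.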
In JDK 11, the implementation of the String#hashCode() method is:
public final class String
implements java.io.Serializable,
Comparable<String>, CharSequence {
public int hashCode() {
int h = hash;
if (h == 0 && value.length > 0) {
hash = h = isLatin1() ? StringLatin1.hashCode(value)
: StringUTF16.hashCode(value);
}
return h;
}
// ...
}
A small side note: the Comparable<String> interface is generic, so when implementing compareTo you can only compare against your own type and its subclasses:
public interface Comparable<T> {
// Compare; the return value behaves like a subtraction: this - o
public int compareTo(T o);
}
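As a small illustration of this point, here is a hedged sketch of a class implementing the generic interface. Version is a hypothetical class invented for this example. It uses Integer.compare instead of a literal subtraction, because a subtraction can overflow for large values.

```java
// Hypothetical example type; only other Version instances can be passed to compareTo.
class Version implements Comparable<Version> {
    final int major;
    final int minor;

    Version(int major, int minor) {
        this.major = major;
        this.minor = minor;
    }

    @Override
    public int compareTo(Version other) {
        // Negative, zero, or positive, like the sign of "this - other".
        int byMajor = Integer.compare(major, other.major);
        return byMajor != 0 ? byMajor : Integer.compare(minor, other.minor);
    }

    public static void main(String[] args) {
        System.out.println(new Version(1, 2).compareTo(new Version(1, 3)) < 0); // true
        // new Version(1, 2).compareTo("1.2") would not compile: String is not a Version.
    }
}
```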
The hashCode implementation of the String class delegates to two package-private classes:
java.lang.StringLatin1
java.lang.StringUTF16
Which one is used depends on the return value of isLatin1().
Latin1 is an encoding in which every character fits in 1 byte, somewhat similar to the ANSI code pages. Baidu Encyclopedia describes it as follows:
Latin1 is an alias of ISO-8859-1, sometimes written as Latin-1. ISO-8859-1 is a single-byte encoding that is backward compatible with ASCII. Its range is 0x00-0xFF: 0x00-0x7F is identical to ASCII, 0x80-0x9F holds control characters, and 0xA0-0xFF holds letters and symbols.
The implementation of StringLatin1.hashCode(value) is:
final class StringLatin1 {
public static int hashCode(byte[] value) {
int h = 0;
for (byte v : value) {
h = 31 * h + (v & 0xff);
}
return h;
}
// ...
}
This implementation multiplies the accumulated value h by the prime number 31 and then adds the current character, (v & 0xff). A Latin1 character needs only 1 byte, but byte is signed in Java, so a value of 0x80 or above would widen to a negative int. Masking with 0xff clears the sign-extended high bits and keeps the value in the range 0-255.
This implementation is equivalent to the formula listed earlier.
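The effect of the mask can be checked with a short sketch. It copies the loop above into a standalone class (Latin1HashSketch is an illustrative name) and, as an assumption for demonstration purposes, obtains the Latin1 bytes through getBytes(ISO_8859_1) rather than from the private value field.

```java
import java.nio.charset.StandardCharsets;

class Latin1HashSketch {
    // Same loop as StringLatin1.hashCode: each byte is treated as an unsigned value.
    static int hash(byte[] value) {
        int h = 0;
        for (byte v : value) {
            h = 31 * h + (v & 0xff);
        }
        return h;
    }

    // Deliberately wrong variant: without the mask, bytes >= 0x80 become negative.
    static int hashWithoutMask(byte[] value) {
        int h = 0;
        for (byte v : value) {
            h = 31 * h + v;
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "caf\u00e9"; // U+00E9 is stored as the single byte 0xE9 in Latin1
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(hash(bytes) == s.hashCode());            // true
        System.out.println(hashWithoutMask(bytes) == s.hashCode()); // false
    }
}
```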
Next, let's look at the implementation of the StringUTF16.hashCode(value) method:
final class StringUTF16 {
public static int hashCode(byte[] value) {
// Initial value is 0
int h = 0;
// Each UTF-16 char takes 2 bytes; shifting right by 1 divides by 2 and gives the actual number of chars
int length = value.length >> 1;
for (int i = 0; i < length; i++) {
h = 31 * h + getChar(value, i);
}
return h;
}
// ...
}
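The same loop can be tried outside the JDK with a minimal sketch. It assumes the bytes come from getBytes(UTF_16BE), so the high byte of each char comes first. This is a choice made for the demonstration, not the JDK's internal layout. Utf16HashSketch is an illustrative name.

```java
import java.nio.charset.StandardCharsets;

class Utf16HashSketch {
    // Hashes a byte array that holds UTF-16 code units, high byte first (assumption).
    static int hash(byte[] value) {
        int h = 0;
        int length = value.length >> 1; // two bytes per char
        for (int i = 0; i < length; i++) {
            int hi = value[2 * i] & 0xff;
            int lo = value[2 * i + 1] & 0xff;
            h = 31 * h + ((hi << 8) | lo);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "\u4E2D\u6587"; // two characters outside Latin1
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(bytes.length);                // 4
        System.out.println(hash(bytes) == s.hashCode()); // true
    }
}
```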
3. Endianness
The implementation of the getChar method here is interesting, because it involves byte order.
final class StringUTF16 {
static char getChar(byte[] val, int index) {
assert index >= 0 && index < length(val) : "Trusted caller missed bounds check";
index <<= 1;
return (char)(((val[index++] & 0xff) << HI_BYTE_SHIFT) |
((val[index] & 0xff) << LO_BYTE_SHIFT));
}
}
For details on big-endian and little-endian byte order, see Ruan Yifeng's weblog.
For a quick understanding, look at this part of the code:
final class StringUTF16 {
// Whether the current platform (OS/CPU) uses big-endian byte order
private static native boolean isBigEndian();
static final int HI_BYTE_SHIFT;
static final int LO_BYTE_SHIFT;
static {
if (isBigEndian()) {
HI_BYTE_SHIFT = 8;
LO_BYTE_SHIFT = 0;
} else {
HI_BYTE_SHIFT = 0;
LO_BYTE_SHIFT = 8;
}
}
}
Put simply, byte order defines how multiple bytes are interpreted or converted:
- How to serialize a multi-byte structure into a byte array byte[], because byte[] is strictly ordered.
- How to interpret a byte array as a multi-byte structure, for example: char is 2 bytes, int is 4 bytes, long is 8 bytes.
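Both directions can be tried out with java.nio.ByteBuffer, which lets the caller choose the byte order explicitly. This is only a sketch for experimenting with the concept, and the ByteOrderSketch class and its helper names are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderSketch {
    // Serializes one 2-byte char into a byte[] using the given byte order.
    static byte[] toBytes(char c, ByteOrder order) {
        return ByteBuffer.allocate(2).order(order).putChar(c).array();
    }

    // Interprets the first two bytes as a char using the given byte order.
    static char fromBytes(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getChar();
    }

    public static void main(String[] args) {
        char c = '\u4E2D';
        byte[] big = toBytes(c, ByteOrder.BIG_ENDIAN);       // 0x4E, 0x2D
        byte[] little = toBytes(c, ByteOrder.LITTLE_ENDIAN); // 0x2D, 0x4E
        System.out.println(fromBytes(big, ByteOrder.BIG_ENDIAN) == c);       // true
        System.out.println(fromBytes(little, ByteOrder.LITTLE_ENDIAN) == c); // true
        // Reading with the wrong order swaps the two bytes: prints 2d4e
        System.out.println(Integer.toHexString(fromBytes(big, ByteOrder.LITTLE_ENDIAN)));
    }
}
```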
Looking back at the StringUTF16#getChar(byte[] val, int index) method with this in mind deepens our understanding, and also shows what the shift operators are for: they move each byte into the high or low half of the 16-bit char.
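A standalone sketch of the same idea is shown below. It passes the two shift amounts as parameters instead of reading the static HI_BYTE_SHIFT and LO_BYTE_SHIFT fields, which is a simplification made for this example. GetCharSketch is an illustrative name.

```java
class GetCharSketch {
    // Reassembles the char at the given index from two bytes.
    // firstShift applies to the first byte of the pair, secondShift to the second.
    static char getChar(byte[] val, int index, int firstShift, int secondShift) {
        int i = index << 1; // each char occupies two bytes
        return (char) (((val[i] & 0xff) << firstShift)
                | ((val[i + 1] & 0xff) << secondShift));
    }

    public static void main(String[] args) {
        byte[] bigEndian = {0x4E, 0x2D};    // U+4E2D stored high byte first
        byte[] littleEndian = {0x2D, 0x4E}; // U+4E2D stored low byte first

        // Big-endian layout: first byte shifted by 8, second by 0. Prints 4e2d
        System.out.println(Integer.toHexString(getChar(bigEndian, 0, 8, 0)));
        // Little-endian layout: first byte shifted by 0, second by 8. Prints 4e2d
        System.out.println(Integer.toHexString(getChar(littleEndian, 0, 0, 8)));
        // Shifts that do not match the layout give the wrong char. Prints 2d4e
        System.out.println(Integer.toHexString(getChar(bigEndian, 0, 0, 8)));
    }
}
```

As long as the shifts match the layout of the bytes, both platforms recover the same char, which is why the hash code of a string does not depend on the machine's byte order.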
4. Summary
Since hash collisions are easy to produce with the String class, we must keep this potential performance pitfall in mind during system development and design.
August 23, 2022