Why does HashMap need to hash twice?

1. Introduction

HashMap needs little introduction for Java programmers. Besides being used in everyday development, it is a favorite topic of interviewers. HashMap is a classic hash table implementation whose underlying data structure is an array plus linked lists. JDK 8 also introduced red-black trees to address the inefficiency of linear search through long linked lists. HashMap is well designed, its source code runs to more than 2,000 lines, and there is plenty in it to discuss. This article focuses on one point: why HashMap hashes a second time.

2. The role of the hash code

First, what is the hash code for? Under the hood, HashMap stores key-value pairs in an array plus linked lists or red-black trees. The array is a sequence of hash slots. HashMap computes an array index from the key's hash code, and that index determines which slot the key-value pair falls into. When different hash codes produce the same index, a hash collision occurs. Once collisions occur, lookup degrades from O(1) to O(n) for a linked list or O(log n) for a red-black tree. A good hash function should therefore spread keys as evenly as possible across the slots; otherwise HashMap's efficiency suffers.

3. Secondary Hash

We already know that HashMap computes the index from the hash code, and that the more evenly the hash codes are spread, the more efficient HashMap is. Looking at how the index is computed shows why a secondary hash is needed.

static int indexFor(int h, int length) {
    return h & (length-1);
}

The code above computes the index of a key-value pair from the hash code produced by the secondary hash; length is the length of the underlying array. (indexFor is the JDK 7 form; JDK 8 inlines the same expression as (n - 1) & hash.) HashMap uses a bitwise AND instead of the more familiar modulo operation. Because the array length is always a power of two, the two give the same result, so you can skip over this detail here.
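To check the claim that the mask and the modulo agree, here is a minimal sketch. The class name IndexForDemo is made up for illustration and is not part of the JDK. It compares against Math.floorMod rather than the % operator, because % can return a negative result for a negative hash code. The equivalence assumes the length is a power of two.

```java
// Hypothetical demo class, not part of the JDK.
final class IndexForDemo {

    // Same expression as the indexFor method shown above.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16; // must be a power of two for the mask to equal a modulo
        int[] hashes = {5, 65541, -1, Integer.MIN_VALUE, 123456789};
        for (int h : hashes) {
            int byMask = indexFor(h, length);
            int byMod = Math.floorMod(h, length); // always non-negative for length > 0
            System.out.println(h + " -> mask=" + byMask + ", floorMod=" + byMod);
        }
    }
}
```

If the length were not a power of two, for example 10, length - 1 would not be a run of 1 bits and the two expressions would disagree.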

Let's first see what happens without the secondary hash. Assume the array length is 16. With a hash code of 5, the index is 5.

 00000000000000000000000000000101
&00000000000000000000000000001111
=00000000000000000000000000000101
=5

With a hash code of 65541, the index is still 5. Two different hash codes yield the same index, so they collide.

 00000000000000010000000000000101
&00000000000000000000000000001111
=00000000000000000000000000000101
=5

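Looking at this AND operation, you have surely noticed that the high bits of the hash code take no part in it at all; they are simply discarded. Whatever the high bits are, they cannot affect the resulting index, because only the low bits are used. We consider such a hash function poor: it causes more collisions and hurts HashMap's efficiency.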
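The effect can be reproduced with a short sketch. The class name HighBitsIgnoredDemo and the sample hash codes are chosen purely for illustration: all four values share the low 4 bits 0101 and differ only in their high 16 bits.

```java
// Hypothetical demo class, not part of the JDK.
final class HighBitsIgnoredDemo {

    // Index computation without any secondary hash.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Left-pads the binary form to 32 digits so the bit columns line up.
    static String bits(int h) {
        return String.format("%32s", Integer.toBinaryString(h)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int length = 16;
        // Same low bits, different high bits (0x00010005 is 65541).
        int[] hashes = {0x00000005, 0x00010005, 0x00020005, 0x7FFF0005};
        for (int h : hashes) {
            System.out.println(bits(h) + " -> index " + indexFor(h, length));
        }
    }
}
```

Every line reports index 5, so all four keys would land in the same slot.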

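How can this be fixed? The simplest approach is to let the high bits take part in the computation too, so that different high bits also lead to a different index and collisions become less likely. This is in fact what HashMap does. Below is the source code of HashMap's secondary hash: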

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

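HashMap XORs the high 16 bits of the hash code with the low 16 bits to obtain a new hash code, so that the high bits also take part in the index computation. This function is also called the "perturbation function".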

Let's take the same hash codes and see whether the secondary hash changes the outcome. Still assuming an array length of 16, a hash code of 5 gives index 5, the same as before.

 0000000000000101
^0000000000000000
=0000000000000101

 00000000000000000000000000000101
&00000000000000000000000000001111
=00000000000000000000000000000101
=5

With a hash code of 65541, the index is now 4, and the collision is gone.

 0000000000000101
^0000000000000001
=0000000000000100

 00000000000000010000000000000100
&00000000000000000000000000001111
=00000000000000000000000000000100
=4

As you can see, by adding a perturbation function, HashMap keeps two hash codes that would otherwise have collided from conflicting.
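The two worked examples can be replayed in code. This is a minimal sketch: SpreadDemo and spread are illustrative names, and spread applies the same expression as the hash method above directly to an int hash code instead of calling key.hashCode().

```java
// Hypothetical demo class, not part of the JDK.
final class SpreadDemo {

    // Same XOR-shift as HashMap's hash(Object), applied to a raw hash code.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        for (int h : new int[] {5, 65541}) {
            int without = indexFor(h, length);      // raw hash code
            int with = indexFor(spread(h), length); // after the secondary hash
            System.out.println(h + ": without=" + without + ", with=" + with);
        }
        // prints "5: without=5, with=5" and "65541: without=5, with=4"
    }
}
```

The perturbation lowers the chance of collisions but cannot eliminate them: with 16 slots, many distinct hash codes must still share an index.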

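4. Why shift right by 16 bits

HashMap's perturbation function XORs the high 16 bits with the low 16 bits, combining the characteristics of the high bits with those of the low bits to reduce the probability of collisions. Why 16 bits, rather than 8, 24, or some other number?

From the way the index is computed from the hash code, you will have noticed that only the low bits covered by the array length actually take part. If the array length is 16, only the low 4 bits are used; if it is 256, only the low 8 bits; if it is 65536, only the low 16 bits. Choosing 16 is a compromise: in the vast majority of cases, the length of a HashMap's array does not exceed 65536.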
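The relationship between array length and the number of hash code bits that reach the index can be sketched as follows. IndexBitsDemo and indexBits are illustrative names; the sketch assumes the length is a power of two, so that length - 1 is a run of 1 bits and counting those bits gives the number of low bits kept by the mask.

```java
// Hypothetical demo class, not part of the JDK.
final class IndexBitsDemo {

    // Number of low bits of the hash code kept by h & (length - 1),
    // assuming length is a power of two.
    static int indexBits(int length) {
        return Integer.bitCount(length - 1);
    }

    public static void main(String[] args) {
        for (int length : new int[] {16, 256, 65536}) {
            System.out.println("length " + length + " -> low " + indexBits(length) + " bits");
        }
        // prints 4, 8 and 16 bits respectively
    }
}
```

For any length up to 65536, the mask therefore covers at most the low 16 bits, which is exactly the half that h >>> 16 folds the high bits into.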

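5. Summary

Under the hood, HashMap stores key-value pairs in an array plus linked lists or red-black trees, and uses the key's hash code to compute which array index a pair falls into. When different hash codes produce the same index, a hash collision occurs and HashMap's performance suffers. HashMap tries to avoid collisions as far as possible, which is why it adds a perturbation function. The perturbation function XORs the high 16 bits of the hash code with the low 16 bits, so that the high bits also take part in the index computation and influence the result, reducing the probability of collisions. As for why 16 bits: which bits take part in the index computation depends on the length of HashMap's array, and since in the vast majority of cases that length does not exceed 65536, 16 bits is a reasonable compromise.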


Origin juejin.im/post/7100829047042605064