Collection series - HashMap source code analysis

Earlier in this series we analyzed ArrayList and LinkedList. ArrayList is backed by an array and LinkedList by a linked list, and each has its own strengths: ArrayList is faster at locating elements by position, while LinkedList is faster at inserting and removing elements. The HashMap introduced in this article combines the advantages of both. It is built on a hash table, and if hash collisions are ignored, adding, deleting, updating and looking up entries all take O(1) time. Let's first look at the structure of the hash table it is based on.

As the figure above shows, a hash table is a structure composed of an array and linked lists. The figure is, of course, a deliberately bad example. A good hash function should spread elements as evenly as possible across the array, which reduces hash collisions and keeps the linked lists short. The longer a linked list is, the more nodes must be traversed during a lookup and the worse the hash table performs. Next, let's take a look at some of HashMap's member variables.

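// Default initial capacity
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4;

// Maximum capacity
static final int MAXIMUM_CAPACITY = 1 << 30;

// Default load factor: how full the hash table may get before it grows
static final float DEFAULT_LOAD_FACTOR = 0.75f;

// Empty hash table
static final Entry<?,?>[] EMPTY_TABLE = {};

// The hash table actually in use
transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

// Size of the HashMap, i.e. the number of key-value pairs it stores
transient int size;

// Threshold on the number of key-value pairs, used to decide whether the table must grow
int threshold;

// Load factor
final float loadFactor;

// Modification count, used by the fail-fast mechanism
transient int modCount;

// Default threshold for switching to alternative hashing
static final int ALTERNATIVE_HASHING_THRESHOLD_DEFAULT = Integer.MAX_VALUE;

// Random hash seed, which helps reduce the number of hash collisions
transient int hashSeed = 0;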


As the member variables show, the default initial capacity of HashMap is 16 and the default load factor is 0.75. The threshold is the number of key-value pairs the map can hold before it must grow. It is computed as capacity * load factor, which by default is 16 * 0.75 = 12. Once the number of key-value pairs reaches the threshold, the hash table is considered saturated: adding more elements would cause more hash collisions and degrade performance, so the automatic expansion mechanism is triggered. We can also see that the hash table is really an Entry array, and each Entry stored in the array is the head node of a singly linked list. Entry is a static nested class of HashMap. Let's take a look at its member variables.

static class Entry<K,V> implements Map.Entry<K,V> {
   final K key;      // key
   V value;          // value
   Entry<K,V> next;  // reference to the next Entry in the chain
   int hash;         // cached hash code
   
   ...               // remaining code omitted
}


An Entry instance is a key-value pair: it holds a key and a value, plus a reference to the next Entry. To avoid recomputing it, each Entry also stores the hash code of its key. The Entry array is the core of HashMap, and every operation works on this array. The HashMap source code is fairly long and cannot be covered method by method, so we will concentrate on the key points. We will take a problem-oriented approach and explore HashMap's internal mechanism through the following questions.

1. What operations does HashMap do during construction?

// Constructor taking the initial capacity and the load factor
public HashMap(int initialCapacity, float loadFactor) {
   if (initialCapacity < 0) {
       throw new IllegalArgumentException("Illegal initial capacity: " 
       + initialCapacity);
   }
   // If the initial capacity exceeds the maximum capacity, clamp it to the maximum
   if (initialCapacity > MAXIMUM_CAPACITY) {
       initialCapacity = MAXIMUM_CAPACITY;
   }
   // If the load factor is not positive or is NaN, throw an exception
   if (loadFactor <= 0 || Float.isNaN(loadFactor)) {
       throw new IllegalArgumentException("Illegal load factor: " 
       + loadFactor);
   }
   // Store the load factor
   this.loadFactor = loadFactor;
   // The threshold temporarily holds the initial capacity
   threshold = initialCapacity;
   init();
}

void init() {}


All the other constructors delegate to this one. Besides validating its parameters, it does two things: it stores the load factor that was passed in, and it sets the threshold to the requested initial capacity. The init method is empty and does nothing. Note that no Entry array is created at this point, regardless of the requested initial capacity. So when is the array created? Read on.

2. What operation does HashMap do when adding a key-value pair?

// Put a key-value pair into the HashMap
public V put(K key, V value) {
   // If the hash table has not been initialized yet, initialize it
   if (table == EMPTY_TABLE) {
       // Initialize the hash table
       inflateTable(threshold);
   }
   if (key == null) {
       return putForNullKey(value);
   }
   // Compute the hash code of the key
   int hash = hash(key);
   // Locate the slot in the hash table from the hash code
   int i = indexFor(hash, table.length);
   for (Entry<K,V> e = table[i]; e != null; e = e.next) {
       Object k;
       // If the key already exists, replace its value and return the old value
       if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
           V oldValue = e.value;
           e.value = value;
           e.recordAccess(this);
           return oldValue;
       }
   }
   modCount++;
   // No matching key was found, so add a new Entry to the HashMap
   addEntry(hash, key, value, i);
   // Return null when a new entry was added
   return null;
}


As you can see, put first checks whether the hash table is still the empty table and initializes it if so. It then calls the hash function to compute the hash code of the incoming key, uses that hash code to locate a slot in the Entry array, and traverses the singly linked list in that slot. If an entry with the same key already exists, its value is replaced and the old value is returned; otherwise a new Entry is created and added to the hash table.
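The lookup-then-replace-or-add flow described above can be sketched with a much smaller chained table. This is only an illustration under simplifying assumptions: ChainedMapSketch and Node are hypothetical names, the table has a fixed length of 16, there is no resizing, and the key's raw hashCode is used without the extra mixing that HashMap applies.

```java
import java.util.Objects;

// Hypothetical, simplified chained hash table; not the JDK implementation.
final class ChainedMapSketch<K, V> {
    private static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];
    private int size;

    // Returns the previous value for the key, or null if the key is new.
    V put(K key, V value) {
        int hash = Objects.hashCode(key);       // a null key hashes to 0 here
        int i = hash & (table.length - 1);      // slot index, as in indexFor
        for (Node<K, V> e = table[i]; e != null; e = e.next) {
            if (e.hash == hash && Objects.equals(e.key, key)) {
                V old = e.value;                // key exists: replace the value
                e.value = value;
                return old;
            }
        }
        table[i] = new Node<>(hash, key, value, table[i]); // new head of the chain
        size++;
        return null;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        ChainedMapSketch<String, Integer> map = new ChainedMapSketch<>();
        System.out.println(map.put("a", 1)); // null: the key was not present
        System.out.println(map.put("a", 2)); // 1: the old value is returned
        System.out.println(map.size());      // 1: the second put replaced, not added
    }
}
```

Comparing the cached hash before calling equals is the same shortcut the real put uses: most non-matching nodes are rejected by a cheap int comparison.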

3. How is the hash table initialized?

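// Initialize the hash table; the capacity may be inflated because the requested size may not be a power of 2
private void inflateTable(int toSize) {
   // The capacity of the hash table must be a power of 2
   int capacity = roundUpToPowerOf2(toSize);
   // Set the threshold, normally capacity * loadFactor
   threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
   // Create a hash table with the computed capacity
   table = new Entry[capacity];
   // Initialize the hash seed
   initHashSeedAsNeeded(capacity);
}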


As we saw above, the constructor does not create the Entry array. Instead, put checks whether the hash table is still the empty table and, if so, calls inflateTable to initialize it. Inside inflateTable the capacity is recalculated, because the initial capacity passed to the constructor may not be a power of 2; it is rounded up to a power of 2 and the Entry array is created with that capacity. The threshold is then reset, normally to capacity * loadFactor. Finally, the hash seed (hashSeed) is initialized. The seed is used to strengthen the hash function; its default value of 0 means that alternative hashing is not used, although it can be enabled through a system property. The details are discussed below.
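To make the two calculations in inflateTable concrete, here is a small sketch. The body of roundUpToPowerOf2 is not shown in the listing above, so the version below is an assumption: one plausible way to round up using Integer.highestOneBit. InflateSketch and thresholdFor are hypothetical names.

```java
// Hypothetical helper class; only the threshold formula is taken from the listing above.
final class InflateSketch {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Assumed rounding rule: the smallest power of two >= number,
    // capped at MAXIMUM_CAPACITY, and at least 1.
    static int roundUpToPowerOf2(int number) {
        if (number >= MAXIMUM_CAPACITY) {
            return MAXIMUM_CAPACITY;
        }
        return number > 1 ? Integer.highestOneBit((number - 1) << 1) : 1;
    }

    // Same formula as in inflateTable: capacity * loadFactor, capped.
    static int thresholdFor(int capacity, float loadFactor) {
        return (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
    }

    public static void main(String[] args) {
        for (int requested : new int[] {0, 1, 10, 16, 17, 1000}) {
            int capacity = roundUpToPowerOf2(requested);
            System.out.println("requested=" + requested
                    + " capacity=" + capacity
                    + " threshold=" + thresholdFor(capacity, 0.75f));
        }
    }
}
```

For example, a requested capacity of 10 becomes 16 with a threshold of 12, and a requested capacity of 17 becomes 32 with a threshold of 24.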

4. When does HashMap determine whether to expand, and how does it expand?

// Add an Entry, first checking whether the table needs to grow
void addEntry(int hash, K key, V value, int bucketIndex) {
   // If the size has reached the threshold and the target slot is not empty
   if ((size >= threshold) && (null != table[bucketIndex])) {
       // The size has reached the threshold and this insert would collide, so grow the table
       resize(2 * table.length);
       hash = (null != key) ? hash(key) : 0;
       bucketIndex = indexFor(hash, table.length);
   }
   // Create the new Entry (in the resized table if a resize just happened)
   createEntry(hash, key, value, bucketIndex);
}

// Grow the hash table
void resize(int newCapacity) {
   Entry[] oldTable = table;
   int oldCapacity = oldTable.length;
   // If the table is already at maximum capacity, all we can do is raise the threshold
   if (oldCapacity == MAXIMUM_CAPACITY) {
       threshold = Integer.MAX_VALUE;
       return;
   }
   // Otherwise grow the table
   Entry[] newTable = new Entry[newCapacity];
   // Migrate the entries to the new table
   transfer(newTable, initHashSeedAsNeeded(newCapacity));
   // Make the new table the current table
   table = newTable;
   // Update the threshold
   threshold = (int)Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
}


When put finds no existing entry for the key, it calls addEntry to create one. As the code above shows, addEntry first checks whether the number of stored entries has reached the threshold and whether the target slot is already occupied; if both are true, it calls resize with twice the current table length. resize creates a new Entry array of that capacity and migrates every element from the old table to the new one. Entries may be rehashed during the migration, depending on the value returned by initHashSeedAsNeeded. Once the migration is complete, the new table replaces the current one and the threshold is recalculated from the new capacity.
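The transfer method itself is not shown above, so the following is only a sketch of what the migration step might look like. ResizeSketch, Node and chainLength are hypothetical names, rehashing is left out, and inserting each node at the head of its new chain is an assumption made to keep the code short.

```java
// Hypothetical sketch of migrating chained entries into a table twice as large.
final class ResizeSketch {
    static final class Node {
        final int hash;
        final String key;
        Node next;

        Node(int hash, String key, Node next) {
            this.hash = hash;
            this.key = key;
            this.next = next;
        }
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    // Re-links every node of oldTable into a new table of double the length.
    static Node[] resize(Node[] oldTable) {
        Node[] newTable = new Node[oldTable.length * 2];
        for (Node head : oldTable) {
            Node e = head;
            while (e != null) {
                Node next = e.next;                      // remember the rest of the old chain
                int i = indexFor(e.hash, newTable.length);
                e.next = newTable[i];                    // insert at the head of the new chain
                newTable[i] = e;
                e = next;
            }
        }
        return newTable;
    }

    static int chainLength(Node head) {
        int n = 0;
        for (Node e = head; e != null; e = e.next) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Node[] table = new Node[4];
        // Hashes 1, 5, 9 and 13 all land in slot 1 while the length is 4.
        for (int h : new int[] {1, 5, 9, 13}) {
            int i = indexFor(h, table.length);
            table[i] = new Node(h, "k" + h, table[i]);
        }
        System.out.println("before: slot 1 holds " + chainLength(table[1]) + " nodes");
        Node[] bigger = resize(table);
        System.out.println("after: slot 1 holds " + chainLength(bigger[1])
                + ", slot 5 holds " + chainLength(bigger[5]));
    }
}
```

With length 8 the extra index bit splits the old chain: hashes 1 and 9 stay in slot 1, while 5 and 13 move to slot 5. This is how doubling the table shortens chains.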

5. Why does the size of the Entry array have to be a power of 2?

// Return the array index that corresponds to a hash code
static int indexFor(int h, int length) {
   return h & (length-1);
}


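The indexFor method computes the array index that corresponds to a hash code. Inside it we see the AND (&) operator. AND is a bitwise operation on two operands: a result bit is 1 only if both corresponding bits are 1, otherwise it is 0. It is often used to strip the high-order bits of an operand, for example 01011010 & 00001111 = 00001010. Let's go back to the code and see what h & (length-1) does.

The length passed in is the length of the Entry array. Array indices start at 0, so the largest index is length-1. If length is a power of 2, the low-order binary digits of length-1 are all 1s. h & (length-1) therefore discards the high-order bits of h and keeps only the low-order bits as the array index. This is why the size of the Entry array is required to be a power of 2: it allows this algorithm to be used to determine the array index.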
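A short experiment shows both halves of this argument. IndexForDemo is a hypothetical class name; indexFor is copied from the listing above. For a power-of-two length the mask gives the same result as a non-negative modulo, while for any other length some slots can never be selected.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class; indexFor uses the same expression as the listing above.
final class IndexForDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16; // a power of two, so length - 1 is 0b1111
        for (int h : new int[] {0b01011010, 37, -7, Integer.MIN_VALUE}) {
            // For power-of-two lengths the mask agrees with Math.floorMod.
            System.out.println(h + " -> " + indexFor(h, length)
                    + " (floorMod: " + Math.floorMod(h, length) + ")");
        }

        // With a length that is not a power of two the mask has gaps:
        // 10 - 1 = 0b1001, so only slots 0, 1, 8 and 9 can ever be chosen.
        Set<Integer> reachable = new TreeSet<>();
        for (int h = 0; h < 100; h++) {
            reachable.add(indexFor(h, 10));
        }
        System.out.println(reachable); // [0, 1, 8, 9]
    }
}
```

The mask also handles negative hash codes without a special case, which a plain % operator would not.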

6. How does the hash function compute the hash code?

// Function that generates the hash code
final int hash(Object k) {
   int h = hashSeed;
   // If alternative hashing is enabled and the key is a String, use a different hash algorithm
   if (0 != h && k instanceof String) {
       return sun.misc.Hashing.stringHash32((String) k);
   }
   h ^= k.hashCode();
   // Perturbation function
   h ^= (h >>> 20) ^ (h >>> 12);
   return h ^ (h >>> 7) ^ (h >>> 4);
}


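The last two lines of the hash method are the ones that actually compute the hash value. This algorithm is known as a perturbation function: it mixes the bits of the value together, here using four unsigned right shifts. The goal is to blend the high-order bits of h into the low-order bits and so increase the randomness of the low-order bits. As we saw above, the array index is determined by the low-order bits of the hash code. The key's hash code comes from its hashCode method, and a poor hashCode implementation may produce values whose low-order bits repeat heavily. The perturbation function helps map hash codes onto the array more evenly: folding the characteristics of the high-order bits into the low-order bits makes the low-order bits more random, spreads entries more evenly across the table, and so improves performance. The figure below gives an example to help with understanding.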
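The effect can also be checked in code. The sketch below copies the two shift-and-XOR lines from the hash method, assuming hashSeed is 0 so that only the perturbation step remains. SpreadDemo and spread are hypothetical names. The two sample values differ only in their high-order bits, so they collide in a 16-slot table until the bits are mixed.

```java
// Hypothetical demo class; spread reproduces the last two lines of hash() with hashSeed == 0.
final class SpreadDemo {
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        int a = 0x10000; // low four bits are 0000
        int b = 0x20000; // low four bits are 0000 as well

        // Without mixing, both values fall into slot 0.
        System.out.println(indexFor(a, length) + " " + indexFor(b, length));

        // After mixing, the high-order difference reaches the low-order bits.
        System.out.println(indexFor(spread(a), length) + " "
                + indexFor(spread(b), length)); // 1 2
    }
}
```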

7. What is alternative hashing?

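The hash method begins by assigning hashSeed to h. hashSeed is the hash seed: a random value whose purpose is to help strengthen the hash function. It defaults to 0, which means alternative hashing is not used by default. So when is hashSeed used? Alternative hashing must first be enabled by setting the system property jdk.map.althashing.threshold. The default for this property is -1, and when it is -1 the threshold for using alternative hashing is Integer.MAX_VALUE, which means that in practice you will probably never use alternative hashing. You can of course set the threshold lower, so that a random hashSeed is generated once the map has grown to that threshold, which makes the hash function more random. Why use alternative hashing at all? Once the map has grown to the threshold you set, the hash table already holds many entries and hash collisions become much more likely. Using a more random hash function for the elements added from then on spreads them more randomly across the table.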

Note: the analysis above is based entirely on JDK 1.7. Other versions differ significantly, so readers should keep this in mind.
