Programmer Growth Extra: An Introduction to HashMap

I recently read through some articles on the HashMap source code online and found that while the basics are covered, some places are still not explained thoroughly enough, so I'd like to share my own understanding.

1. Basic knowledge

Everyone knows that HashMap is a common data structure in Java with a wide range of uses. For example, it can serve as a container for passing parameters, and Spring uses it for bean management. It extends the AbstractMap class and implements the Map interface, so we usually create instances through an upcast to the interface type.
For example:

Map<Type1, Type2> map = new HashMap<>();

In addition, a parameter can be passed in the parentheses: the initial capacity (initialCapacity) of the HashMap. It is generally a power of 2 (why will be discussed later), and the default is 16.
For example:

Map<Type1, Type2> map = new HashMap<>(16);

PS: The capacity can also be any other number, such as 10 or 14, but during initialization it is automatically rounded up to the smallest power of 2 not less than that number. How this is done will be explained later.

If you want to store data in it later, just write it like this:

map.put(key1,value1);
map.put(key2,value2);

To fetch data, write this:

map.get(key1);   //value1
map.get(key2);	 //value2

In addition, the map's entries can be traversed in a loop, with code as follows:

for (Map.Entry<Type1, Type2> entry : map.entrySet()) {
	Type1 key = entry.getKey();
	Type2 value = entry.getValue();
	// further processing
	...
}

It can also be traversed by key, with code as follows:

for (Type1 key : map.keySet()) {
	Type2 value = map.get(key);
	// further processing
	...
}

The map can also be traversed through iterators, the code is as follows:

Iterator<Map.Entry<Type1, Type2>> it = map.entrySet().iterator();
while (it.hasNext()) {
	Map.Entry<Type1, Type2> entry = it.next();
	// process the entry
	Type1 key = entry.getKey();
	Type2 value = entry.getValue();
	...
}

The map can also remove elements:

map.remove(key1);

But remember not to modify the map with put or remove inside the two loops above, or a java.util.ConcurrentModificationException will be thrown. In the source code, the iterator over the map's entries keeps an expectedModCount variable, while the map itself keeps a variable called modCount. Take remove as an example: when the map's remove method is called, the map's modCount is incremented, but the iterator's expectedModCount does not change, so when the iterator (implemented by EntryIterator) fetches the next node, the mismatch is detected and the exception is thrown. The code is as follows:

// 1. This check appears in HashMap's abstract HashIterator class
class HashMap<K,V> extends AbstractMap<K,V> implements Map<K,V>, Cloneable, Serializable {
	...
	transient int modCount; // number of structural changes to the HashMap
	...
	abstract class HashIterator {
		Node<K,V> next;
		Node<K,V> current;
		int expectedModCount; // the modCount the iterator expects to see
		int index;
		...
		final Node<K,V> nextNode() {
			Node<K,V> e = next;
			// the key check
			if (HashMap.this.modCount != this.expectedModCount) {
				throw new ConcurrentModificationException();
			}
			...
		}
		...
	}
	// 2. EntryIterator implements this abstract class inside the map
	final class EntryIterator extends HashIterator
		implements Iterator<Map.Entry<K,V>> {
		public final Map.Entry<K,V> next() { return nextNode(); }
	}
	// 3. EntrySet creates the EntryIterator instance
	final class EntrySet extends AbstractSet<Map.Entry<K,V>> {
		public final int size() {
			return size;
		}
		public final void clear() {
			HashMap.this.clear();
		}
		public final Iterator<Map.Entry<K,V>> iterator() {
			return new EntryIterator();
		}
		...
	}
}

However, if you use the iterator's own remove method, the iterator's expectedModCount and the map's modCount are kept in sync automatically, so no exception is thrown.
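
A minimal sketch of the difference, with made-up keys and values, assuming the usual java.util imports:

// Removing through the map while iterating throws
// ConcurrentModificationException on the next call to it.next();
// removing through the iterator keeps expectedModCount in sync.
Map<String, String> map = new HashMap<>();
map.put("a", "1");
map.put("b", "2");

Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
while (it.hasNext()) {
	Map.Entry<String, String> entry = it.next();
	if ("a".equals(entry.getKey())) {
		it.remove();        // safe: also updates expectedModCount
		// map.remove("a"); // unsafe: the next it.next() would throw
	}
}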

The map can also replace elements; the three-argument form replaces the value only if the current mapping equals oldValue:
map.replace(Type1 key, Type2 oldValue, Type2 newValue)
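
A quick usage sketch (the keys and values are made up): the three-argument replace returns whether the swap happened.

Map<String, String> m = new HashMap<>();
m.put("name", "zhangsan");
boolean swapped = m.replace("name", "zhangsan", "lisi"); // true, value changed
m.replace("name", "wangwu", "zhaoliu");                  // false, nothing happens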

2. The storage principle

The basic structure of a HashMap is shown below.
[Figure: an array of buckets (the table), each bucket holding a linked list of Node entries]
Its underlying data structure is an array of linked lists, i.e. Node<Type1, Type2>[] table: an array whose elements are linked lists. Node is a class that implements the Map.Entry interface (before JDK 1.8 it was called Entry; from 1.8 on it is called Node; only the name changed), used to store key-value pairs. Its basic code is as follows:

static class Node<K,V> implements Map.Entry<K,V> {
	final int hash; // the hash value, used to determine the bucket index
	final K key;
	V value;
	Node<K,V> next; // the next element in the linked list
	// Note the package-private constructor: outside code cannot create
	// or reach nodes in the form Map.Node; entries are obtained through
	// the map.entrySet() method instead
	Node(int hash, K key, V value, Node<K,V> next) {
		this.hash = hash;
		this.key = key;
		this.value = value;
		this.next = next;
	}
	public final K getKey() {
		return key;
	}
	public final V getValue() {
		return value;
	}
	public final String toString() {
		return key + "=" + value;
	}
	public final int hashCode() {
		return Objects.hashCode(key) ^ Objects.hashCode(value);
	}
}

Careful friends will have noticed that the constructor here is package-private, accessible only within the same package, so outside code cannot obtain these elements in the form Map.Node; they must be obtained through the built-in map.entrySet() method.

So the question is: since we know the bottom layer of HashMap is an array of linked lists, how does it store data? How is the hash value in Node calculated, and what is it used for?

Food has to be eaten one bite at a time, and problems solved one at a time. First, look at the process of storing data. Suppose we want to store this data: {"name": "zhangsan", "age": "16", "sex": "male"}, clearly a piece of personal information, and we store it in the HashMap as key-value pairs.

  • Step 1: Call the put(key, value) method
  • Step 2: Inside put(key, value), the hash of the key is computed, as follows:
// JDK 1.7
final int hash(Object k) {
	int h = hashSeed;
	if (0 != h && k instanceof String) {
		return sun.misc.Hashing.stringHash32((String) k);
	}
	h ^= k.hashCode();
	h ^= (h >>> 20) ^ (h >>> 12);
	return h ^ (h >>> 7) ^ (h >>> 4);
}
// JDK 1.8
static final int hash(Object key) {
	int h;
	return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

Taking JDK 1.8 as the example: an unsigned right shift is used here (high bits are filled with 0 as the bits move right), and the upper 16 bits are XORed with the key's own hash code. The & and | operators are not used because their outputs are biased toward 0 and 1 respectively, which would make the distribution less uniform. The benefit is that the low bits now carry the characteristics of the high bits, reducing the chance that different hashes coincide in their low bits, making the hash as uniform as possible, and so reducing collisions when the bucket position is determined later. The function used here is a hash function: inputs of different lengths are mapped to outputs of a fixed length. How a particular class implements hashCode itself is something readers can look up on their own.
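
To see what this spreading buys, here is a small standalone sketch with made-up hash codes that differ only above the low 16 bits:

// Two hash codes that differ only in their upper bits would collide in a
// 16-slot table without the spreading step.
int h1 = 0x00010001, h2 = 0x00020001;
int length = 16;

System.out.println(h1 & (length - 1)); // 1
System.out.println(h2 & (length - 1)); // 1, same bucket: collision

int s1 = h1 ^ (h1 >>> 16); // 0x00010000
int s2 = h2 ^ (h2 >>> 16); // 0x00020003
System.out.println(s1 & (length - 1)); // 0
System.out.println(s2 & (length - 1)); // 3, different buckets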

  • Step 3: Determine the key-value pair's storage position in the HashMap, i.e. the index, from the hash value. In JDK 1.7 there is a function indexFor(int h, int length) for this (JDK 1.8 inlines it, but the principle is the same). The source is as follows:
static int indexFor(int h, int length) {
	return h & (length-1);
}

Let me explain: h is the result of the hash function, and length is the length of the HashMap's array. But why is length reduced by 1?

Before answering that, let me fill in an earlier gap and explain why the default array length is a power of 2 (16 by default). The characteristic of a power of 2 is that once you subtract 1, every bit below the highest bit is 1, so an AND (&) with any hash simply keeps the hash's low bits as the result.

What is the benefit? Bit operations are fast, and ANDing the hash against a number whose low bits are all 1 keeps distinct hashes apart, reducing collisions and improving the uniformity of the distribution; an OR or XOR would not. For example, with length = 8, length - 1 = 0111:

1101 (hash) & 0111 = 0101 (index 5)
1111 (hash) & 0111 = 0111 (index 7)

But with an OR instead:

1101 (hash) | 0111 = 1111
1111 (hash) | 0111 = 1111

and the two hashes collide.

In short, the process of hash storage is as follows: put(key,value) -> hash(key) -> indexFor(hash) -> index
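
As a small check of the index step: for a power-of-two length, h & (length - 1) agrees with h % length for non-negative h, but costs a single AND instead of a division. A sketch with arbitrary hash values:

int length = 16; // a power of 2
for (int h : new int[]{29, 53, 97, 1024}) {
	// mask and modulo agree when length is a power of two
	System.out.println((h & (length - 1)) + " == " + (h % length));
}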

Then the question comes again: even a uniform distribution cannot rule out collisions entirely, so how does HashMap deal with them? Colliding entries are chained in the bucket's linked list, and since JDK 1.8, once a chain grows long enough it is stored as a red-black tree instead.

Why switch to a red-black tree, and why a cutoff of 8?
From data structures we know the average search cost of a red-black tree is O(log n), while that of a linked list is about n/2.
When the chain length is below 8, log n and n/2 differ little, and building a red-black tree brings extra overhead, so a linked list is used. But once the length grows past 8, the gap between log n and n/2 keeps widening, and the red-black tree is undoubtedly the better choice.

3. The expansion principle

-- 2022.1.21: I'm back after more than half a month of work --
HashMap's expansion is, you could say, neither complicated nor entirely simple. Without further ado, let's get straight to the point: how is the expansion mechanism implemented?
First let's look at a piece of source code:

	/**
	 * tableSizeFor computes the table size used for expansion: the
	 * smallest power of 2 that is not less than cap
	 */
    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

OK, let's analyze the source code. cap here is obviously the requested length for the HashMap's array. But what do the OR operations and unsigned right shifts mean? And why is cap first reduced by one, with one added back at the end (when the maximum length is not exceeded)?

First let cap be an arbitrary number, say 31 (a classic prime).
Then n = 31 - 1 = 30, which in binary is

00000000 00000000 00000000 00011110 (an int is 4 bytes, i.e. 32 bits)

The highest set bit is the fifth bit (counting from the right). n >>> 1 shifts every bit one place to the right, filling with 0 on the left; ORing that with the previous value forces the fourth and fifth bits to 1 (OR yields 1 if either operand has a 1 in that position). So n |= n >>> 1 gives:

00000000 00000000 00000000 00011111

Next, that result is shifted right by two more bits, giving 00000000 00000000 00000000 00000111, and ORed in again, so n |= n >>> 2 gives:

00000000 00000000 00000000 00011111

The remaining shifts (by 4, 8, and 16) work the same way. After all of them, reading from the highest set bit down, every lower bit is 1 and every higher bit is 0; adding 1 then yields a power of 2. Isn't this algorithm wonderful!

So what is the purpose of this calculation, and where is its ingenuity?
The purpose is to turn a number that is not a power of 2 into the smallest power of 2 not less than it. The clever pairing of OR with unsigned right shift keeps the work down: only 5 shift-and-OR steps are needed to cover all 32 bits of an int. Awesome!

And why must cap be reduced by one first, with one added back at the end (when the maximum is not exceeded)?
Take cap = 32 as another example. Without subtracting one first, 32 in binary is

00000000 00000000 00000000 00100000

After n |= n >>> 1:

00000000 00000000 00000000 00110000

then n |= n >>> 2:

00000000 00000000 00000000 00111100

and so on.
Finally adding 1 would give not 32 but 64, i.e.

00000000 00000000 00000000 01000000

which is not the smallest power of 2 not less than 32 (that should be 32 itself).
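
To check the behavior, here is a standalone copy of the function for experimenting (MAXIMUM_CAPACITY is 1 << 30 in the real HashMap):

static final int MAXIMUM_CAPACITY = 1 << 30;

static int tableSizeFor(int cap) {
	int n = cap - 1;
	n |= n >>> 1;
	n |= n >>> 2;
	n |= n >>> 4;
	n |= n >>> 8;
	n |= n >>> 16;
	return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

// tableSizeFor(10) -> 16, tableSizeFor(16) -> 16,
// tableSizeFor(31) -> 32, tableSizeFor(32) -> 32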

With the above groundwork, we can now talk about the expansion itself.
Thanks again to lkforce for the write-up, linked here:
https://blog.csdn.net/lkforce/article/details/89521318
First, take a look at the source code:

	// JDK 1.7
    void resize(int newCapacity) {
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        if (oldCapacity == MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return;
        }
        Entry[] newTable = new Entry[newCapacity];
        transfer(newTable, initHashSeedAsNeeded(newCapacity));
        table = newTable;
        threshold = (int)Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
    }
    /**
     * Migrates the data from the old array into the new one.
     */
    void transfer(Entry[] newTable, boolean rehash) {
        int newCapacity = newTable.length;
        for (Entry<K,V> e : table) {
            while (null != e) {
                Entry<K,V> next = e.next;              // step 1
                if (rehash) {
                    // recompute the hash value
                    e.hash = null == e.key ? 0 : hash(e.key);
                }
                int i = indexFor(e.hash, newCapacity); // step 2
                e.next = newTable[i];                  // step 3
                newTable[i] = e;                       // step 4
                e = next;                              // step 5
            }
        }
    }

For JDK 1.7, expansion mainly takes the following steps:

  1. Allocate new memory space
  2. Copy the data over
  3. Update the threshold, which is load factor * new array capacity (with the default load factor of 0.75 and a capacity of 16, that is 12)

Here I will mainly talk about the transfer function: it walks each bucket's linked list from head to tail, and re-inserts nodes into the new table using head insertion.

  1. Step 1 saves the current node's successor into next.
  2. Step 2 recomputes the index position from the node's hash. (As mentioned for indexFor earlier, the position is decided by the low n bits.)
  3. Step 3 sets the current node's successor to newTable[i], i.e. points it at whatever currently sits at position i of the new array, so the list can be re-linked.
  4. Step 4 replaces newTable[i] with the current node.
  5. Step 5 moves on to the next node of the original chain.
    Friends who feel confused here can check the link below: https://blog.csdn.net/lkforce/article/details/89521318

Note: when this method runs under multi-threading, it can produce dead links (circular linked lists) and lose data. For details, see:
https://blog.csdn.net/XiaoHanZuoFengZhou/article/details/105238992

JDK 1.8 expansion:

    final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table; // null until the first initialization
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        int oldThr = threshold; // 0 when the default constructor was used
        int newCap, newThr = 0;
        if (oldCap > 0) { // the table has been allocated before
            // if the current capacity is already the maximum, return as is
            if (oldCap >= MAXIMUM_CAPACITY) {
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                // the table capacity doubles, and so does the threshold
                newThr = oldThr << 1; // double threshold
        }
        else if (oldThr > 0) // initial capacity was placed in threshold
            // with an initial-capacity constructor, the first table size
            // is the threshold computed at construction time
            newCap = oldThr;
        else { // first allocation under the default constructor
            // zero initial threshold signifies using defaults
            newCap = DEFAULT_INITIAL_CAPACITY;
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        if (newThr == 0) {
            // reached when an initial capacity was supplied: compute the
            // threshold for the new capacity here
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        if (oldTab != null) {
            for (int j = 0; j < oldCap; ++j) {
                Node<K,V> e;
                if ((e = oldTab[j]) != null) {
                    // help gc
                    oldTab[j] = null;
                    if (e.next == null)
                        // no hash collision at this index: the new slot is
                        // simply hash & (newCap - 1), since expansion always
                        // doubles and newCap stays a power of 2
                        newTab[e.hash & (newCap - 1)] = e;
                    else if (e instanceof TreeNode)
                        // this bucket is a red-black tree; split() divides it
                        // between the two new buckets and converts a part back
                        // to a list when its node count drops to
                        // UNTREEIFY_THRESHOLD or below (the tree details are
                        // long and are not covered here)
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else { // preserve order
                        // split this bucket's list into two lists to cut
                        // down the migration work
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do {
                            next = e.next;
                            if ((e.hash & oldCap) == 0) {
                                // list of nodes that stay put after expansion
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            else {
                                // list of nodes that move after expansion
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        if (loTail != null) {
                            // help gc
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        if (hiTail != null) {
                            // help gc
                            hiTail.next = null;
                            // the moved nodes land at old index + oldCap
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }
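
Worth isolating is the (e.hash & oldCap) == 0 test that drives the lo/hi split: since newCap is oldCap doubled, the new mask newCap - 1 has exactly one extra bit, and that bit is oldCap itself, so each node either stays at index j or moves to j + oldCap. A standalone sketch with made-up hashes:

int oldCap = 16;
int newCap = oldCap << 1; // 32
// four hashes that all land in bucket 5 of the old table
for (int h : new int[]{5, 21, 37, 53}) {
	int oldIndex = h & (oldCap - 1);
	int newIndex = h & (newCap - 1);
	boolean stays = (h & oldCap) == 0; // tests the single extra mask bit
	System.out.println("hash=" + h + " old=" + oldIndex + " new=" + newIndex
			+ (stays ? " (stays)" : " (moves to j + oldCap)"));
}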
