07 - Explain the design and optimization of HashMap in simple terms

The Collection interface was covered in the previous lecture. Besides Collection, the Java container classes also define another very important interface, Map, which is used to store key-value pair data.

HashMap is one of the containers we use most frequently in daily development, and I am sure you are familiar with it. Today we will start from the underlying implementation of HashMap and gain an in-depth understanding of its design and optimization.

1. Commonly used data structures

When sharing the List collection classes, I mentioned that ArrayList is implemented on top of an array and LinkedList on top of a linked list; the HashMap we discuss today is implemented on top of a hash table. Let's first review the commonly used data structures together, which will also help with understanding the content that follows.

Array: stores data in contiguous storage units. Looking up an element by index takes O(1) time, but inserting at the head or in the middle of the array requires copying and shifting the elements that follow.

Linked list: a storage structure that is neither contiguous nor sequential in physical memory; the logical order of the data elements is established by the chain of pointers between nodes.

A linked list consists of a series of nodes (the elements of the list), which can be created dynamically at runtime. Each node has two parts: a data field that stores the element and a pointer field that stores the address of the next node.

Since a linked list does not need contiguous storage, insertion can be done in O(1) time, but finding a node, or accessing the node at a given position, takes O(n) time.

Hash table: a data structure accessed directly by key. It maps a key to a location in a table and stores the record there, which speeds up lookups. The mapping function is called a hash function, and the array that stores the records is called a hash table.

Tree: a hierarchical structure made up of n (n ≥ 1) finite nodes, which looks like an upside-down tree.

2. The implementation structure of HashMap

After understanding the data structure, let's look at the implementation structure of HashMap. As the most commonly used Map class, it is implemented based on a hash table, inherits AbstractMap and implements the Map interface.

A hash table maps the hash value of a key to a memory address and stores the corresponding value at that address. In other words, HashMap determines where a value is stored from the hash value of its key, and this indexing scheme lets HashMap retrieve data very quickly.

For example, when storing the key-value pair (x, "aa"), the hash table computes the actual storage location of "aa" through a hash function f(x).
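To make the idea concrete, here is a toy sketch of such a mapping. This is only an illustration I made up, not HashMap's real scheme, which we will see shortly:

    // A toy hash function: maps any key to a slot index within the table.
    // Masking with 0x7fffffff clears the sign bit so the modulo result
    // is never negative.
    static int f(Object key, int tableLength) {
        return (key.hashCode() & 0x7fffffff) % tableLength;
    }

With tableLength = 16, f(x) yields an index between 0 and 15, and that index is where "aa" would be stored.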

But this introduces a new problem. If another pair (y, "bb") arrives and the hash function gives f(y) the same value as f(x), the two objects map to the same storage address; this phenomenon is called a hash collision. So how does a hash table resolve it? There are several approaches, such as open addressing, rehashing, and the chain address method.

Open addressing is straightforward: when a collision occurs and the hash table is not yet full, there must be an empty slot somewhere, so the key can be stored in an empty slot found from the position of the collision. This method has many drawbacks, for example around lookup and expansion, so I don't recommend it as the first choice for resolving hash collisions.

Rehashing, as the name suggests, computes the address with another hash function whenever a collision occurs, repeating until no collision remains. This method is unlikely to produce clustering, but it increases computation time. If the time cost of adding elements doesn't matter and query performance is critical, this design is worth considering.

Weighing all these factors, HashMap uses the chain address method to resolve hash collisions. This method combines two data structures, an array (the hash table) and linked lists: when a hash collision occurs, entries with the same hash value are stored in a linked list.
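To show the structure, here is a minimal sketch of the chain address method, written by me purely for illustration. It is deliberately simplified and is not the JDK source; it has no resizing and no null-key handling:

    // A bare-bones chained hash table: an array of nodes, where entries
    // that collide on the same slot are linked into a list.
    class ChainedTable<K, V> {
        private static class Node<K, V> {
            final K key;
            V value;
            Node<K, V> next;
            Node(K key, V value, Node<K, V> next) {
                this.key = key; this.value = value; this.next = next;
            }
        }

        @SuppressWarnings("unchecked")
        private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

        private int slot(K key) {
            return (key.hashCode() & 0x7fffffff) % table.length;
        }

        public void put(K key, V value) {
            int i = slot(key);
            for (Node<K, V> n = table[i]; n != null; n = n.next) {
                if (n.key.equals(key)) { n.value = value; return; } // replace existing
            }
            table[i] = new Node<>(key, value, table[i]); // link new node at the head
        }

        public V get(K key) {
            for (Node<K, V> n = table[slot(key)]; n != null; n = n.next) {
                if (n.key.equals(key)) return n.value;
            }
            return null; // no mapping for this key
        }
    }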

3. Important properties of HashMap

From the source code of HashMap, we can see that HashMap is composed of an array of Node elements, and each Node contains one key-value pair.

transient Node<K,V>[] table;

Node, an inner class of HashMap, defines a next pointer in addition to the key and value attributes. When a hash collision occurs, the Node already stored in the array for that hash value uses this next pointer to reference the newly added Node object that shares the same hash value.

static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;
 
        Node(int hash, K key, V value, Node<K,V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
}

HashMap also has two other important attributes: the load factor (loadFactor) and the boundary value (threshold). Both are key parameters involved when a HashMap is initialized.

    int threshold;

    final float loadFactor;

The loadFactor attribute is used to indirectly set the size of the memory space of the hash table (the Node array). When a HashMap is created without arguments, loadFactor defaults to 0.75. Why 0.75?

This is because, for a hash table that uses chaining, the average time to find an element is O(1 + n), where n is the length of the linked list that must be traversed. The larger the load factor, the more fully the space is used, but the longer the linked lists become and the lower the lookup efficiency; if the load factor is set too small, the data in the hash table will be too sparse, wasting a lot of space. The value 0.75 is a compromise between the two.

So, is there a way to solve the high query cost caused by overly long linked lists? Think about it first; I will come back to it shortly.

The threshold is computed from the initial capacity and loadFactor: threshold = capacity × loadFactor. When a HashMap is created without arguments, the default boundary value is 16 × 0.75 = 12. If we set a small initial capacity, then whenever the number of Nodes in the HashMap exceeds the threshold, HashMap calls the resize() method to re-allocate the table array. This copies and migrates the HashMap's array to another block of memory, which hurts efficiency.
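The following fragment illustrates the default numbers; it is assumed to run inside a method, with java.util.HashMap and java.util.Map imported:

    // With the no-arg constructor, capacity defaults to 16 and
    // loadFactor to 0.75, so threshold = 16 * 0.75 = 12.
    Map<Integer, String> m = new HashMap<>();
    for (int i = 1; i <= 13; i++) {
        m.put(i, "v" + i); // the 13th put makes size exceed 12 and triggers resize()
    }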

4. HashMap adds element optimization

After initialization is complete, HashMap can add key-value pairs with the put() method. As can be seen from the source code below, when the program adds a key-value pair to the HashMap, it first obtains the value of the key's hashCode(), then computes the final hash value through the hash() method, and finally determines the Node's storage slot through (n - 1) & hash inside the putVal() method.

 public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }
 static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
    // excerpt from putVal():
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // (n - 1) & hash inside putVal decides the storage slot of this Node
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
If you are not very clear about the algorithm of hash() and (n-1)&hash, please read the detailed description below.

Let's first understand the algorithm in the hash() method. What would happen if, instead of computing a hash with the hash() method, we used the object's hashCode value directly?

Suppose we want to add two objects a and b to a table whose length is 16. The formula (n - 1) & hash then becomes (16 - 1) & a.hashCode and (16 - 1) & b.hashCode. The 32-bit binary value of 15 is 00000000 00000000 00000000 00001111. Suppose the hashCode of object a is 01000010 00111000 10000011 11000000 and the hashCode of object b is 00111011 10011100 01010000 10100000: the lower four bits of both are 0000, so both AND results are 0. Such a hash result is disappointing; it is clearly not a good hash algorithm.

But if we shift the hashCode 16 bits to the right (h >>> 16 is an unsigned right shift by 16 bits), taking exactly the upper half of the int, and then combine it with the original value using a bitwise XOR (the result bit is 1 where the two corresponding bits differ, otherwise 0), the high bits get mixed into the low bits and the situation above is avoided. This is exactly how the hash() method is implemented: in short, it scrambles the lower 16 bits of the hashCode, the bits that actually take part in the index calculation.

Now let me explain how (n - 1) & hash is designed. Here n is the length of the hash table, which is always set to a power of 2; this guarantees that the index computed by (n - 1) & hash always falls within the bounds of the table array. For example: when hash = 15 and n = 16, the result is 15; when hash = 17 and n = 16, the result is 1.
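Here is a small fragment that mirrors the calculation; the key and values are just assumptions for the demonstration:

    int n = 16;                    // table length, always a power of two
    Object key = "demo";           // any key would do
    int h = key.hashCode();
    int hash = h ^ (h >>> 16);     // the same spreading done by hash()
    int index = (n - 1) & hash;    // guaranteed to fall in [0, 15]
    // e.g. (16 - 1) & 15 == 15 and (16 - 1) & 17 == 1, matching the text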

After the storage location of the Node is obtained, if the Node is judged not to be in the hash table yet, a new Node is created and added to the hash table. I will use a picture to illustrate the whole process:

(Figure: the flow of adding an element in JDK 1.8 HashMap, covering the array, linked list, and red-black tree.)

From the figure we can see that in JDK 1.8, HashMap introduced the red-black tree data structure to improve query efficiency when linked lists grow long.

This is because once a linked list grows past 8 elements, a red-black tree queries faster than the list, so at that point HashMap converts the list into a red-black tree. Note, however, that insertion into a red-black tree becomes slower, because keeping the tree balanced requires left and right rotations. With this, the problem I raised earlier, the high query cost of an overly long linked list, is solved.

The following is the implementation source code of putVal():

    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
            // 1. If table is null or its length is 0, the table has not been
            // initialized yet, so obtain an initialized table via resize()
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
            // 1.1 Use the value of (n - 1) & hash as the table index i, and let
            // p denote tab[i], the first node of that bucket's list; then check
            // whether p is null
            tab[i] = newNode(hash, key, value, null);
            // 1.1.1 If p is null, tab[i] holds no element yet, so create the
            // first Node via newNode and assign the new node to tab[i]
        else {
            // 2.1 Now p is not null; there are three cases: p is a list node;
            // p is a red-black tree node; or p is a list node whose list has
            // reached the critical length TREEIFY_THRESHOLD, so inserting one
            // more element will turn it into a red-black tree
            Node<K,V> e; K k;
            if (p.hash == hash &&
                    ((k = p.key) == key || (key != null && key.equals(k))))
            // 2.1.1 HashMap considers two keys the same when their hashes are
            // equal and equals() holds. Here we check whether p.key equals the
            // inserted key; if so, assign the reference p to e
                e = p;
            else if (p instanceof TreeNode)
            // 2.1.2 First case: p is a red-black tree node, so after insertion
            // it is certainly still a tree node; cast p to TreeNode, call
            // TreeNode.putTreeVal, and assign the returned reference to e
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
            // 2.1.3 Next, p is a list node, covering the other two cases: the
            // bucket stays a list after insertion, or it is converted to a
            // red-black tree. The cast above also shows that TreeNode is a
            // subclass of Node
                for (int binCount = 0; ; ++binCount) {
            // We need a counter to count the elements of the current list
            // while traversing it; binCount is that counter

                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1)
            // After a successful insertion, check whether the list must be
            // converted to a red-black tree: the insertion lengthens the list
            // by 1 but binCount does not include the new node, so compare
            // against the critical threshold minus 1
                            treeifyBin(tab, hash);
            // When the new length meets the conversion condition, call
            // treeifyBin to convert the list into a red-black tree
                        break;
                    }
                    if (e.hash == hash &&
                            ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

5. HashMap get element optimization

HashMap's query performance is at its best when there are only array slots and no Node linked lists. Once a large number of hash collisions occur, linked lists form, and each query may then have to traverse a Node list, which degrades query performance.

The performance drop is especially noticeable when the lists grow long. The red-black tree solves this problem well, reducing the average query complexity to O(log n); the longer the original list, the more obvious the efficiency gain after it is replaced by a red-black tree.

We can also optimize HashMap's performance in our own code, for example by overriding the hashCode() method to reduce hash collisions, thereby reducing the creation of linked lists and using the hash table efficiently to improve performance.
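As a hedged example of this coding-level optimization, here is a hypothetical key class (the Point class and its fields are mine, not from the JDK) whose hashCode() combines all fields so that keys spread evenly across buckets; note that equals() must be overridden consistently with it:

    final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            // combining both fields gives far fewer collisions than,
            // say, returning a constant or hashing only x
            return java.util.Objects.hash(x, y);
        }
    }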

6. HashMap expansion optimization

HashMap is also a data structure built on an array, so it too has to deal with expansion.

In JDK 1.7, the whole expansion process takes each array slot's elements in turn; since JDK 1.7 inserts at the head, that first element is generally the last one put into the linked list. It then traverses the singly linked list headed by that element, recomputes each traversed element's index in the new array from its hash value, and inserts it there. This style of expansion turns the tail of an original collided list into the head of the list after expansion.

In JDK 1.8, HashMap optimizes the expansion operation. Since the new array length is exactly double the old one (for example, expanding from tableSize = 4 to 8 changes binary 0100 into 1000, a left shift of one bit), it suffices to AND each element's hash value with the old capacity, that is, with the single bit that was shifted in. If the result is 0, the element's index stays unchanged; if it is 1, the new index is the old index plus the length of the array before expansion.

The reason this AND operation can redistribute the indices is that hash values are inherently random: whether the AND of a hash with the old capacity yields 0 (keep the index from before expansion) or 1 (index from before expansion plus the old array length) is itself random, so the expansion process spreads the elements that previously collided randomly across different indices.
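A small sketch of the split rule, my own illustration of the idea rather than the JDK resize() source, with assumed values:

    int oldCap = 16;                  // capacity before expansion, a power of two
    int hash = "example".hashCode();  // some element's hash value
    int oldIndex = hash & (oldCap - 1);
    int newIndex = (hash & oldCap) == 0
            ? oldIndex                // bit is 0: index unchanged
            : oldIndex + oldCap;      // bit is 1: old index + old capacity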

7. Summary

HashMap stores key-value pairs in the form of a hash table data structure; the advantage of this design is the high efficiency of querying key-value pairs.

When we use HashMap, we can set its two parameters, the initial capacity and the load factor, according to our own scenario: when query operations are frequent, we can appropriately lower the load factor; when memory utilization matters more, we can appropriately raise it.

We can also set the initial capacity in advance when the amount of data to be stored is predictable (initial capacity = predicted data volume / load factor). The advantage is fewer resize() operations, which improves HashMap's efficiency.
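For example, a sketch of presizing; the expected volume of 1000 entries is an assumption for illustration:

    int expected = 1000;
    float loadFactor = 0.75f;
    // initial capacity = predicted data volume / load factor, rounded up
    int initialCapacity = (int) Math.ceil(expected / loadFactor); // 1334
    Map<String, String> cache = new HashMap<>(initialCapacity, loadFactor);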

HashMap implements the chain address method by combining two data structures, arrays and linked lists: when hash values collide, the conflicting key-value pairs are linked into a list.

But this method still has a performance problem: if a linked list grows too long, the time complexity of querying data increases. In Java 8, HashMap uses a red-black tree to solve the query degradation caused by overly long linked lists.

8. Thinking questions

In practical applications, the initial capacity we set is generally an integer power of 2. Do you know why?
