Detailed explanation of the underlying implementation principle of HashMap


One, quick start

Note: readers who already have some foundation can choose to skip this section.

HashMap is the data type Java programmers use most frequently for mapping key-value pairs (key and value). As the JDK evolved, JDK1.8 optimized HashMap's underlying implementation, including the introduction of the red-black tree data structure and improvements to expansion. This article combines the differences between JDK1.7 and JDK1.8 to discuss HashMap's data structure and working principles in depth.
Java defines the interface java.util.Map for mappings in its data structures. This interface has five commonly used implementation classes, namely HashMap, LinkedHashMap, Hashtable, TreeMap, and IdentityHashMap. This article mainly explains HashMap and its underlying implementation principles.

1. Common methods of HashMap


//        Store a key-value pair: ---------------------》 .put("key", "value"); ---------》returns the previous value for the key (null if none).
//
//        Get a value by key: -------------------------》 .get("key"); ------------------》returns the value type (null if absent).
//
//        Check whether the map is empty: -------------》 .isEmpty(); -------------------》returns boolean.
//
//        Check whether the map contains a key: -------》 .containsKey("key"); ----------》returns boolean.
//
//        Check whether the map contains a value: -----》 .containsValue("value"); ------》returns boolean.
//
//        Remove the value under a key: ---------------》 .remove("key"); ---------------》returns the removed value (null if absent).
//
//        View all values: ----------------------------》 .values(); --------------------》returns a Collection of the values.
//
//        Number of entries in the map: ---------------》 .size(); ----------------------》returns int.
//
//        View all stored keys: -----------------------》 .keySet(); --------------------》returns a Set of the keys.
//
//        View all key-value pairs: -------------------》 .entrySet(); ------------------》returns a Set of Map.Entry (key=value) objects.
//
//        Add all entries of another map of the same type: 》 .putAll(map); ------------》(the parameter is another map of the same type) no return value.
//
//        Remove a specific key-value pair: -----------》 .remove("key", "value"); ------》(removes the entry only if the key currently maps to that value) returns boolean.
//
//        Replace the value of an existing key (new in JDK 8): 》 .replace("key", "value"); 》returns the replaced value.
//
//        Clone the HashMap: --------------------------》 .clone(); ---------------------》returns Object.
//
//        Clear the HashMap: --------------------------》 .clear(); ---------------------》no return value.
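The method table above can be exercised end to end. Below is a minimal sketch using only the methods just listed; the keys and values are chosen purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// A quick tour of the common HashMap methods listed above.
public class HashMapBasics {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("key", "value");                        // store; returns previous value (null here)
        System.out.println(map.get("key"));             // value
        System.out.println(map.isEmpty());              // false
        System.out.println(map.containsKey("key"));     // true
        System.out.println(map.containsValue("value")); // true
        System.out.println(map.size());                 // 1
        System.out.println(map.keySet());               // [key]
        System.out.println(map.values());               // [value]
        System.out.println(map.entrySet());             // [key=value]
        map.replace("key", "value2");                   // JDK 8: replace an existing key's value
        System.out.println(map.remove("key"));          // value2 (the removed value is returned)
        map.clear();
        System.out.println(map.isEmpty());              // true
    }
}
```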

2. Several important knowledge points of HashMap

  1. HashMap is an unordered and non-thread-safe data structure.

  2. HashMap stores data as key-value pairs. Keys are unique (a single null key is allowed): one key maps to exactly one value, but values may repeat.

  3. Putting an existing key into a HashMap again overwrites the value stored under that key. This also differs from HashSet: adding an equal object to a Set via add does not add it again.

  4. HashMap provides the get method to fetch the corresponding value by key, whereas HashSet can only reach its objects by traversing them with an iterator.
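Points 3 and 4 can be demonstrated directly. The sketch below shows that put on an existing key overwrites the value (and returns the old one), while HashSet.add simply ignores a duplicate; the key names are illustrative.

```java
import java.util.HashMap;
import java.util.HashSet;

// put on a duplicate key overwrites; HashSet.add on a duplicate is a no-op.
public class OverwriteDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        Integer previous = map.put("a", 2);   // same key: overwrite, old value returned
        System.out.println(previous);         // 1
        System.out.println(map.get("a"));     // 2  (direct lookup by key)
        System.out.println(map.size());       // 1  (still one entry)

        HashSet<String> set = new HashSet<>();
        System.out.println(set.add("a"));     // true  (added)
        System.out.println(set.add("a"));     // false (duplicate not added again)
        System.out.println(set.size());       // 1
    }
}
```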


Two, the difference between JDK7 and JDK8 HashMap

Since we are talking about HashMap, we have to cover the differences between the JDK7 and JDK8 (and post-JDK8) implementations:

  1. A red-black tree was added in jdk8. When a bucket's linked list reaches a length of 8 (and the table has at least 64 buckets; otherwise the table is resized instead), the list is converted into a red-black tree
  2. New nodes are inserted into the linked list in a different order (jdk7 inserts at the head; jdk8 inserts at the tail, partly because the list may later need to be converted into a red-black tree)
  3. Simplified hash algorithm (jdk8)
  4. The resize logic was rewritten (jdk7's resize can form an infinite loop under concurrent access; jdk8's cannot)

Three, HashMap capacity and expansion mechanism

1. The default load factor of HashMap

    /**
     * The load factor used when none specified in constructor.
     */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;
    /**
     * The default load factor is 0.75f, i.e. 75%. Its job is to compute the
     * resize threshold. For example, a HashMap created with the no-arg
     * constructor has a default initial length of 16, so threshold = current
     * length * 0.75. Once the number of stored entries reaches the threshold,
     * the HashMap expands automatically.
     */

During the interview, the interviewer often asks a question: Why is the default load factor of HashMap 0.75 instead of 0.5 or the integer 1?
There are two answers:

  1. Threshold = load factor (loadFactor) x capacity (capacity). HashMap's expansion mechanism guarantees that the capacity is always a power of 2. To make loadFactor x capacity come out as an integer, 0.75 (3/4) is a natural choice, because its product with any power of 2 (from 4 upward) is a whole number.

  2. Theoretically, the larger the load factor, the higher the probability of hash collisions; the smaller the load factor, the more space is wasted. This is an unavoidable trade-off, and simple mathematical reasoning shows that a value around 0.75 is reasonable.
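The threshold arithmetic behind answer 1 can be checked directly. This loop uses only the JDK constants already quoted (0.75f and power-of-two capacities):

```java
// Threshold = loadFactor x capacity. Because the capacity is always a power
// of two, multiplying by 0.75 (3/4) yields a whole number for any capacity >= 4.
public class ThresholdDemo {
    public static void main(String[] args) {
        float loadFactor = 0.75f;                        // DEFAULT_LOAD_FACTOR
        for (int capacity = 16; capacity <= 128; capacity <<= 1) {
            int threshold = (int) (capacity * loadFactor);
            System.out.println(capacity + " -> " + threshold);
        }
        // prints 16 -> 12, 32 -> 24, 64 -> 48, 128 -> 96
    }
}
```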

2. HashMap expansion mechanism

Writing data may trigger expansion. In the HashMap structure there is a field that records the current number of stored entries; when that count reaches the expansion threshold, the expansion operation is triggered.

Threshold = load factor (loadFactor) x capacity (capacity)
When the number of entries stored in the HashMap reaches the threshold, the table array (also called the bucket array) is automatically expanded.

The expansion rule is this: because the table length must be a power of two, each expansion takes the previous tableSize and shifts it left by one bit. If the current tableSize is 16, shifting its binary form left by one gives 32, i.e. 16 << 1 == 32 is the capacity after expansion. In other words, the new capacity is double the old one, but remember that HashMap computes it as a one-bit left shift of the current capacity (newTableSize = tableSize << 1), not as a literal multiplication by 2.

The next question: why compute the new capacity with a shift instead of simply multiplying by 2?
A one-bit left shift is a single, very cheap CPU instruction, while integer multiplication is comparatively expensive, so the shift is the concise and efficient choice (in practice a modern JIT compiler would turn * 2 into a shift anyway; the JDK source simply writes the shift explicitly).
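The doubling-by-shift rule fits in a couple of lines; the variable names here are illustrative, not JDK source:

```java
// tableSize << 1 equals tableSize * 2 for the power-of-two sizes HashMap uses.
public class ResizeShiftDemo {
    public static void main(String[] args) {
        int tableSize = 16;
        int newTableSize = tableSize << 1;                 // shift left by one bit
        System.out.println(newTableSize);                  // 32
        System.out.println(newTableSize == tableSize * 2); // true
    }
}
```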

3. Initial length of the hash table array in HashMap

    /**
     * The default initial capacity - MUST be a power of two.
     */
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

    /**
     * The initial length of HashMap's hash table array is 16 (1 << 4).
     * An initial capacity and load factor can be set when creating a HashMap,
     * but HashMap automatically adjusts the supplied initial capacity so that
     * the actual capacity is always a power of two.
     */

The old question comes again, why is the initial size of HashMap 16?

First, looking at the HashMap source code, we can see that when a new entry is put, the index into the table array (also known as the bucket array) is computed:

int index = key.hashCode() & (length - 1); // simplified; the real code first spreads the hash

Each HashMap capacity, including every expanded capacity, is an integer power of 2.

Because the computation is done in binary, (16 - 1) is 1111, whose lowest bit is 1. This guarantees the computed index can be either odd or even, so no bucket positions are systematically skipped. As long as the incoming keys' hash values are sufficiently dispersed, the computed indexes rarely repeat, which reduces hash collisions and improves HashMap's query efficiency.

Having come this far, you might ask: is 16 special, or would any integer power of 2 do?

The answer is yes, any power of 2 works. Then why not 8 or 4? Because starting at 8 or 4 makes early expansions likely, which hurts performance, while allocating something much larger wastes memory, so 16 is used as the default initial size.
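The index computation above can be verified: for a power-of-two length, hash & (length - 1) equals the non-negative remainder mod length, so every index lands in range. The h ^ (h >>> 16) spreading step matches JDK 8's hash(); the sample keys are arbitrary.

```java
// Why index = hash & (length - 1) works when length is a power of two.
public class IndexDemo {
    public static void main(String[] args) {
        int length = 16;                          // power of two, so length - 1 == 0b1111
        for (String key : new String[]{"apple", "banana", "cherry"}) {
            int h = key.hashCode();
            int hash = h ^ (h >>> 16);            // JDK 8 spreads the high bits first
            int index = hash & (length - 1);      // always in [0, length)
            System.out.println(key + " -> " + index);
            // same result as Math.floorMod(hash, length), but a single AND
        }
    }
}
```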


Four, the structure of HashMap

The HashMap structure and storage principle differ between JDK7 and JDK8 (and later):
Jdk1.7: array + linked list (entries whose keys map to the same array index are chained in a linked list at that index)
Jdk1.8: array + linked list + red-black tree (the preset threshold is 8: when a linked list's length reaches 8, the list is turned into a red-black tree)
In Jdk1.7, a new list element is added at the head of the linked list: it first becomes the head node, then is placed at the array index.
In Jdk1.8, a new list element is appended to the tail of the list.
(Lookup by array index is very fast, but a linked list can only be traversed node by node, which is very slow. Jdk1.8 therefore introduced the red-black tree: when a list's length reaches 8, it is converted into a red-black tree to improve query efficiency.)


Five, HashMap storage principle and storage process

1. HashMap storage principle

  1. Take the passed-in key and run the hash algorithm on it to obtain a hash value.

  2. With the hash value in hand, call the indexFor method to compute the array index from the hash
    value and the array length (both are viewed in binary and combined with a bitwise AND against
    length - 1: a result bit is 1 only where both inputs are 1. This guarantees the computed index
    never exceeds the current array capacity).

  3. Store the passed key and value at that array index.

  4. If the array index already holds a value, a linked list is used: jdk7 inserts the new element at the head node, while jdk8 appends it at the tail node.
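The four steps above can be sketched as a tiny map. This is illustrative code, not JDK source: the names are made up, and resizing and treeification are omitted; only hash, index, replace-on-equal-key, and tail insertion are shown.

```java
import java.util.Objects;

// Minimal sketch of the storage principle: hash -> index -> replace or append.
class MiniHashMap<K, V> {
    static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    // Step 1: derive a hash from the key (JDK 8 spreads the high bits this way)
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public V put(K key, V value) {
        // Step 2: index = hash & (length - 1), always within the table
        int index = hash(key) & (table.length - 1);
        if (table[index] == null) {              // Step 3: empty slot, occupy it
            table[index] = new Node<>(key, value);
            return null;
        }
        Node<K, V> n = table[index];
        while (true) {
            if (Objects.equals(n.key, key)) {    // equal key: replace the value
                V old = n.value;
                n.value = value;
                return old;
            }
            if (n.next == null) break;
            n = n.next;
        }
        n.next = new Node<>(key, value);         // Step 4: conflict, tail insert (JDK 8 style)
        return null;
    }

    public V get(K key) {
        int index = hash(key) & (table.length - 1);
        for (Node<K, V> n = table[index]; n != null; n = n.next)
            if (Objects.equals(n.key, key)) return n.value;
        return null;
    }
}
```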

2. HashMap storage process

The addressing algorithm is the same as before: the key's hashCode is XORed with its high bits, then bitwise ANDed with (table.length - 1) to obtain an array index. What happens next depends on the state of that array slot, which falls roughly into 4 cases:

(1.) The first case: the slot at the array index is empty.
Nothing special here: the slot is simply occupied. The key and value passed to the current .put call are packed into a Node object and placed into the slot.

(2.) The second case: the slot is not empty, but the node it references has not been chained.
First compare the key of the existing node object with the key currently being put. If they are completely equal, it is just a replace operation: the value in the node occupying the slot is replaced with the new value. Otherwise, this put is a genuine hash conflict, and a new node is appended after the slot's node using tail insertion (as mentioned earlier, jdk7 inserts new elements at the head node, jdk8 at the tail node).

(3.) The third case: the slot already holds a chain.
This is similar to the second case. First iterate over the linked list and compare each element's key with the key currently being put. If an exact match is found, it is again a replace operation, swapping the node's old value for the new one from put. Otherwise, once iteration reaches the tail of the list without finding a matching node, the put data is packed into a node and appended to the end of the list, just as before. Then the current list's length is checked against the treeify threshold; if the threshold is reached, the treeify method is called, and all treeification work happens inside that method.

(4.) The fourth case: the conflict is so severe that the linked list has already been converted into a red-black tree.
The red-black tree is more involved. To understand it, start with TreeNode: TreeNode extends the Node structure and adds several fields on top of Node: a parent field pointing to the parent node, a left field pointing to the left child, a right field pointing to the right child, and a red field representing the node's color. That is the basic TreeNode structure. Insertion into a red-black tree first finds a suitable insertion point, i.e. the parent of the node being inserted. Since a red-black tree satisfies all the properties of a binary search tree, finding that parent works exactly like binary search tree insertion: each node's left child is smaller than the node and its right child is larger, so every level descended eliminates roughly half the remaining data, making search very efficient. The search itself splits into cases:

  1. In the first case, the descent continues until the left or right subtree position is null, which means no TreeNode in the whole tree has a key equal to the current put key. The node where the descent stopped is the insertion parent; comparing the inserted node's hash with the parent's hash decides whether it becomes the left or right child. Of course, the insertion can break the balance, and the red-black tree balancing algorithm is needed to restore it.

  2. In the second case, during the descent from the root a TreeNode is found whose key is exactly equal to the key being put; this is again a replace operation, swapping in the new value.
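The parent-finding descent described above can be sketched with a plain binary search tree. This is illustrative only: the real TreeNode compares hashes (with tie-breaking) rather than raw int keys, and the rebalancing/recoloring step is omitted.

```java
// Sketch of the two descent cases: stop at a null child (insert there) or
// find an equal key (would be a replace). No red-black rebalancing.
class BstSketch {
    static class TreeNode {
        int key;
        TreeNode left, right, parent;
        TreeNode(int key) { this.key = key; }
    }

    static TreeNode insert(TreeNode root, int key) {
        if (root == null) return new TreeNode(key);
        TreeNode p = root;
        while (true) {
            if (key == p.key) return root;                // case 2: equal key, replace value
            TreeNode next = (key < p.key) ? p.left : p.right;
            if (next == null) {                           // case 1: found the insertion parent
                TreeNode n = new TreeNode(key);
                n.parent = p;
                if (key < p.key) p.left = n; else p.right = n;
                return root;                              // a real red-black tree rebalances here
            }
            p = next;
        }
    }

    static boolean contains(TreeNode root, int key) {
        for (TreeNode p = root; p != null; p = (key < p.key) ? p.left : p.right)
            if (key == p.key) return true;
        return false;
    }
}
```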


Six, why does HashMap introduce red-black trees in jdk8?

It mainly solves the severe chaining caused by hash conflicts before jdk1.8, because the linked list structure has very poor query efficiency. Unlike an array, which can jump straight to the desired value by index, a list can only be traversed node by node. When hash conflicts are severe and the list grows long, query performance suffers badly: a hash table's ideal query cost is O(1), but with heavy chaining lookups degenerate to O(n). To solve this, the HashMap in jdk8 adds a red-black tree: when a linked list's length reaches 8, the list becomes a red-black tree. A red-black tree is essentially a special self-balancing binary search tree; its details are involved, but in any case it performs much better than a list.
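The O(n) vs O(log n) gap can be made concrete with a back-of-the-envelope count (not a benchmark): comparisons needed to reach the last of n colliding keys by list traversal versus an idealized balanced-tree descent.

```java
// Rough illustration of why treeification matters for long collision chains.
public class LookupSteps {
    public static void main(String[] args) {
        int n = 1024;
        int listSteps = n;                                    // linked list: walk every node
        int treeSteps = 31 - Integer.numberOfLeadingZeros(n); // floor(log2 n), ideal balanced tree
        System.out.println(listSteps + " vs " + treeSteps);   // 1024 vs 10
    }
}
```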


Seven, after expansion creates the new table array, how is the data in the old array migrated?

Migration proceeds bucket by bucket: one slot is processed at a time, and the handling depends on the data state of the bucket currently being processed. There are again roughly four states,
and the migration rules for the four states are not the same.

(1.) The first case: the slot at the array index is empty.
Nothing to say here; no processing is needed.

(2.) The second case: the slot is not empty, but the node it references has not been chained.
When the slot is not empty but its node has no chain, the slot has never had a hash conflict. The node is migrated directly: its position in the new table is computed from the new table's tableSize, and it is stored there.

(3.) The third case: the slot stores a chained node.
When a node's next field is not empty, the slot has had hash conflicts. The linked list saved in the current slot must then be split into two linked lists, a high chain and a low chain: low-chain nodes keep their old index in the new table, while high-chain nodes move to old index + old capacity.

(4.) The fourth case: the slot stores the root of a red-black tree, a TreeNode object.
This one is very complicated, and this article will not go into it for now (the blogger has not fully understood it yet =_=!)




Origin blog.csdn.net/weixin_49822811/article/details/113804402