[Concurrent programming] HashMap interview questions, answered

Table of Contents

1. Introduce the underlying data structure of HashMap

2. Why should it be changed to "array + linked list + red-black tree"?

3. When is a linked list used? When is a red-black tree used?

4. Why is the threshold for converting a linked list to a red-black tree 8?

5. Then why is 6, rather than 8, used as the threshold for converting back to linked list nodes?

6. What are the important attributes of HashMap? What are they used for?

7. Does threshold have other functions besides storing the expansion threshold?

8. What is the default initial capacity of HashMap? Are there any restrictions on the capacity of HashMap?

9. How is this power of 2 calculated?

10. You said that the capacity of HashMap must be 2 to the Nth power. Why is this?

11. You said that the default initial capacity of HashMap is 16, why is it 16 and not others?

12. What is the default initial value of the load factor just mentioned?

13. Why is it 0.75 and not others?

14. Why not 0.74 or 0.76?

15. What is the insertion process of HashMap?

16. You mentioned that insertion first computes the key's hash value. How is that hash designed?

17. Why should the high 16 bits of hashCode participate in the operation?

18. What is the resize process?

19. During a resize, both red-black tree and linked list nodes are positioned in the new table through e.hash & oldCap == 0. Why is this?

20. Is HashMap thread safe?

21. Tell me about the infinite loop problem.

22. What are the main optimizations of JDK 1.8?

23. In addition to HashMap, which Maps have you used, and how do you choose which to use?


1. Introduce the underlying data structure of HashMap

We are now on JDK 1.8, where the bottom layer is composed of "array + linked list + red-black tree", as shown in the figure below; before JDK 1.8 it was composed of "array + linked list".

2. Why should it be changed to "array + linked list + red-black tree"?

The main purpose is to improve search performance when hash collisions are severe (the linked list grows too long): searching a linked list costs O(n), while a red-black tree costs O(log n).

3. When is a linked list used? When is a red-black tree used?

For insertion, linked list nodes are used by default. When the number of nodes at the same index position reaches 9 after an insert (treeify threshold 8): if the array length is greater than or equal to 64 at that point, the linked list at that position is converted to a red-black tree (treeifyBin); if the array length is less than 64, the linked list is not converted; the table is resized instead, because the amount of data is still relatively small.

For removal, when the number of nodes at the same index position drops to 6 after a removal and the node at that index position is a red-black tree node, the red-black tree is converted back (untreeify) into linked list nodes.
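
These thresholds appear as constants in the JDK 1.8 HashMap source:

static final int TREEIFY_THRESHOLD = 8;      // list -> tree once a bin grows past this
static final int UNTREEIFY_THRESHOLD = 6;    // tree -> list once a bin shrinks to this
static final int MIN_TREEIFY_CAPACITY = 64;  // minimum table length before treeifying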

4. Why is the threshold for converting a linked list to a red-black tree 8?

When we design a scheme, two very important factors usually have to be weighed: time and space. The same is true for HashMap. Simply put, the threshold of 8 is the result of a trade-off between time and space.

The size of a red-black tree node is about twice that of a linked list node. When a bucket has only a few nodes, the search performance advantage of the red-black tree is not obvious, and the author felt it was not worth paying twice the space for it.

Ideally, with random hash codes, the number of nodes in a bucket follows a Poisson distribution. According to the Poisson formula, the probability of a linked list reaching 8 nodes is 0.00000006 (roughly the odds of winning a lottery jackpot), which is low enough; and by the time a bucket does reach 8 nodes, the performance advantage of the red-black tree starts to show. So 8 is a reasonable number.
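
The comment in the JDK source spells this out: with the default load factor of 0.75, the number of nodes in a bucket follows a Poisson distribution with parameter about 0.5, so the probability of a bucket holding exactly k nodes is P(k) = e^(-0.5) * 0.5^k / k!, giving:

0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006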

5. Then why is 6, rather than 8, used as the threshold for converting back to linked list nodes?

If the threshold for converting back to a linked list were also 8, then whenever the number of nodes hovered around 8, the bucket would convert back and forth between a red-black tree and a linked list frequently, causing performance loss. Using 6 leaves a buffer of 2 and avoids that thrashing.

6. What are the important attributes of HashMap? What are they used for?

In addition to the table array that stores our nodes, HashMap has the following important attributes: 1) size: the number of nodes currently stored in the HashMap; 2) threshold: the expansion threshold; when size reaches this value, an expansion (resize) is triggered; 3) loadFactor: the load factor; expansion threshold = capacity * load factor.
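
In the JDK 1.8 source these are declared as follows:

transient Node<K,V>[] table;  // the bucket array, initialized lazily
transient int size;           // the number of key-value mappings
int threshold;                // next size value at which to resize
final float loadFactor;       // the load factor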

7. Does threshold have other functions besides storing the expansion threshold?

When we create a new HashMap object, threshold is also used to temporarily hold the initial capacity. HashMap does not initialize the table until we insert a node for the first time, which avoids unnecessary waste of space.

8. What is the default initial capacity of HashMap? Are there any restrictions on the capacity of HashMap?

The default initial capacity is 16. The capacity of a HashMap must be 2 to the Nth power, and HashMap computes the smallest such power of 2 greater than or equal to the capacity we pass in; for example, if we pass in 9, the capacity becomes 16. The capacity is also capped at an upper limit of 2 to the 30th power.
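
The relevant constants in the JDK 1.8 source:

static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
static final int MAXIMUM_CAPACITY = 1 << 30;        // the capacity never exceeds this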

9. How is this power of 2 calculated?

static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

Let's set aside the first line, "int n = cap - 1", for the moment and look at the following 5 lines of calculation first.

  • |= (OR-assign): this operator is relatively uncommon, but you have surely seen "+=", and it works the same way: a |= b is equivalent to a = a | b.
  • >>> (unsigned right shift): a >>> b shifts a to the right by the number of bits given by b; the vacated bits on the left are filled with zeros, and the bits shifted out on the right are discarded.

Assuming the value of n is 0010 0001, the calculation is as follows:
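
Tracing the five lines (only the low 8 bits shown):

n = 0010 0001
n |= n >>> 1;   // n = 0011 0001
n |= n >>> 2;   // n = 0011 1101
n |= n >>> 4;   // n = 0011 1111
n |= n >>> 8;   // n = 0011 1111 (no further change once every bit below the highest 1 is set)
n |= n >>> 16;  // n = 0011 1111
return n + 1;   // 0100 0000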

I believe you can see that these 5 operations propagate the highest 1 bit downward, producing 2 ones, then 4, 8, 16, and 32 ones. How many ones there end up being depends on how big the input is, but what is certain is that after these 5 operations the value has every bit below its highest 1 set; adding 1 on return then yields a power of 2 greater than or equal to cap.

Now the cap - 1 at the beginning is easy to understand: it handles the case where cap itself is already 2 to the Nth power. Without it, passing in 16 would return 32 instead of 16.

Computers work in binary at the bottom layer, and shift and OR operations are very fast, so this method is very efficient.

10. You said that the capacity of HashMap must be 2 to the Nth power. Why is this?

The formula for calculating the index position is (n - 1) & hash. When n is 2 to the Nth power, n - 1 is a value with all 1s in the low bits, so the result of ANDing any value with n - 1 is just the low N bits of that value. This achieves the same effect as taking a modulus and gives a uniform distribution. The design rests on the identity x mod 2^n = x & (2^n - 1), and the & operation is more efficient than mod.
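
A quick check of the identity in code (the hash values are chosen arbitrarily):

int n = 16;  // a power-of-2 capacity
for (int h : new int[]{29, 103, 2147483647}) {
    // for power-of-2 n, (n - 1) & h equals h % n for non-negative h
    System.out.println(((n - 1) & h) + " == " + (h % n));
}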

As shown in the figure below, when n is not 2 to the Nth power, the probability of hash collision increases significantly.

11. You said that the default initial capacity of HashMap is 16, why is it 16 and not others?

I think the main reason is that 16 is 2 to the Nth power and a reasonably sized default; 8 or 32 would arguably have been fine too. In practice, when we create a new HashMap, it is best to set the initial capacity according to our own usage; that is the most reasonable approach.

12. What is the default initial value of the load factor just mentioned?

The default value of the load factor is 0.75.

13. Why is it 0.75 and not others?

This is also the result of a trade-off between time and space. If the value is higher, say 1, space overhead drops, but the probability of hash collisions rises and lookups get more expensive; if the value is lower, say 0.5, hash collisions are reduced, but half the space is wasted. So the compromise of 0.75 seems a reasonable value.

14. Why not 0.74 or 0.76?
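
One commonly cited reason: threshold = capacity * load factor, and since the capacity is always a power of 2, a load factor of 0.75 = 3/4 makes the threshold an exact integer for any capacity of at least 4 (for example, 16 * 0.75 = 12), while 0.74 or 0.76 generally would not.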

15. What is the insertion process of HashMap?
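
In outline (JDK 1.8 putVal, summarizing points covered in the surrounding questions): 1) compute the key's hash (question 16); 2) if the table has not been initialized, resize to create it (question 7); 3) compute the index as (n - 1) & hash; if the bucket is empty, place a new node there; 4) otherwise, if the head node's key matches, that is the node to update; if the head is a red-black tree node, insert into the tree; else traverse the linked list, append at the tail, and treeify if the bin reaches the threshold (question 3); 5) if an existing key was found, replace its value and return the old value; 6) otherwise increment size and resize if size exceeds threshold.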

16. You mentioned that insertion first computes the key's hash value. How is that hash designed?

Get the key's hashCode, then XOR the high 16 bits of the hashCode with the hashCode itself to get the final hash value.

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

17. Why should the high 16 bits of hashCode participate in the operation?

The main purpose is to let the high bits also participate in the index calculation when the table length is small, without adding much overhead.

For example, in the figure below, without the high-bit mixing: since n - 1 is 0000 0111, the result depends only on the low 3 bits of the hash value, and no matter how the high bits change, the result is the same.

If we bring the high bits into the calculation, the index no longer depends only on the low bits, as the demonstration below shows.
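
A small demonstration, with hash values chosen so that only the high bits differ:

int n = 16;                       // a small table
int h1 = 0x10001, h2 = 0x20001;   // identical low 16 bits, different high bits
System.out.println((n - 1) & h1);                   // 1
System.out.println((n - 1) & h2);                   // 1 -- collision without mixing
System.out.println((n - 1) & (h1 ^ (h1 >>> 16)));   // 0
System.out.println((n - 1) & (h2 ^ (h2 >>> 16)));   // 3 -- mixing separates them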

18. What is the resize process?
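
In outline (JDK 1.8 resize, consistent with questions 19 and 22): 1) compute the new capacity (double the old) and the new threshold; 2) allocate the new table; 3) walk the old table: a bucket with a single node is re-indexed with e.hash & (newCap - 1); a red-black tree is split in two (and converted back to a linked list if a half shrinks to 6 nodes or fewer); a linked list is split into a "low" list that stays at the original index and a "high" list that moves to original index + oldCap, decided by e.hash & oldCap, with node order preserved (tail insertion).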

19. During a resize, both red-black tree and linked list nodes are positioned in the new table through e.hash & oldCap == 0. Why is this?

Suppose the table capacity before the resize is 16, and node a and node b are at the same index position in the old table.

After the resize, the table length is 32, and the new table's n - 1 has just one more 1 bit in the high position than the old table's n - 1 (marked in red in the figure).

Because the two nodes were at the same index position in the old table, the calculation of their new-table index depends only on that extra high bit (marked in red in the figure), and the value of that bit is exactly oldCap.

Since only that one bit matters, there are just two cases: 1) (e.hash & oldCap) == 0: the new-table index equals the original index; 2) (e.hash & oldCap) != 0: the new-table index equals the original index + oldCap.
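
A concrete example, with hash values made up for illustration:

int oldCap = 16, newCap = 32;
int hashA = 0b00101, hashB = 0b10101;      // differ only in bit 4
System.out.println((oldCap - 1) & hashA);  // 5  -- same index in the old table
System.out.println((oldCap - 1) & hashB);  // 5  -- same index in the old table
System.out.println(hashA & oldCap);        // 0  -> stays at index 5
System.out.println(hashB & oldCap);        // 16 -> moves to index 5 + 16 = 21
System.out.println((newCap - 1) & hashB);  // 21 -- agrees with the rule above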

20. Is HashMap thread safe?

It is not. Under concurrency, puts can overwrite each other's data, and modifying a HashMap while traversing it throws ConcurrentModificationException. Before JDK 1.8 there was also the infinite loop problem.

21. Tell me about the infinite loop problem.

The root cause of the infinite loop is that resizing in JDK 1.7 uses "head insertion", which reverses the order of the nodes at an index position after a resize. From JDK 1.8 on, "tail insertion" is used: node order is preserved across a resize, and the infinite loop problem no longer exists.

The resize (transfer) code of JDK 1.7.0 is shown below; it is easier to understand with an example.

void transfer(Entry[] newTable) {
    Entry[] src = table;
    int newCapacity = newTable.length;
    for (int j = 0; j < src.length; j++) {
        Entry<K,V> e = src[j];
        if (e != null) {
            src[j] = null;
            do {
                Entry<K,V> next = e.next;               // remember the rest of the list
                int i = indexFor(e.hash, newCapacity);  // index in the new table
                e.next = newTable[i];                   // head insertion: e points at the current head...
                newTable[i] = e;                        // ...and becomes the new head
                e = next;
            } while (e != null);
        }
    }
}

Example: we have a HashMap with capacity 2 and loadFactor = 0.75. Thread 1 and thread 2 insert a node into the HashMap at the same time, both trigger the resize process, and then the following unfolds.

  • Before the two threads are inserted into the node and the expansion process is triggered, the structure at this time is as shown in the figure below.

  • Thread 1's resize executes up to the line Entry<K,V> next = e.next, and the thread is then scheduled out and suspended. The structure at this time is as shown in the figure below.

  • After thread 1 is suspended, thread 2 enters the expansion process and goes through the entire expansion process. The structure at this time is as shown in the figure below.

Since the two threads operate on the same table, the picture can be drawn as shown below.

  • After thread 1 resumes, it continues to complete the first loop process. The structure at this time is as shown in the figure below.

  • Thread 1 continues to complete the second loop, and the structure at this time is as shown in the figure below.

  • Thread 1 continues with the third loop iteration, and a cycle forms when e.next = newTable[i] executes. The structure after the third iteration is as shown in the figure below.
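
Putting the figures into a trace, with keys invented for illustration: assume the bucket at old index 1 holds the list 3 -> 7 (both keys are odd, and 3 % 4 == 7 % 4 == 3, so they also share bucket 3 in the new table).

// Thread 2 completes its resize first; head insertion reverses the list,
// leaving the shared nodes as 7 -> 3 (7.next = 3, 3.next = null).
// Thread 1 resumes with its stale locals e = 3, next = 7:
//   loop 1: 3.next = null;  newTable[3] = 3;  e = 7
//   loop 2: next = 7.next = 3 (set by thread 2);  7.next = 3;  newTable[3] = 7;  e = 3
//   loop 3: next = 3.next = null;  3.next = newTable[3] = 7   <- cycle: 3 -> 7 -> 3
// e is now null, the loop exits, and the bucket holds a cyclic list.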

If thread 1 then calls map.get(11), the tragedy appears: an infinite loop (with the keys assumed above, 11 also maps to index 3 but was never inserted, so the traversal chases the cycle and never reaches null).

22. What are the main optimizations of JDK 1.8?

We have actually covered the main optimizations of JDK 1.8 along the way. The main points are as follows:

1) The underlying data structure changed from "array + linked list" to "array + linked list + red-black tree". The main point is to improve search performance when hash collisions are severe and the linked list grows too long: O(n) -> O(log n).

2) The way the initial table capacity is calculated changed. The old way starts from 1 and keeps shifting left until it finds a value greater than or equal to the requested capacity; the new way computes it with "5 shifts + OR-assign operations".

// JDK 1.7.0
public HashMap(int initialCapacity, float loadFactor) {
    // omitted
    // Find a power of 2 >= initialCapacity
    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;
    // ... omitted
}
// JDK 1.8.0_191
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

3) The hash calculation was simplified. The old version mixes the hash with a pile of shifts and XORs; the new version simply lets the high 16 bits participate in the calculation.

// JDK 1.7.0
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
// JDK 1.8.0_191
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

4) During a resize, the insertion method changed from "head insertion" to "tail insertion", which avoids the infinite loop under concurrency.

5) During a resize, computing a node's index position in the new table changed from "h & (length - 1)" to "hash & oldCap". The performance gain may not be large, but the design is more ingenious and elegant.

23. In addition to HashMap, which Maps have you used, and how do you choose which to use?
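
Besides HashMap, common choices include: LinkedHashMap, when a predictable iteration order (insertion order or access order) is needed; TreeMap, when the keys must stay sorted; and ConcurrentHashMap, when thread safety is needed (generally preferred over the legacy Hashtable). Otherwise HashMap is the default choice.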

 


Origin: blog.csdn.net/qq_41893274/article/details/113790727