Understanding HashMap

The structure and underlying principles of HashMap:

HashMap is a very commonly used data structure. It is composed of an array plus linked lists.

Roughly speaking, each slot in the array stores a key-value instance, called Entry in Java 7 and Node in Java 8.

Initially every slot is null. When put is called, an index is computed from the key via the hash algorithm. For example, if I call put("Shuaibing", 220), an element whose key is "Shuaibing" is inserted. The hash algorithm computes the insertion index, which in this case turns out to be 2:

hash("Shuaibing") = 2

We just mentioned that there is also a linked list. Why do we need one, and what does it look like?

We know that the length of the array is limited. Within that limited length we use a hash, and hashing is probabilistic: the hash values of "Shuaibing" and "Bingshuai" may well turn out to be the same:

hash("Bingshuai") = 2

Then a linked list like this is formed:

Each Node stores its own hash, key, value, and a reference to the next Node. The Node source code is roughly as follows:
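Abridged from the JDK 8 HashMap source (the accessor methods and the hashCode/equals overrides are omitted here):

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;     // the key's hash, cached so it never needs recomputing
    final K key;
    V value;
    Node<K,V> next;     // the next node on the same bucket's chain

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
    // getKey(), getValue(), setValue(), toString(), hashCode(), equals() omitted
}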

Speaking of the linked list, do you know how a new Entry (Node) is inserted into it?

Before Java 8, head insertion was used: the new node is placed at the head of the bucket and the existing nodes are pushed down the list, just like the example above. The designers assumed that a recently inserted entry is more likely to be looked up soon, so putting it at the head should improve lookup efficiency.

However, in Java 8 it was changed to tail insertion.

Then why change it to tail insertion?

First, let's look at the expansion mechanism of HashMap:

As mentioned earlier, the capacity of the array is limited. As data keeps being inserted, once a certain amount is reached the array is expanded, that is, resized.

When does a resize happen?

There are two factors:

  • Capacity: the current length of the underlying array.
  • LoadFactor: the load factor, 0.75f by default.

How should we understand this? Suppose the current array capacity is 100. When you insert the 76th element, the map sees that it has passed the limit set by the LoadFactor (100 * 0.75 = 75), so it needs to resize. But HashMap's expansion is not as simple as just making the array bigger.
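A tiny sketch of that trigger condition, using the hypothetical numbers from the example (real HashMap capacities are powers of two, but the arithmetic is the same):

public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 100;        // hypothetical capacity from the example above
        float loadFactor = 0.75f;  // the default load factor
        int threshold = (int) (capacity * loadFactor);
        // prints 75: once the size passes this value (the 76th insert), a resize is triggered
        System.out.println(threshold);
    }
}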

Expansion? How does it expand?

It is done in two steps:

  • Expansion: create a new, empty Entry array twice the length of the original one.
  • Rehash: traverse the original Entry array and re-hash every Entry into the new array.

Why re-hash at all? Wouldn't it be fine to just copy the entries over directly?

Because once the length of the array changes, the hash-to-index rule changes with it.

Hash formula: index = HashCode(Key) & (Length - 1)

If the original array length (Length) is 8, the bit operation might give you index 2; with the new array length of 16, the same hash code can obviously give a different index.

Before expansion:

After expansion:
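Since the before/after pictures are hard to reproduce here, a small sketch with a made-up hash value shows the effect: the same hash lands at a different index once the length doubles.

public class ResizeIndexDemo {
    public static void main(String[] args) {
        int hash = 0b101010;                // 42, a made-up hash code for illustration

        int indexBefore = hash & (8 - 1);   // old length 8  -> index 2
        int indexAfter  = hash & (16 - 1);  // new length 16 -> index 10

        System.out.println(indexBefore + " -> " + indexAfter); // prints "2 -> 10"
    }
}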

Now that we have covered the resize mechanism, let's answer the earlier question: why was head insertion used before, and why was it changed to tail insertion in Java 8?

For example:

Suppose we use different threads to insert A, B, and C into a map whose capacity is 2, and we set a breakpoint right before resize: the data has been inserted, but the resize has not run yet. Just before the expansion it might look like this:

We can see that the linked list points to A->B->C:

With head insertion into a singly linked list, a new element at a given index is always placed at the head of that bucket's list. During a resize, elements that were on the same Entry chain in the old array may end up at different positions in the new array after their indexes are recalculated:

It is possible that B's next pointer points to A:

Once several threads interleave like this, a circular linked list may appear:

If you then call get on that bucket, tragedy strikes: an infinite loop.

The example above shows that before 1.8, head insertion had a fatal problem in concurrent scenarios: the list could form a cycle, and a subsequent get would spin forever in an infinite loop. (HashMap is not thread-safe in any case.)
Before 1.8, hash collisions were handled by storing the colliding entries in a linked list, and head insertion was used to gain a little efficiency.
After 1.8 that gain hardly matters, because once a bucket's list grows past 8 nodes it is converted into a red-black tree anyway, so the bounded traversal that tail insertion needs has little effect on efficiency.
Also, the data structure changed in 1.8: when a list reaches the threshold it is upgraded to a red-black tree, whose construction requires comparing nodes and maintaining order, so head insertion no longer fits that model.

Head insertion changes the order of the list during a resize, while tail insertion preserves the original order of the elements, so the list can never be turned into a loop.

That is to say, it was originally A->B, but the linked list is still A->B after expansion:

In Java 1.7, operating a HashMap from multiple threads may cause an infinite loop, because the order of a bucket's list is reversed during the resize transfer, which modifies the reference relationships between the nodes of the original list.

Under the same conditions, Java 1.8 does not produce an infinite loop, because the order of the list is preserved during the transfer and the reference relationships between nodes are maintained.

Does that mean HashMap can be used from multiple threads in Java 8?

I would say no. Even though there is no infinite loop, the source code shows that put and get are not synchronized. In a multi-threaded situation the typical symptom is that a value you put one moment is not guaranteed to still be the value you get the next moment, because another thread may have overwritten it, so thread safety is still not guaranteed.

So what is the default initialization length of HashMap?

I remember that the initial size was 16 when I looked at the source code.

Then why is it 16?

In the JDK 1.8 HashMap source (around line 236), the default is defined as 1 << 4, which is 16.
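The constant as it appears in the JDK 8 HashMap source:

/**
 * The default initial capacity - MUST be a power of two.
 */
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16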

Why use a bit shift? Wouldn't writing 16 directly be just as good?

Mainly because bit operations are fast: a bitwise operation is executed directly by the CPU on the binary representation of the value, so it is much cheaper than arithmetic such as division or modulo. Remember that computers store data in binary form in the first place.

Then why use 16 instead of other numbers?

To understand why it is 16, we have to look at how HashMap stores data.

For example, suppose the key "Shuaibing" has a hash code of 766132 in decimal, which is 10111011000010110100 in binary.

Let's look at the calculation formula of index again:

Hash formula: index = HashCode(Key) & (Length - 1)

Thus, index = 10111011000010110100 & (16 - 1);

That is, index = 10111011000010110100 & 15;

And 15 is 1111 in binary, so:

index = 10111011000010110100 & 1111 = 0100 = 4.

The reason for using a bitwise AND is that, for a power-of-two length, it gives the same result as taking the modulo, with much better performance!
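A quick check of that equivalence with the example hash code from above (it only holds when the length is a power of two and the hash is non-negative):

public class ModVsMaskDemo {
    public static void main(String[] args) {
        int hashCode = 766132;   // the example hash code from above
        int length = 16;

        System.out.println(hashCode % length);        // 4, using modulo
        System.out.println(hashCode & (length - 1));  // 4, same result with a cheap bitwise AND
    }
}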

In other words, the final index produced by the hash algorithm depends entirely on the last few bits of the key's hashCode (in binary).

And when the size is a power of 2, every binary bit of (Length - 1) is 1, so the index is exactly the last few bits of the hashCode.

Therefore, as long as the input hashCodes are themselves evenly distributed, the resulting indexes are evenly distributed too; in other words, hash collisions are minimized.

Then why 16 in particular, rather than some other power of 2? In theory any power of 2 would work. But 2, 4, or 8 are a bit small: the map would have to resize after adding only a little data, and frequent resizing hurts performance. 32 or larger would waste space for typical use. So 16 was kept as a sensible, empirically chosen default.

Another question: why do we need to override hashCode when we override equals?

Can you give an example using HashMap?

Because in Java all classes inherit from Object, which defines two methods, equals and hashCode, both involved in deciding whether two objects are equal.

When equals is not overridden, we inherit the default implementation from Object, which compares the memory addresses of the two objects. Obviously, if we create two objects with new, their memory addresses are different.

  • For primitive types, == compares the values of the two operands.
  • For reference types, == compares the memory addresses of the two objects.

The characteristics of the hash code are:

For the same object, as long as it has not been modified (so that equals still returns true), its hashCode value is always the same, whenever it is called.

For two objects whose equals returns false, their hashCode values may nevertheless be equal.

When adding data to HashMap, there are two situations:

  • The slot at the computed index is empty. This case is simple: just put the element there directly.
  • The slot at the index is already occupied. In this case we need to decide whether the existing element equals the new one, using equals to compare them.

With the default rule, the addresses of the two objects are compared, so they are only equal if they are literally the same object. Of course, we can override equals to define our own rule; the most common approach is to compare field values.

If they are equal, the value is overwritten directly. If not, the new element is stored in the linked-list structure under that bucket.

So when we use a custom object as a key, why must we override hashCode along with equals?

Let's look at an example (from the perspective of HashMap) of what goes wrong if equals is overridden without overriding hashCode:

import java.util.HashMap;

public class MyTest {
    private static class Person{
        int idCard;
        String name;
        public Person(int idCard, String name) {
            this.idCard = idCard;
            this.name = name;
        }
        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (o == null || getClass() != o.getClass()){
                return false;
            }
            Person person = (Person) o;
            // two Person objects are considered equal if their idCard matches
            return this.idCard == person.idCard;
        }
    }
    public static void main(String []args){
        HashMap<Person,String> map = new HashMap<Person, String>();
        Person person = new Person(1234,"乔峰");
        // put it into the HashMap
        map.put(person,"天龙八部");
        // get it back; logically this should print "天龙八部"
        System.out.println("Result: " + map.get(new Person(1234,"萧峰")));
    }
}

Actual output: Result: null

If we already understand how HashMap works, this result is not hard to explain. Although the keys used in the put and the get are logically equal (equals returns true), hashCode was not overridden. So during put the key goes key(hashCode1) -> hash -> indexFor -> one index, while during get the new key goes key(hashCode2) -> hash -> indexFor -> a different index. Because hashCode1 is not equal to hashCode2, the lookup lands in a different bucket and returns the logically wrong value: null.

Therefore, whenever you override equals, remember to override hashCode as well, and make sure that two objects considered equal by equals return the same integer from hashCode.
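A minimal fix for the Person example above is to add a hashCode override that is consistent with the equals we wrote, so that two Person objects with the same idCard hash to the same bucket:

    @Override
    public int hashCode() {
        // derive the hash from the same field that equals compares
        return Integer.hashCode(idCard);
    }

With this method added to Person, map.get(new Person(1234, "萧峰")) in the earlier example locates the same bucket and returns "天龙八部".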

When does a linked list in HashMap become a red-black tree in JDK 1.8?

When the length of a bucket's linked list exceeds 8, it is automatically converted into a red-black tree, and when deletions shrink it below 6 it turns back into a linked list.
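The corresponding constants in the JDK 8 HashMap source:

static final int TREEIFY_THRESHOLD = 8;      // a bin is treeified when an element is added to a bin that already holds at least this many nodes
static final int UNTREEIFY_THRESHOLD = 6;    // a (split) bin reverts to a list during resize when it has at most this many nodes
static final int MIN_TREEIFY_CAPACITY = 64;  // below this table capacity, bins are not treeified; the table is resized instead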

Another question: since HashMap is not thread-safe, how do you handle map access in scenarios that require thread safety?

In such scenarios we generally use Hashtable or ConcurrentHashMap, but because Hashtable's concurrency is so poor it has almost no real use cases, so ConcurrentHashMap is what we actually use when thread safety is needed. I have read the Hashtable source: it is simple and crude, locking directly at the method level, so concurrency is very low and at most one thread can access it at a time. ConcurrentHashMap is much better; 1.7 and 1.8 differ a lot, but both offer far higher concurrency than Hashtable.


Tell me about ConcurrentHashMap?

Compared with Hashtable, ConcurrentHashMap supports a much higher degree of concurrency, so for general multi-threaded use ConcurrentHashMap is basically the default choice.

Let's talk about Hashtable first?

Compared with HashMap, Hashtable is thread-safe and can be used in multi-threaded situations, but its efficiency is poor.

Why is Hashtable inefficient?

In the source code, every data operation is synchronized on the whole table, so efficiency is low.

Besides this, is there any difference between Hashtable and HashMap?

Hashtable does not allow null keys or values, while HashMap allows a null key and null values.

Why does Hashtable forbid null keys and values while HashMap allows them?

Because Hashtable simply throws a NullPointerException when we put a null, whereas HashMap treats a null key specially:

static final int hash(Object key) {
    int h;
    // a null key is mapped to hash 0, so it always lands in bucket 0;
    // otherwise the high 16 bits are XORed into the low bits to spread the hash
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
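A minimal demonstration of the difference (the Hashtable line throws at runtime):

import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, String> hashMap = new HashMap<>();
        hashMap.put(null, "ok");                // allowed: the null key hashes to 0
        System.out.println(hashMap.get(null));  // prints "ok"

        Map<String, String> table = new Hashtable<>();
        table.put(null, "boom");                // throws NullPointerException
    }
}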

That still doesn't fully explain why Hashtable forbids null keys and values while HashMap allows them, does it?

This is because Hashtable uses a fail-safe mechanism, which means the data you read at one moment is not necessarily the latest data.

If null values were allowed, you could not tell whether a get that returns null means the key is absent or the key is mapped to null, because in a concurrent map you cannot reliably call containsKey(key) afterwards to check (the map may have changed in between). The same applies to ConcurrentHashMap.

Are there any other differences?

  • The parent class is different: Hashtable extends the Dictionary class, while HashMap extends the AbstractMap class.

    Hardly anyone seems to use Dictionary anymore, and neither have I.

  • The initial capacity is different: HashMap's initial capacity is 16, Hashtable's initial capacity is 11, and the default load factor of both is 0.75.

  • The expansion mechanism is different: when the current size exceeds capacity * load factor, HashMap grows to 2 times the current capacity, while Hashtable grows to 2 times the current capacity + 1.

  • The iterators are different: HashMap's Iterator is fail-fast, while Hashtable's Enumerator is not fail-fast.

    Therefore, if another thread changes the structure of a HashMap while it is being iterated, for example by adding or deleting elements, a ConcurrentModificationException is thrown, but Hashtable's Enumeration does not do this.

What is fail-fast?

Fail-fast is a mechanism of the Java collection classes: when a collection is being traversed with an iterator and its contents are structurally modified (elements added, removed, or changed) during the traversal, a ConcurrentModificationException is thrown.

What is the principle of fail-fast?

While traversing, the iterator accesses the collection's contents directly and relies on a modCount variable.

If the collection is structurally modified during the traversal, the value of modCount changes.

Before moving on to the next element via hasNext()/next(), the iterator checks whether modCount still equals expectedModCount; if it does, the traversal continues; otherwise an exception is thrown and the traversal is aborted.

Tip: the exception is thrown only when modCount != expectedModCount is detected. If the collection is modified but modCount happens to end up equal to expectedModCount again, no exception is thrown.

Therefore, you must not write concurrent code that depends on whether this exception is thrown; it is only meant to help detect concurrent-modification bugs.
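A minimal sketch that triggers the fail-fast behaviour by structurally modifying the map (outside the iterator) while it is being iterated:

import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);

        for (String key : map.keySet()) {
            // removing directly from the map (not via Iterator.remove) changes modCount,
            // so the iterator's next call to next() throws ConcurrentModificationException
            map.remove(key);
        }
    }
}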

In which scenarios does fail-fast apply?

The collection classes under the java.util package are fail-fast: they cannot be concurrently modified from multiple threads (that is, modified while being iterated). It is a kind of safety mechanism.

Tip: by contrast, the containers under the java.util.concurrent package are fail-safe; they can be used and modified concurrently from multiple threads.

Next, let's talk about ConcurrentHashMap?

Let's talk about its data structure. Why is its concurrency so high?

The bottom layer of ConcurrentHashMap is also an array plus linked lists, but the concrete implementation differs between JDK 1.7 and 1.8. Let me describe the 1.7 data structure first:

As shown in the figure above, ConcurrentHashMap in 1.7 is composed of a Segment array and HashEntry chains. Like HashMap, it is still an array plus linked lists.

Segment is an internal class of ConcurrentHashMap, the main components are as follows:

static final class Segment<K,V> extends ReentrantLock implements Serializable {

    private static final long serialVersionUID = 2249069246763182397L;
    // plays the same role as the table in HashMap: the buckets that actually hold the data
    transient volatile HashEntry<K,V>[] table;
    // number of elements in this segment
    transient int count;
    // remember fail-fast? this is the modification counter it relies on
    transient int modCount;
    // resize threshold
    transient int threshold;
    // load factor
    final float loadFactor;
}

HashEntry is similar to HashMap's entry, except that HashEntry declares its value and its next pointer as volatile.

What are the characteristics of volatile?

  • It guarantees visibility of the variable across threads: when one thread modifies it, the new value is immediately visible to other threads. (visibility)

  • It forbids instruction reordering. (ordering)

  • volatile only guarantees atomicity for a single read or write; it does not make compound operations such as i++ atomic.

I won't spend more space on it here; it will be covered in the multithreading chapter. For now, just know that it is used to help make access safe.

Tell me the reason for its high concurrency?

In principle, ConcurrentHashMap in 1.7 uses segmented locking, where each Segment extends ReentrantLock.

Unlike Hashtable, where both put and get must synchronize on the whole table, ConcurrentHashMap in theory supports a level of concurrency equal to the number of Segments. Whenever a thread holds the lock to access one Segment, the other Segments are unaffected. That is, with the default of 16 Segments the concurrency level is 16: 16 threads can operate on 16 different Segments at the same time, and it is thread-safe.

public V put(K key, V value) {
    Segment<K,V> s;
    if (value == null)
        throw new NullPointerException(); // this is why you cannot put a null value
    int hash = hash(key);
    int j = (hash >>> segmentShift) & segmentMask;
    if ((s = (Segment<K,V>)UNSAFE.getObject
         (segments, (j << SSHIFT) + SBASE)) == null)
        s = ensureSegment(j);
    return s.put(key, hash, value, false);
}

It first locates the Segment, and then performs the put inside it.

Let's look at the Segment's put source code to see how it achieves thread safety. The key lines are commented.

final V put(K key, int hash, V value, boolean onlyIfAbsent) {
     // try to acquire the lock; if that fails, scanAndLockForPut() spins until the lock is held
     HashEntry<K,V> node = tryLock() ? null : scanAndLockForPut(key, hash, value);
     V oldValue;
     try {
          // locate the HashEntry bucket in this Segment's table via the key's hash
          HashEntry<K,V>[] tab = table;
          int index = (tab.length - 1) & hash;
          HashEntry<K,V> first = entryAt(tab, index);
          for (HashEntry<K,V> e = first;;) {
               if (e != null) {
                   K k;
                   // walk the chain; if the given key equals the current node's key, overwrite the old value
                   if ((k = e.key) == key || (e.hash == hash && key.equals(k))) {
                        oldValue = e.value;
                        if (!onlyIfAbsent) {
                             e.value = value;
                             ++modCount;
                        }
                        break;
                   }
                   e = e.next;
                } else {
                     // end of the chain (or empty bucket): create a new HashEntry and add it
                     // to the Segment, checking first whether a resize is needed
                     if (node != null)
                         node.setNext(first);
                     else
                         node = new HashEntry<K,V>(hash, key, value, first);
                     int c = count + 1;
                     if (c > threshold && tab.length < MAXIMUM_CAPACITY)
                        rehash(node);
                     else
                        setEntryAt(tab, index, node);
                     ++modCount;
                     count = c;
                     oldValue = null;
                     break;
                 }
            }
       } finally {
            // release the lock
            unlock();
       }
       return oldValue;
}

First, it tries to acquire the lock. If that fails, another thread must be competing for it, and it falls back to scanAndLockForPut(), which spins to acquire the lock:

  1. Attempt to spin to acquire a lock.
  2. If the number of retries reaches MAX_SCAN_RETRIES, it switches to a blocking lock acquisition to make sure it eventually succeeds.

What about its get logic?

The get logic is relatively simple: hash the key to locate the Segment, then hash again to locate the specific element within it. Because the value field of HashEntry is declared volatile, memory visibility is guaranteed and the value read is always up to date.

ConcurrentHashMap's get is very efficient because the whole process requires no locking.

Have you noticed that although 1.7 supports concurrent access per Segment, it still has problems?

Yes. Because it is still essentially an array of linked lists, a query may have to traverse a linked list, which is inefficient. This is the same problem as HashMap in JDK 1.7, and it was thoroughly addressed in JDK 1.8.

What does its data structure look like in JDK 1.8?

The original Segment-based locking was abandoned, and CAS + synchronized is used instead to guarantee concurrent safety.

It is very similar to HashMap: HashEntry is replaced by Node, but its role is the same, with value and next declared volatile to ensure visibility. Red-black trees are also introduced: when a linked list grows beyond a certain length (8 by default) it is converted into a tree.

What about the value access operations, and how is thread safety guaranteed?

The put operation of ConcurrentHashMap is still fairly involved, and can be roughly divided into the following steps (a simplified sketch follows the list):

  1. Compute the hash from the key's hashCode.
  2. Check whether the table needs to be initialized.
  3. Locate the Node for the current key. If that bucket is empty, data can be written there: try to write it with CAS, spinning until it succeeds.
  4. If the bucket's hashcode == MOVED == -1, a resize is in progress, so help with the expansion.
  5. Otherwise, write the data under a synchronized lock on the bucket's head node.
  6. If the list length becomes greater than TREEIFY_THRESHOLD, convert the list to a red-black tree.
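The real put in JDK 1.8's ConcurrentHashMap is much longer, but a stripped-down, hypothetical sketch of just the "CAS into an empty bucket, synchronize on the head node otherwise" idea looks like this (no resizing, no treeification, no size counting):

import java.util.concurrent.atomic.AtomicReferenceArray;

class TinyConcurrentMap<K, V> {
    private static final int CAPACITY = 16;             // fixed capacity: this sketch never resizes

    private static final class Node<K, V> {
        final int hash;
        final K key;
        volatile V value;
        volatile Node<K, V> next;
        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash; this.key = key; this.value = value; this.next = next;
        }
    }

    private final AtomicReferenceArray<Node<K, V>> table = new AtomicReferenceArray<>(CAPACITY);

    public V put(K key, V value) {
        if (key == null || value == null) throw new NullPointerException();
        int hash = key.hashCode();
        int i = (CAPACITY - 1) & hash;                   // locate the bucket
        for (;;) {
            Node<K, V> head = table.get(i);
            if (head == null) {
                // empty bucket: try to CAS a new node in without any locking
                if (table.compareAndSet(i, null, new Node<>(hash, key, value, null)))
                    return null;
                // lost the race; loop and retry
            } else {
                // non-empty bucket: lock only this bucket's head node
                synchronized (head) {
                    if (table.get(i) != head)
                        continue;                        // head changed while we waited; retry
                    for (Node<K, V> e = head; ; e = e.next) {
                        if (e.hash == hash && e.key.equals(key)) {
                            V old = e.value;
                            e.value = value;             // key already present: overwrite
                            return old;
                        }
                        if (e.next == null) {
                            e.next = new Node<>(hash, key, value, null);
                            return null;                 // appended at the tail
                        }
                    }
                }
            }
        }
    }
}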

You mentioned CAS above. What is CAS? And what is spinning?

CAS is an implementation of optimistic locking and a lightweight lock.

The flow of a CAS operation is shown in the figure below: the thread does not lock when reading the data; when it is ready to write the data back, it checks whether the original value has been modified. If no other thread has modified it, the new value is written back; if it has been modified, the read-and-retry process runs again.
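A minimal illustration of CAS using AtomicInteger, whose compareAndSet is a hardware-level compare-and-swap:

import java.util.concurrent.atomic.AtomicInteger;

public class CasDemo {
    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(5);

        // expect 5, write 6: succeeds because nobody changed the value in between
        boolean first = counter.compareAndSet(5, 6);

        // expect 5 again, write 7: fails because the value is now 6
        boolean second = counter.compareAndSet(5, 7);

        System.out.println(first + " " + second + " " + counter.get()); // true false 6
    }
}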

Can CAS guarantee that the data has not been modified by other threads?

No. For example, CAS cannot detect the classic ABA problem.

What is ABA? 

Say the value starts as A. One thread changes it to B, and then another thread changes it back to A. A thread that now performs the CAS check sees that the value is still A and concludes it was never modified, when in fact it was. In many scenarios where only the correctness of the final result matters, this doesn't matter.

But in practice you often need to record the modification history. Take changes to funds, for example: every modification should leave a record so that it can be traced back.

So how to solve the ABA problem?

Use a version number. For example, when reading the original value before modifying it, also read a version number; every comparison then checks the value and the version number together, and on a successful update the version number is incremented by 1.
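The JDK ships this idea as AtomicStampedReference, which pairs the value with a stamp (version number); a small sketch:

import java.util.concurrent.atomic.AtomicStampedReference;

public class AbaDemo {
    public static void main(String[] args) {
        AtomicStampedReference<String> ref = new AtomicStampedReference<>("A", 0);

        int[] stampHolder = new int[1];
        String seen = ref.get(stampHolder);     // read the value and its version together
        int seenStamp = stampHolder[0];

        // another thread does A -> B -> A, bumping the stamp each time
        ref.compareAndSet("A", "B", ref.getStamp(), ref.getStamp() + 1);
        ref.compareAndSet("B", "A", ref.getStamp(), ref.getStamp() + 1);

        // the value is back to "A", but the stamp moved from 0 to 2, so this CAS fails
        boolean success = ref.compareAndSet(seen, "C", seenStamp, seenStamp + 1);
        System.out.println(success);            // false
    }
}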

CAS performance is very high, but I thought synchronized performed poorly. Why is synchronized used more after the JDK 1.8 upgrade?

synchronized used to be a purely heavyweight lock, but the JDK later optimized it, and it now uses a lock-upgrade approach.

For synchronized lock acquisition, the JVM uses a lock-upgrade optimization: it starts with a biased lock, which favors the same thread re-acquiring the lock; if that fails it upgrades to a CAS-based lightweight lock; if that fails it spins for a short time to avoid the thread being suspended by the operating system; and only if all of that fails does it upgrade to a heavyweight lock. So the lock escalates step by step, starting with lightweight techniques.

What about the get operation of ConcurrentHashMap?

  • Compute the hash and locate the bucket; if the element sits right at the head of the bucket, return its value directly.
  • If the bucket holds a red-black tree, look the value up through the tree.
  • Otherwise, traverse the linked list to find the value.

Summary: 1.8 made big changes to the 1.7 data structure. With the red-black tree, query efficiency is guaranteed (O(log n)), and ReentrantLock was even dropped in favor of synchronized, which shows how well synchronized has been optimized in newer JDK versions.
