ConcurrentHashMap keeps coming up in interviews, so while preparing I put together this article.

First, the background

HashMap is not thread-safe

In a multi-threaded environment, a put on a HashMap can trigger an infinite loop (during a concurrent resize), driving CPU utilization close to 100%, so HashMap must not be used under concurrency.

HashTable is an inefficient container

The HashTable container uses synchronized to ensure thread safety, but under fierce thread contention HashTable is very inefficient.

When one thread is inside a synchronized method of HashTable, other threads calling its synchronized methods are blocked or left polling. For example, while thread 1 uses put to add an element, thread 2 can neither add an element with put nor fetch one with get; so the fiercer the contention, the lower the efficiency.

Lock Segmentation

The reason HashTable performs so poorly in a highly contended concurrent environment is that every thread accessing it must compete for the same single lock.

Suppose the container held multiple locks instead, each guarding only part of its data. Then threads accessing data in different segments would never contend for a lock, which effectively improves the efficiency of concurrent access.

That is exactly the lock segmentation technique ConcurrentHashMap uses: the data is stored in segments, each segment is paired with its own lock, and while one thread holds the lock to access one segment's data, the other segments remain accessible to other threads.

Some methods must span segments, such as size() and containsValue(); they may need to lock the whole table rather than a single segment. That requires acquiring the locks of all segments in order, performing the operation, and then releasing the locks of all segments, also in order.

Here, "in order" it is very important, or is likely to deadlock inside ConcurrentHashMap, is the final segment of the array and its member variables actually final, however, is only the array is not declared as final to ensure that members of the array is final, which need to ensure the implementation. This ensures that no deadlock, because the order to get the lock fixed.

ConcurrentHashMap is built from an array of Segment structures and arrays of HashEntry structures. Segment is a reentrant lock (it extends ReentrantLock) and plays the role of the lock inside ConcurrentHashMap; HashEntry is used to store the key-value data.

A ConcurrentHashMap contains an array of Segments. A Segment's structure resembles a HashMap's: an array whose elements head linked lists. Each Segment contains one HashEntry array, each element of which is the head of a linked list of HashEntry nodes. Every Segment guards the elements of its own HashEntry array: to modify that array's data, a thread must first acquire the corresponding Segment's lock.

Second, application scenarios

When a large array needs to be shared by multiple threads, consider whether it can be split hierarchically into multiple nodes to avoid one big lock, locating the right partition with a hash algorithm.

In fact the idea goes beyond threads: when designing data tables with transactions (a transaction is, in a sense, also a synchronization mechanism), a table that needs synchronized access can be viewed as the array above. If a single table receives too many data operations (one reason to avoid huge tables), consider splitting it, for example horizontally sharding it into sub-tables by some data field. The sketch below shows the in-memory version of the idea.
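The following class is hypothetical (mine, not from any library); it merely sketches how a large shared array can be striped across several locks so that threads touching different regions do not contend on one lock.

import java.util.concurrent.locks.ReentrantLock;

class StripedArray {
    private final long[] data;
    private final ReentrantLock[] stripes;

    StripedArray(int size, int stripeCount) {
        data = new long[size];
        stripes = new ReentrantLock[stripeCount];
        for (int i = 0; i < stripeCount; i++)
            stripes[i] = new ReentrantLock();
    }

    void add(int index, long delta) {
        // Locate the stripe by a simple hash of the index.
        ReentrantLock lock = stripes[index % stripes.length];
        lock.lock();
        try {
            data[index] += delta;   // only this stripe is held, not the whole array
        } finally {
            lock.unlock();
        }
    }
}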

Third, source code interpretation

The main entity classes in ConcurrentHashMap (1.7 and earlier) are three: ConcurrentHashMap (the whole hash table), Segment (the buckets), and HashEntry (the nodes).

(The original article shows the relationship between the three in a figure.)

/**
 * The segments, each of which is a specialized hash table.
 */
final Segment<K,V>[] segments;

Immutable and Volatile

ConcurrentHashMap allows multiple read operations to proceed fully concurrently; reads require no locking. If a conventional technique were used, as in the HashMap implementation, and elements could be added to or removed from the middle of a hash chain, an unlocked read would see inconsistent data.

The technique ConcurrentHashMap relies on is making HashEntry almost immutable. A HashEntry represents one node of a hash chain; its structure is shown below:

static final class HashEntry<K,V> {
    final K key;
    final int hash;
    volatile V value;
    final HashEntry<K,V> next;
}

Notice that every field except value is final. This means nodes cannot be added to or removed from the middle or tail of a hash chain, because that would require changing a final next reference; all modifications must start from the head. For put this is no problem: an element can always be added at the head of the chain.

For remove, though, a node may have to be deleted from the middle of the chain, so all nodes in front of the deleted one must be copied, and the last copy must point at the node that followed the deleted one. This is explained in detail with the remove operation below. To guarantee that reads see the latest value, value is declared volatile, which avoids a lock.

Other details

To speed up locating a segment and the hash slot within a segment, the number of hash slots in each segment is 2^n, so both the segment and the slot can be located with bit operations. With the default concurrency level of 16, i.e. 16 segments, the upper 4 bits of the hash value decide which segment an element goes to.

But let's not forget the lesson from "Introduction to Algorithms": the number of hash slots should not be 2^n, because that can distribute keys unevenly across the slots; this is why the hash value is re-hashed once more. (The caveat may seem a bit redundant, but it matters here.)

Positioning operation:

final Segment<K,V> segmentFor(int hash) {
    return segments[(hash >>> segmentShift) & segmentMask];
}

Since ConcurrentHashMap uses the Segment locks to protect the data of different segments, it must first locate the right Segment by hashing whenever it inserts or gets an element. ConcurrentHashMap first re-hashes the element's hashCode with a variant of the single-word Wang/Jenkins hash.

The purpose of re-hashing is to reduce collisions so that elements spread evenly across the Segments, improving the container's access efficiency. If the hash were of extremely poor quality, every element would land in one Segment: access would be slow and the segmented locking meaningless. I ran a test that computes the segment index directly, without re-hashing:

System.out.println(Integer.parseInt("0001111", 2) & 15);
System.out.println(Integer.parseInt("0011111", 2) & 15);
System.out.println(Integer.parseInt("0111111", 2) & 15);
System.out.println(Integer.parseInt("1111111", 2) & 15);

All four lines print 15. The example shows that without re-hashing collisions would be severe: as long as the low bits match, the result is the same no matter what the high bits are. Running the four binary values above through the re-hash gives the results below (for readability, each value is padded to 32 bits with leading zeros and a vertical bar separates every four bits):

0100|0111|0110|0111|1101|1010|0100|1110
1111|0111|0100|0011|0000|0001|1011|1000
0111|0111|0110|1001|0100|0110|0011|1110
1000|0011|0000|0000|1100|1000|0001|1010

Every value's bits are now spread out, and every one of the 32 bits participates in the hash, which reduces collisions. It is this re-hashed value that ConcurrentHashMap uses to locate the segment.
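For reference, the re-hash function itself is short. The version below is the single-word Wang/Jenkins variant as I recall it from the JDK 6 source; treat the constants as illustrative and check your own JDK for the authoritative version.

private static int hash(int h) {
    // Spread bits to regularize both segment and index locations.
    h += (h <<  15) ^ 0xffffcd7d;
    h ^= (h >>> 10);
    h += (h <<   3);
    h ^= (h >>>  6);
    h += (h <<   2) + (h << 14);
    return h ^ (h >>> 16);
}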

segmentShift defaults to 28 and segmentMask to 15. The re-hashed value is at most 32 bits; an unsigned right shift by 28 means exactly the upper 4 bits take part in the computation. Evaluating (hash >>> segmentShift) & segmentMask for the four values above yields 4, 15, 7 and 8 respectively, so the hash values no longer collide.
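How those two defaults arise from the concurrency level can be sketched as follows (the method and variable names are mine; the logic follows the constructor):

static int[] shiftAndMask(int concurrencyLevel) {
    int sshift = 0;
    int ssize = 1;                      // number of segments, a power of two
    while (ssize < concurrencyLevel) {  // round up to the next power of two
        ++sshift;
        ssize <<= 1;
    }
    int segmentShift = 32 - sshift;     // 32 - 4 = 28 when concurrencyLevel is 16
    int segmentMask = ssize - 1;        // 16 - 1 = 15, i.e. binary 1111
    return new int[] { segmentShift, segmentMask };
}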


Data structure

All of the map's own members are final; among them, segmentMask and segmentShift exist mainly to locate segments, as in the segmentFor method above.

I don't want to dwell on the underlying hash table data structure here. One crucial aspect of any hash table is how it resolves collisions, and ConcurrentHashMap does the same thing as HashMap: nodes with the same hash value are chained into one linked list. The difference from HashMap is that ConcurrentHashMap uses multiple sub hash tables, namely segments (Segment).

Each Segment corresponds to one sub hash table; its data members are as follows:

static final class Segment<K,V> extends ReentrantLock implements Serializable {
         /**
          * The number of elements in this segment's region.
          */
         transient volatile int count;
         /**
          * Number of updates that alter the size of the table. This is
          * used during bulk-read methods to make sure they see a
          * consistent snapshot: If modCounts change during a traversal
          * of segments computing size or checking containsValue, then
          * we might have an inconsistent view of state so (usually)
          * must retry.
          */
         transient int modCount;
         /**
          * The table is rehashed when its size exceeds this threshold.
          * (The value of this field is always (int)(capacity *
          * loadFactor).)
          */
         transient int threshold;
         /**
          * The per-segment table.
          */
         transient volatile HashEntry<K,V>[] table;
         /**
          * The load factor for the hash table.  Even though this value
          * is same for all segments, it is replicated to avoid needing
          * links to outer object.
          * @serial
          */
         final float loadFactor;
 }

count tracks the number of elements in the segment. It is volatile and is used to coordinate read and modify operations, ensuring that reads see (almost always) the latest changes.

The coordination works like this: every operation that makes a structural change, such as adding or deleting a node (changing a node's value is not a structural change), writes count as its last step, and every read operation reads count as its first step. This exploits the enhancement Java 5 made to volatile semantics: there is a happens-before relationship between a write to a volatile variable and a subsequent read of the same variable.
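A minimal, self-contained sketch of that idiom (a hypothetical class of mine, not JDK code):

class VolatilePublish {
    private volatile int count = 0;          // read first, written last
    private final String[] slots = new String[8];

    void add(String s) {                     // assume an external lock serializes writers
        int c = count;
        slots[c] = s;                        // 1. make the structural change
        count = c + 1;                       // 2. volatile write publishes it
    }

    String last() {                          // lock-free reader
        int c = count;                       // volatile read: happens-before the write
        return c == 0 ? null : slots[c - 1]; // guaranteed to see the matching slot write
    }
}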

modCount counts the structural changes made to the segment, mainly so that cross-segment operations can detect whether some segment changed while they were traversing multiple segments; we will return to it when discussing the cross-segment operations.

threshold is the limit indicating when the table needs a rehash.

table is the array holding the segment's nodes; each array element is the head of a hash chain, represented by a HashEntry. table is volatile, which makes it possible to read the latest table without synchronization. loadFactor is the load factor.

The delete operation: remove(key)

public V remove(Object key) {
    int hash = hash(key.hashCode());
    return segmentFor(hash).remove(key, hash, null);
}

The operation simply locates the segment and delegates the removal to it. When several remove operations execute concurrently, they can proceed simultaneously as long as the segments they locate are different.

Here is the Segment's implementation of remove:

V remove(Object key, int hash, Object value) {
    lock();
    try {
        int c = count - 1;
        HashEntry<K,V>[] tab = table;
        int index = hash & (tab.length - 1);
        HashEntry<K,V> first = tab[index];
        HashEntry<K,V> e = first;
        while (e != null && (e.hash != hash || !key.equals(e.key)))
            e = e.next;
        V oldValue = null;
        if (e != null) {
            V v = e.value;
            if (value == null || value.equals(v)) {
                oldValue = v;

                // All entries following removed node can stay
                // in list, but all preceding ones need to be
                // cloned.
                ++modCount;
                HashEntry<K,V> newFirst = e.next;
                for (HashEntry<K,V> p = first; p != e; p = p.next)
                    newFirst = new HashEntry<K,V>(p.key, p.hash,
                                                  newFirst, p.value);
                tab[index] = newFirst;
                count = c; // write-volatile
            }
        }
        return oldValue;
    } finally {
        unlock();
    }
}

The entire operation executes while holding the segment lock. The first half of the method locates the node e to be deleted; if it does not exist, the method simply returns null. Otherwise every node in front of e is copied once, and the last copy is pointed at the node after e. The nodes behind e need no copying; they can be reused.

What is the for loop in the middle doing? Judging from the code, it clones every entry in front of the removed one and re-attaches the clones at the front of the chain. But is that necessary? Must every element before the removed one really be cloned?

This is dictated by the immutability of the entry. Look carefully at the HashEntry definition: every field except value is final, which means the next field can never be changed once set, so all the nodes in front of the removed one have to be cloned rather than re-linked.

As for why HashEntry is made immutable: immutable objects can be accessed without synchronization, which saves the time locking would otherwise cost.

(The original article illustrates this with two schematic diagrams of the hash chain, before and after the removal.)

The second diagram is slightly off: after copying, the node with value 2 should sit in front of the node with value 1, exactly the reverse of the original order. Fortunately this does not affect the discussion.
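As a textual substitute for those diagrams (my own illustration): removing C from the chain below clones A and B onto the new head, in reversed order, while D and E are reused.

before:  head -> A -> B -> C -> D -> E
after:   head -> B' -> A' -> D -> E    (A and B cloned; D and E untouched)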

The remove implementation is not complicated, but a few points deserve attention:

  • First, when the node to be deleted exists, the last step of the deletion is to decrement count. This must be the last step; otherwise a read might fail to see the structural modification just made to the segment.

  • Second, remove starts by assigning the field table to the local variable tab, because table is volatile and reading or writing a volatile variable carries a large overhead that the compiler cannot optimize away; accessing a non-volatile local variable several times costs little, since the compiler optimizes it freely.

get operation

ConcurrentHashMap's get operation is delegated directly to the Segment's get method, so let's look at that directly:

V get(Object key, int hash) {
    if (count != 0) { // read-volatile: does this segment hold any elements?
        HashEntry<K,V> e = getFirst(hash); // obtain the head node
        while (e != null) {
            if (e.hash == hash && key.equals(e.key)) {
                V v = e.value;
                if (v != null)
                    return v;
                return readValueUnderLock(e); // recheck
            }
            e = e.next;
        }
    }
    return null;
}

get operations do not require a lock.

It locks and re-reads only when the value it read is null. We know HashTable's get method must lock the whole container, so how does ConcurrentHashMap's get manage without locking? The reason is that all the shared variables its get method uses are declared volatile.

The first step reads the count variable, which is volatile. Since every modifying operation that makes a structural change writes count as its final step, this mechanism ensures that get sees nearly the latest structural updates. For non-structural updates, i.e. changes to a node's value, HashEntry's volatile value field likewise guarantees that the latest value is read.

The next step traverses the hash chain according to the hash value to find the node; if it is not found, null is returned directly. The chain can be traversed without locking because the link pointer next is final. The head pointer, however, is not final: it is returned by the getFirst(hash) method, and its value lives in the table array.

This means getFirst(hash) may return an outdated head node. For example, just after a get has executed getFirst(hash), another thread may complete a removal and update the head, so the head node the get walks from is no longer current. This is permitted: through the coordination mechanism on count, get can almost always read the latest data, even if it may not be absolutely current. To read truly current data, full synchronization would be needed.

Finally, if the requested node is found and its value is non-null, the value is returned directly; otherwise it is read again while holding the lock. This may seem hard to understand: in theory a node's value can never be null, since put checks for null and throws NullPointerException. The only possible source of a null is the default initial value of value in HashEntry; because value is not final, an unsynchronized read could observe that null.

Look carefully at this statement in put: tab[index] = new HashEntry(key, hash, first, value). Here the assignment of value inside the HashEntry constructor and the assignment to tab[index] may be reordered, which could make the node's value appear null.

So when v is null, some thread may be in the middle of changing this node, and since the get so far has taken no lock, by Bernstein's conditions a read after a write (or a write after a read) can yield inconsistent data. The value is therefore read once more under the lock, via readValueUnderLock, which guarantees the correct value.

V readValueUnderLock(HashEntry<K,V> e) {
    lock();
    try {
        return e.value;
    } finally {
        unlock();
    }
}

About count, the field that tracks the Segment's size: as a volatile variable it preserves visibility between threads, can be read by many threads at once with the guarantee that no stale value is seen, and may be written only single-threaded (writes from multiple threads are acceptable only when the value written does not depend on the previous value). In get there are only reads of the shared variables count and value and no writes, so no lock is needed.

The reason stale values are not read rests on the happens-before rule of the Java memory model: a write to a volatile field happens-before a read of it. So even if two threads touch the volatile variable at the same moment, get obtains the latest value; this is a classic application of volatile replacing a lock.

put operation

Similarly, the put operation is delegated to the segment's put method. Here it is:

V put(K key, int hash, V value, boolean onlyIfAbsent) {
    lock();
    try {
        int c = count;
        if (c++ > threshold) // ensure capacity
            rehash();
        HashEntry<K,V>[] tab = table;
        int index = hash & (tab.length - 1);
        HashEntry<K,V> first = tab[index];
        HashEntry<K,V> e = first;
        while (e != null && (e.hash != hash || !key.equals(e.key)))
            e = e.next;
        V oldValue;
        if (e != null) {
            oldValue = e.value;
            if (!onlyIfAbsent)
                e.value = value;
        }
        else {
            oldValue = null;
            ++modCount;
            tab[index] = new HashEntry<K,V>(key, hash, first, value);
            count = c; // write-volatile
        }
        return oldValue;
    } finally {
        unlock();
    }
}

This method also runs while holding the segment lock (the whole segment is locked). That is of course required for safety: the data must not be modified concurrently. Before anything else, a check against the threshold makes sure capacity is sufficient, rehashing if it is not.

Next, the method looks for a node with the same key; if one exists, its value is simply replaced. Otherwise a new node is created and added at the head of the hash chain, and then modCount and count must be updated; once again the write to count has to be the last step.

The put method calls rehash when needed. The rehash implementation is also very compact, mainly exploiting the fact that the table size is 2^n; a condensed sketch appears after the expansion notes below.

The harder line to grasp is int index = hash & (tab.length - 1). It turns out that the segment is the real hashtable, i.e. each segment is a hashtable in the traditional sense, as the structural comparison above shows. This line finds the entry's position in that table, and the entry obtained is the head node of the chain at that position. If e != null, the key was found and the node's value just needs replacing (when onlyIfAbsent == false); otherwise a new entry is needed, its successor is first, and tab[index] is pointed at it. What does that mean? The new entry has been inserted at the head of the chain. The rest is easy to follow.

Since put writes shared variables, locking is required for thread safety. put first locates the Segment, then performs the insertion inside it. Insertion takes two steps: first determine whether the Segment's HashEntry array needs expansion, then locate the position in the HashEntry array and add the element.

  • **Whether expansion is needed.** Before inserting an element, put checks whether the Segment's HashEntry array has exceeded its capacity (threshold); if it has, the array is expanded. It is worth noting that the Segment's expansion check is more sensible than HashMap's: HashMap checks whether the element count has reached capacity only after inserting, and expands if so, so if no further element is ever inserted, HashMap has expanded for nothing.

  • **How expansion works.** Expansion first creates an array of twice the original capacity, re-hashes the elements of the original array, and inserts them into the new array. For efficiency, ConcurrentHashMap does not expand the whole container, only the individual segment. A condensed sketch follows.
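Here is a condensed sketch of that per-segment expansion. It is simplified: the real JDK code additionally reuses the trailing run of nodes that all land at the same new index, which this version skips by cloning every node.

@SuppressWarnings("unchecked")
void rehash() {
    HashEntry<K,V>[] oldTable = table;
    int oldCapacity = oldTable.length;
    // Double the capacity; it stays a power of two.
    HashEntry<K,V>[] newTable = (HashEntry<K,V>[]) new HashEntry[oldCapacity << 1];
    threshold = (int) (newTable.length * loadFactor);
    int sizeMask = newTable.length - 1;
    for (int i = 0; i < oldCapacity; i++) {
        // next is final, so nodes cannot be re-linked in place:
        // clone each node onto the head of its new chain.
        for (HashEntry<K,V> e = oldTable[i]; e != null; e = e.next) {
            int idx = e.hash & sizeMask;
            newTable[idx] = new HashEntry<K,V>(e.key, e.hash,
                                               newTable[idx], e.value);
        }
    }
    table = newTable; // write-volatile: publish the new table
}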

Another operation is containsKey; its implementation is much simpler, because it does not need to read the value:

boolean containsKey(Object key, int hash) {
    if (count != 0) { // read-volatile
        HashEntry<K,V> e = getFirst(hash);
        while (e != null) {
            if (e.hash == hash && key.equals(e.key))
                return true;
            e = e.next;
        }
    }
    return false;
}

The size() operation

To count the elements of the whole ConcurrentHashMap, we must sum the element counts of all the Segments. Segment's count field is a volatile variable, so in a multithreaded scenario, couldn't we just add up every Segment's count to get the total size?

No. Although we read the latest value of each Segment's count, a count we accumulated earlier may change before the accumulation finishes, making the total inaccurate. The safest approach would be to lock every Segment's put, remove and clear methods while computing size, but that is obviously very inefficient.

Because the probability that previously accumulated counts change during the counting is very small, ConcurrentHashMap first tries, twice, to count the size of every Segment without locking the Segments; only if the container's counts change during the attempt does it fall back to locking all the Segments and counting again.

So how does ConcurrentHashMap know whether the container changed while it was counting? It uses the modCount variable: put, remove and clear all increment modCount by 1 before operating on an element, so comparing the modCounts gathered before and after the counting tells whether the container's size changed.
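Below is a simplified sketch of that strategy (illustrative only; the real JDK code also re-sums the counts during the recheck and caps the result at Integer.MAX_VALUE):

public int size() {
    final Segment<K,V>[] segments = this.segments;
    int[] mc = new int[segments.length];
    for (int attempt = 0; attempt < 2; attempt++) {   // two unlocked attempts
        int sum = 0;
        int mcSum = 0;
        for (int i = 0; i < segments.length; i++) {
            sum += segments[i].count;                 // read-volatile
            mcSum += mc[i] = segments[i].modCount;    // snapshot each modCount
        }
        boolean clean = true;
        if (mcSum != 0) {                             // any structural change ever?
            for (int i = 0; i < segments.length; i++) {
                if (mc[i] != segments[i].modCount) {  // changed during our pass
                    clean = false;
                    break;
                }
            }
        }
        if (clean)
            return sum;                               // stable snapshot: done
    }
    int sum = 0;                                      // fall back: lock all segments
    for (int i = 0; i < segments.length; i++)
        segments[i].lock();                           // fixed order avoids deadlock
    try {
        for (int i = 0; i < segments.length; i++)
            sum += segments[i].count;
    } finally {
        for (int i = 0; i < segments.length; i++)
            segments[i].unlock();
    }
    return sum;
}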
