Thoroughly understanding Java HashMap and ConcurrentHashMap

Foreword

A key-value Map is a classic data structure in software development, often used to cache data in memory.

This article mainly discusses the concurrent container ConcurrentHashMap. Before we officially start, it is necessary to talk about HashMap, because without it there would be no ConcurrentHashMap.

HashMap

It is well known that HashMap is implemented on top of an array plus linked lists, but the details differ slightly between JDK 1.7 and 1.8.

Based on 1.7

The data structure diagram in 1.7:

First look at the implementation in 1.7.

These are the core member variables of HashMap. What does each of them mean? (A trimmed sketch of the fields follows the list.)

  1. The initial bucket capacity; since the underlying structure is an array, this is the default size of that array.
  2. The maximum bucket capacity.
  3. The default load factor (0.75).
  4. table, the array that actually holds the data.
  5. size, the number of entries stored in the Map.
  6. The bucket capacity, which can be specified explicitly at initialization time.
  7. The load factor, which can be specified explicitly at initialization time.

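For reference, a trimmed sketch of those fields as they appear in the JDK 7 sources; the numbered comments map to the list above:

// 1. default initial bucket (array) capacity, must be a power of two
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

// 2. maximum bucket capacity
static final int MAXIMUM_CAPACITY = 1 << 30;

// 3. default load factor
static final float DEFAULT_LOAD_FACTOR = 0.75f;

// 4. the array that actually holds the data
transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

// 5. number of entries stored in the Map
transient int size;

// 6. resize threshold (capacity * load factor); the capacity can be given explicitly
int threshold;

// 7. the load factor, which can also be given explicitly
final float loadFactor;
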
Focus on explaining the load factor:

The capacity of a HashMap is fixed once given; for example, the default initialization:

public HashMap() {
    this(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR);
}
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " + loadFactor);

    this.loadFactor = loadFactor;
    threshold = initialCapacity;
    init();
}

The default capacity is 16 and the load factor is 0.75. As the Map keeps storing data, once the number of entries reaches 16 * 0.75 = 12, the current capacity of 16 needs to be expanded. Expansion involves rehashing, copying data, and other operations, so it is very costly.

Therefore, it is generally recommended to estimate the size of the HashMap in advance to minimize the performance cost of expansion, as in the small example below.

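As an illustration, here is a minimal sketch of pre-sizing a map. The newMapFor helper is our own name, not a JDK method, and the 0.75f mirrors the default load factor:

import java.util.HashMap;
import java.util.Map;

public class PresizedMapExample {

    // Pick an initial capacity large enough that `expected` entries fit without triggering a resize.
    static <K, V> Map<K, V> newMapFor(int expected) {
        // capacity * 0.75 must be >= expected, so divide by the load factor and round up
        int initialCapacity = (int) (expected / 0.75f) + 1;
        return new HashMap<>(initialCapacity);
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = newMapFor(1000); // no rehash while inserting 1000 entries
        for (int i = 0; i < 1000; i++) {
            scores.put("user-" + i, i);
        }
        System.out.println(scores.size());
    }
}
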
From the code you can see that what actually stores the data is

transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

this array. So how is Entry defined?

Entry is an inner class of HashMap, and its role is easy to see from its member variables (a trimmed sketch follows the list):

  • key is the key that was written.
  • value is, naturally, the value.
  • As mentioned at the beginning, HashMap is composed of an array plus linked lists, and this next reference is what implements the linked list.
  • hash stores the hashcode of the current key.

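For reference, a trimmed sketch of the 1.7 Entry inner class (accessor methods omitted):

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;        // the key written by the caller
    V value;            // the associated value
    Entry<K,V> next;    // next node, used to build the linked list on hash collisions
    int hash;           // cached hashcode of the key

    Entry(int h, K k, V v, Entry<K,V> n) {
        value = v;
        next = n;
        key = k;
        hash = h;
    }
    // getKey()/getValue()/setValue() etc. omitted
}
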
Knowing the basic structure, let's take a look at the important write and get functions:

put method

public V put(K key, V value) {
    if (table == EMPTY_TABLE) {
        inflateTable(threshold);
    }
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key);
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }

    modCount++;
    addEntry(hash, key, value, i);
    return null;
}
  • Determine whether the current array needs to be initialized.
  • If the key is null, store the value under the null key (putForNullKey).
  • Calculate the hashcode from the key.
  • Locate the bucket from the calculated hashcode.
  • If the bucket holds a linked list, traverse it to check whether a node with the same hashcode and key already exists; if so, overwrite its value and return the old one.
  • If the bucket is empty, no data is stored at the current position yet; an Entry object is created and written to the current position.
void addEntry(int hash, K key, V value, int bucketIndex) {
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);
        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);
    }
    createEntry(hash, key, value, bucketIndex);
}

void createEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<>(hash, key, value, e);
    size++;
}

When calling addEntry to write an Entry, you need to determine whether expansion is required.

If necessary, double the capacity, then rehash and relocate the current key.

In createEntry, the existing entry at the current position is passed into the newly created Entry as its next node, so if the bucket already has a value, a linked list is formed at that position (head insertion).

get method

Let's take a look at the get function:

public V get(Object key) {
    if (key == null)
        return getForNullKey();
    Entry<K,V> entry = getEntry(key);

    return null == entry ? null : entry.getValue();
}

final Entry<K,V> getEntry(Object key) {
    if (size == 0) {
        return null;
    }

    int hash = (key == null) ? 0 : hash(key);
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}
  • First, calculate the hashcode from the key and locate the specific bucket.
  • Determine whether that position holds a linked list.
  • If it is not a linked list, return the value directly based on whether the key and the key's hashcode are equal.
  • If it is a linked list, traverse it until the key and hashcode match, then return the value.
  • If nothing is found, return null directly.

Based on 1.8

Looking at the 1.7 implementation, do you see anything that needs to be optimized?

In fact, one obvious thing is:

When hash conflicts are severe, the linked list formed in a bucket grows longer and longer, and query efficiency degrades accordingly: the time complexity becomes O(N).

Therefore, 1.8 focuses on optimizing the query efficiency.

1.8 HashMap structure diagram:

Let's take a look at a few core member variables:

static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

/**
 * The maximum capacity, used if a higher value is implicitly specified
 * by either of the constructors with arguments.
 * MUST be a power of two <= 1<<30.
 */
static final int MAXIMUM_CAPACITY = 1 << 30;

/**
 * The load factor used when none specified in constructor.
 */
static final float DEFAULT_LOAD_FACTOR = 0.75f;

static final int TREEIFY_THRESHOLD = 8;

transient Node<K,V>[] table;

/**
 * Holds cached entrySet(). Note that AbstractMap fields are used
 * for keySet() and values().
 */
transient Set<Map.Entry<K,V>> entrySet;

/**
 * The number of key-value mappings contained in this map.
 */
transient int size;

It is similar to 1.7, but with several important differences:

  • TREEIFY_THRESHOLD is the threshold used to decide whether a linked list should be converted into a red-black tree.
  • The 1.7 Entry is changed to Node.

The core composition of Node is essentially the same as the 1.7 Entry: it stores the key, value, hashcode, and next reference. A trimmed sketch follows.

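For reference, a trimmed sketch of the 1.8 Node class (accessors and equals/hashCode omitted):

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;   // cached hashcode of the key
    final K key;
    V value;
    Node<K,V> next;   // next node in the bucket's linked list

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
    // getKey()/getValue()/setValue()/equals()/hashCode() omitted
}
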
Let's take a look at the core methods.

put method

It looks more complicated than in 1.7; let's break it down step by step (an abridged sketch of the method follows the list):

  1. Check whether the current bucket array is empty; if it is, it needs to be initialized (resize decides whether initialization is needed).
  2. Locate the specific bucket from the hashcode of the current key and check whether it is empty. If it is empty, there is no hash conflict, and a new node can be written directly at the current position.
  3. If the current bucket has a value (hash conflict), compare the key and the key's hashcode of the node in the bucket with the written key. If they are equal, assign that node to e; in step 8 the value is assigned and returned uniformly.
  4. If the current bucket is a red-black tree, the data is written with the red-black tree logic.
  5. If it is a linked list, wrap the current key and value in a new node and append it to the end of the list in the bucket (forming a linked list).
  6. Then check whether the length of the list exceeds the preset threshold; if it does, the list is converted into a red-black tree.
  7. If a node with the same key is found during the traversal, exit the traversal directly.
  8. If e != null, a node with the same key already exists, and its value needs to be overwritten.
  9. Finally, check whether expansion is required.

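For reference, a lightly abridged sketch of the 1.8 putVal logic, following the OpenJDK 8 sources with the LinkedHashMap callback hooks trimmed; the numbered comments map to the steps above:

final V putVal(int hash, K key, V value, boolean onlyIfAbsent, boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;                       // 1. lazily initialize the table
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);          // 2. empty bucket: write directly
    else {
        Node<K,V> e; K k;
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;                                         // 3. same key as the first node
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value); // 4. red-black tree
        else {
            for (int binCount = 0; ; ++binCount) {         // 5. linked list
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1)
                        treeifyBin(tab, hash);             // 6. treeify when the list is too long
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;                                 // 7. same key found: stop traversing
                p = e;
            }
        }
        if (e != null) {                                   // 8. existing mapping: overwrite value
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            return oldValue;
        }
    }
    ++modCount;
    if (++size > threshold)
        resize();                                          // 9. expand if needed
    return null;
}
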
get method

public V get(Object key) {
    Node<K,V> e;
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        if ((e = first.next) != null) {
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            do {
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}

The get method looks much simpler.

  • First, locate the bucket after hashing the key.
  • If the bucket is empty, return null directly.
  • Otherwise, check whether the key of the first node in the bucket (which may head a linked list or a red-black tree) is the queried key; if so, return its value directly.
  • If the first node does not match, check whether the next node is a red-black tree or a linked list.
  • For a red-black tree, look up the value with the tree search.
  • Otherwise, traverse the linked list until a match is found and return its value.

From these two core methods (get/put) you can see that 1.8 optimizes the long linked list: after it is converted into a red-black tree, query efficiency improves to O(log n).

However, the original problems of HashMap still exist; for example, it is prone to infinite loops when used in concurrent scenarios.

final HashMap<String, String> map = new HashMap<String, String>();
for (int i = 0; i < 1000; i++) {
    new Thread(new Runnable() {
        @Override
        public void run() {
            map.put(UUID.randomUUID().toString(), "");
        }
    }).start();
}

But why? A simple analysis:

From the discussion above, you may remember that the resize() method is called when the HashMap expands. Concurrent expansion can easily form a circular linked list in a bucket, so when a non-existent key whose computed index happens to fall on that circular list is looked up, the traversal never terminates and an infinite loop appears.

As shown below:

Traversal method

It is also worth noting that HashMap is usually traversed in one of the following two ways:

Iterator<Map.Entry<String, Integer>> entryIterator = map.entrySet().iterator();
while (entryIterator.hasNext()) {
    Map.Entry<String, Integer> next = entryIterator.next();
    System.out.println("key=" + next.getKey() + " value=" + next.getValue());
}

Iterator<String> iterator = map.keySet().iterator();
while (iterator.hasNext()) {
    String key = iterator.next();
    System.out.println("key=" + key + " value=" + map.get(key));
}

It is strongly recommended to use the first way, traversing the entrySet.

The first approach retrieves the key and value at the same time; the second has to look up the value by key again, which is less efficient.

A brief summary of HashMap: whether in 1.7 or 1.8, the JDK does not perform any synchronization on it, so concurrent use causes problems, and in 1.7 the infinite loop can even make the system unavailable (the infinite-loop problem has been fixed in 1.8).

Therefore, the JDK provides a dedicated ConcurrentHashMap, located in the java.util.concurrent package, specifically designed to solve concurrency problems.

Friends who have persisted to this point have already laid the groundwork for ConcurrentHashMap; the formal analysis begins below.

ConcurrentHashMap

ConcurrentHashMap also comes in 1.7 and 1.8 versions, and the two differ slightly in implementation.

Based on 1.7

Let's look at the 1.7 implementation first. Below is its structure diagram:

As shown in the figure, it is composed of a Segment array and HashEntry lists. Like HashMap, it is still an array plus linked lists.

Its core member variables:

/**
 * Segment array, when storing data, you first need to locate it in a specific segment.
 */
final Segment<K,V>[] segments;
transient Set<K> keySet;
transient Set<Map.Entry<K,V>> entrySet;

Segment is an inner class of ConcurrentHashMap; its main components are as follows:

static final class Segment<K,V> extends ReentrantLock implements Serializable {

    private static final long serialVersionUID = 2249069246763182397L;

    // Same as HashEntry in HashMap, the bucket that really stores data
    transient volatile HashEntry<K,V>[] table;

    transient int count;
    transient int modCount;
    transient int threshold;
    final float loadFactor;
}

Look at the composition of HashEntry:

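For reference, a trimmed sketch of HashEntry (the Unsafe-based setNext helper is omitted):

static final class HashEntry<K,V> {
    final int hash;
    final K key;
    volatile V value;                // volatile: readers always see the latest value
    volatile HashEntry<K,V> next;    // volatile next pointer for the bucket's linked list

    HashEntry(int hash, K key, V value, HashEntry<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
    // setNext() and other helpers omitted
}
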
It is very similar to HashMap's Entry; the only difference is that the core data such as value, and the linked list pointer next, are marked volatile, which guarantees visibility when reading.

In principle, ConcurrentHashMap uses segmented locking, with Segment extending ReentrantLock. Unlike Hashtable, where both put and get need to synchronize, ConcurrentHashMap theoretically supports concurrencyLevel (the number of Segments) concurrent writer threads: whenever a thread takes a lock to access one Segment, the other Segments are unaffected.

Let's take a look at the core put and get methods.

put method

public V put(K key, V value) {
    Segment<K,V> s;
    if (value == null)
        throw new NullPointerException();
    int hash = hash(key);
    int j = (hash >>> segmentShift) & segmentMask;
    if ((s = (Segment<K,V>)UNSAFE.getObject          // nonvolatile; recheck
         (segments, (j << SSHIFT) + SBASE)) == null) //  in ensureSegment
        s = ensureSegment(j);
    return s.put(key, hash, value, false);
}

The first step is to locate the Segment by the key, and then do the actual put in the corresponding Segment.

final V put(K key, int hash, V value, boolean onlyIfAbsent) {
    HashEntry<K,V> node = tryLock() ? null :
        scanAndLockForPut(key, hash, value);
    V oldValue;
    try {
        HashEntry<K,V>[] tab = table;
        int index = (tab.length - 1) & hash;
        HashEntry<K,V> first = entryAt(tab, index);
        for (HashEntry<K,V> e = first;;) {
            if (e != null) {
                K k;
                if ((k = e.key) == key ||
                    (e.hash == hash && key.equals(k))) {
                    oldValue = e.value;
                    if (!onlyIfAbsent) {
                        e.value = value;
                        ++modCount;
                    }
                    break;
                }
                e = e.next;
            }
            else {
                if (node != null)
                    node.setNext(first);
                else
                    node = new HashEntry<K,V>(hash, key, value, first);
                int c = count + 1;
                if (c > threshold && tab.length < MAXIMUM_CAPACITY)
                    rehash(node);
                else
                    setEntryAt(tab, index, node);
                ++modCount;
                count = c;
                oldValue = null;
                break;
            }
        }
    } finally {
        unlock();
    }
    return oldValue;
}

Although value in HashEntry is marked with the volatile keyword, that does not guarantee atomicity of concurrent updates, so the put operation still needs to lock.

The first step tries to acquire the lock. If that fails, another thread must be competing for it, so scanAndLockForPut() is used to spin for the lock.

  1. Try to acquire the lock by spinning.
  2. If the number of retries reaches MAX_SCAN_RETRIES, fall back to a blocking lock() to guarantee that the lock is eventually acquired.

Let's walk through the flow of put with the diagram.

  1. Locate the HashEntry in the table of the current Segment through the hashcode of the key.
  2. Traverse that HashEntry list. If a node is not null, check whether the incoming key equals the key of the node being traversed; if they are equal, overwrite the old value.
  3. If no matching node is found, create a new HashEntry and add it to the Segment; before that, check whether the Segment needs to be expanded.
  4. Finally, release the lock of the current Segment that was acquired in step 1.

get method

public V get(Object key) {
    Segment<K,V> s; // manually integrate access methods to reduce overhead
    HashEntry<K,V>[] tab;
    int h = hash(key);
    long u = (((h >>> segmentShift) & segmentMask) << SSHIFT) + SBASE;
    if ((s = (Segment<K,V>)UNSAFE.getObjectVolatile(segments, u)) != null &&
        (tab = s.table) != null) {
        for (HashEntry<K,V> e = (HashEntry<K,V>) UNSAFE.getObjectVolatile
                 (tab, ((long)(((tab.length - 1) & h)) << TSHIFT) + TBASE);
             e != null; e = e.next) {
            K k;
            if ((k = e.key) == key || (e.hash == h && key.equals(k)))
                return e.value;
        }
    }
    return null;
}

The get logic is relatively simple:

You only need to hash the key to locate the specific Segment, and then hash again to locate the specific element within it.

Since the value attribute in HashEntry is marked with the volatile keyword, memory visibility is guaranteed, so the latest value is read every time.

The get method of ConcurrentHashMap is very efficient, because the whole process does not need to be locked .

Based on 1.8

1.7 already solves the concurrency problem and supports up to N (the number of Segments) concurrent writers, but it still has the same problem as the 1.7 HashMap.

That is, the efficiency of queries that traverse a linked list is too low.

Therefore 1.8 made some adjustments to the data structure.

First look at the underlying structure:

Does it look similar to the 1.8 HashMap structure?

The original Segment segmented lock is abandoned; instead, CAS + synchronized is used to guarantee concurrency safety.

The HashEntry that stored the data in 1.7 is also changed to Node, although the effect is the same.

Both val and next are marked volatile to ensure visibility, as in the sketch below.

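For reference, a trimmed sketch of the 1.8 ConcurrentHashMap Node (accessors and the find helper omitted); note the volatile fields:

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    volatile V val;             // volatile value: readers see the latest write without locking
    volatile Node<K,V> next;    // volatile next pointer

    Node(int hash, K key, V val, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.val = val;
        this.next = next;
    }
    // getValue()/setValue()/find() omitted
}
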
put method

Focus on the put function (the main steps are listed below; an abridged sketch follows the list):

  • Calculate the hashcode from the key.
  • Determine whether the table needs to be initialized.
  • f is the Node located for the current key. If it is null, data can be written at the current position: CAS is used to attempt the write, and if that fails the loop spins until the write succeeds.
  • If the hashcode at the current location == MOVED == -1, an expansion is in progress, and the current thread helps with it.
  • If none of the above applies, the data is written under a synchronized lock on the head node.
  • If the length of the list becomes greater than TREEIFY_THRESHOLD, it is converted into a red-black tree.

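For reference, a lightly abridged sketch of the putVal logic, following the OpenJDK 8 sources with some bookkeeping trimmed; the comments map to the steps above:

final V putVal(K key, V value, boolean onlyIfAbsent) {
    if (key == null || value == null) throw new NullPointerException();
    int hash = spread(key.hashCode());
    int binCount = 0;
    for (Node<K,V>[] tab = table;;) {
        Node<K,V> f; int n, i, fh;
        if (tab == null || (n = tab.length) == 0)
            tab = initTable();                              // lazily initialize the table
        else if ((f = tabAt(tab, i = (n - 1) & hash)) == null) {
            // empty bin: try a lock-free CAS write; retry the loop if it fails
            if (casTabAt(tab, i, null, new Node<K,V>(hash, key, value, null)))
                break;
        }
        else if ((fh = f.hash) == MOVED)
            tab = helpTransfer(tab, f);                     // a resize is in progress: help move bins
        else {
            V oldVal = null;
            synchronized (f) {                              // lock only the head node of this bin
                if (tabAt(tab, i) == f) {
                    if (fh >= 0) {                          // ordinary linked list
                        binCount = 1;
                        for (Node<K,V> e = f;; ++binCount) {
                            K ek;
                            if (e.hash == hash &&
                                ((ek = e.key) == key || (ek != null && key.equals(ek)))) {
                                oldVal = e.val;
                                if (!onlyIfAbsent)
                                    e.val = value;          // same key: overwrite
                                break;
                            }
                            Node<K,V> pred = e;
                            if ((e = e.next) == null) {
                                pred.next = new Node<K,V>(hash, key, value, null); // append to tail
                                break;
                            }
                        }
                    }
                    else if (f instanceof TreeBin) {        // red-black tree bin
                        Node<K,V> p;
                        binCount = 2;
                        if ((p = ((TreeBin<K,V>)f).putTreeVal(hash, key, value)) != null) {
                            oldVal = p.val;
                            if (!onlyIfAbsent)
                                p.val = value;
                        }
                    }
                }
            }
            if (binCount != 0) {
                if (binCount >= TREEIFY_THRESHOLD)
                    treeifyBin(tab, i);                     // convert the bin to a red-black tree
                if (oldVal != null)
                    return oldVal;
                break;
            }
        }
    }
    addCount(1L, binCount);                                 // update size and expand if needed
    return null;
}
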
get method

  • Locate the bucket from the computed hashcode; if the head node matches, return its value directly.
  • If it is a red-black tree, look up the value with the tree search.
  • Otherwise, traverse the linked list to find the matching value. (A sketch of the method follows the list.)

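For reference, a sketch of the 1.8 get that follows the OpenJDK 8 sources:

public V get(Object key) {
    Node<K,V>[] tab; Node<K,V> e, p; int n, eh; K ek;
    int h = spread(key.hashCode());
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (e = tabAt(tab, (n - 1) & h)) != null) {
        if ((eh = e.hash) == h) {                       // hit on the head node of the bin
            if ((ek = e.key) == key || (ek != null && key.equals(ek)))
                return e.val;
        }
        else if (eh < 0)                                // tree bin or forwarding node: delegate
            return (p = e.find(h, key)) != null ? p.val : null;
        while ((e = e.next) != null) {                  // otherwise walk the linked list
            if (e.hash == h &&
                ((ek = e.key) == key || (ek != null && key.equals(ek))))
                return e.val;
        }
    }
    return null;
}
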
1.8 changes the 1.7 data structure significantly. With the red-black tree, query efficiency is guaranteed (O(log n)), and even ReentrantLock is dropped in favor of synchronized; this shows how well synchronized has been optimized in newer JDK versions.

Summary

After reading through the 1.7 and 1.8 implementations of HashMap and ConcurrentHashMap, I believe everyone's understanding of them is now more solid.

In fact, this is also a focus of interviews; the usual line of questioning is:

  1. Talk about the HashMap as you understand it, and walk through the get/put flow.
  2. What optimizations were made in 1.8?
  3. Is it thread-safe?
  4. What problems can the lack of thread safety cause?
  5. How do you solve them? Are there any thread-safe concurrent containers?
  6. How is ConcurrentHashMap implemented? What are the differences between 1.7 and 1.8? Why is it done that way?

I believe that after reading this carefully, you can hand this series of questions right back to the interviewer.

Besides coming up in interviews, it also has many practical applications: the Cache implementation in Guava, mentioned earlier, uses the idea behind ConcurrentHashMap.

At the same time, you can learn the optimization ideas and concurrency solutions of the JDK authors.

In fact, this article originated from an issue on GitHub, and I hope everyone can participate and maintain that project together.


Source: www.cnblogs.com/xiaobin123/p/12729996.html