HashMap source code analysis (jdk1.8, guaranteed you can follow it)

Nowadays interviewers at the big companies almost always ask a few questions about HashMap, and it is also one of the collections we use constantly in day-to-day development. So I spent a good deal of time researching and analyzing it to write this article, which is based on jdk1.8. It is long, but it proceeds step by step; I believe that if you read it patiently you will come away with something.

First, the questions to answer

This article hopes to answer the following questions:

(1) What is HashMap's underlying data structure?

(2) What is the underlying principle of HashMap's CRUD operations?

(3) How does HashMap implement expansion?

(4) How does HashMap resolve hash conflicts?

(5) Why is HashMap not thread-safe?

Let's take these questions with us and lift HashMap's veil.

Second, an overview of HashMap

HashMap first appeared in jdk1.2 and changed little up to jdk1.7, but jdk1.8 suddenly made a lot of changes. The most notable one:

The storage structure changed from array + linked list in jdk1.7 to array + linked list + red-black tree in jdk1.8.

In addition, HashMap is not thread-safe: when multiple threads perform CRUD operations on the same HashMap, the consistency of the data cannot be guaranteed.

Let's now begin the step-by-step analysis.

Third, in-depth analysis of HashMap

1. The underlying data structure

For comparison, we first give a diagram of the jdk1.7 memory layout.

As the diagram shows, in jdk1.7 an element is first stored in the array; as more elements accumulate at the same slot they are chained behind it, so each array slot may carry a linked list of elements. This is the famous "zipper" (chaining) storage scheme.

As more and more elements are stored, the lists grow longer and longer, and finding an element not only fails to get faster but actually gets much slower (a linked list suits insertion and deletion, not lookup). So this list structure was improved. How? By turning a long list into a tree that is efficient to search: yes, a red-black tree. The HashMap data structure thus becomes the one shown below.

What the optimization does, then, is turn part of the linked-list structure into a red-black tree. jdk1.7's strength was already efficient insertion and deletion; in jdk1.8, insertion and deletion stay efficient and lookup efficiency improves as well.

Note: converting to a red-black tree does not automatically improve efficiency. A list is converted to a red-black tree only when the list length reaches 8 and the array length is at least 64.
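For reference, these thresholds are plain constants in the jdk1.8 source (reproduced here with short comments; UNTREEIFY_THRESHOLD is the companion shrink-back threshold used during resizing):

// constants from java.util.HashMap (jdk1.8)
static final int TREEIFY_THRESHOLD = 8;     // a bin is treeified once its list reaches this length
static final int UNTREEIFY_THRESHOLD = 6;   // a tree shrinks back to a list at this size during resize
static final int MIN_TREEIFY_CAPACITY = 64; // minimum table length before any treeifying happens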

Question one: what is a red-black tree?

A red-black tree is a self-balancing binary search tree, so its search efficiency is very high: lookup drops from the linked list's O(n) to O(log n). If you haven't studied red-black trees, it doesn't matter; just remember that their search efficiency is very high.

Question two: why not turn the whole list into a red-black tree right away?

The question really means: why do we wait until the list length is at least 8 before converting to a red-black tree? It can be understood from two angles:

(1) A red-black tree is structurally more complex than a linked list. When a bucket holds only a few nodes, the overall performance of array + linked list + red-black tree is not necessarily better than plain array + linked list; it would be overkill, like using a sledgehammer to crack a nut.

(2) HashMap expands frequently, and expansion keeps splitting and reorganizing the underlying red-black trees, which is very time-consuming. Converting to a red-black tree only once a list is long therefore gives a clear efficiency gain.

OK, at this point I believe we have a good grasp of HashMap's underlying data structure. Now, with the diagram above, let's look at how an element is stored.

2. Storing an element: put

When we store an element, most of the time we do it like this:

public class Test {
	public static void main(String[] args) {
		HashMap<String, Integer> map = new HashMap<>();
		// store one key-value pair
		map.put("张三", 20);
	}
}


In HashMap<String, Integer>, the first type parameter is the key and the second is the value; together they are a key-value pair. To store one we only need to call the put method. What is the underlying principle? Let's first give a flow chart:

Whether or not the flow above reads clearly at a glance, the three decision boxes written in red are the turning points. Let's sort the process out in words:

(1) Step 1: call the put method, passing in the key-value pair

(2) Step 2: compute the hash value with the hash algorithm

(3) Step 3: determine the storage position from the hash value, and check whether that position conflicts with another key

(4) Step 4: if there is no conflict, store directly into the array

(5) Step 5: if there is a conflict, determine what data structure currently sits at that position

(6) Step 6: if it is a red-black tree, insert into the red-black tree directly

(7) Step 7: if it is a linked list, check whether the list length after insertion reaches 8

(8) Step 8: if it does, convert the list into a red-black tree, then insert

(9) Step 9: if it does not, simply insert at the tail of the linked list.

That is the whole data-insertion flow. Looking at the flow alone is not enough; we also need to go down into the source code and see how the code implements it.

Put the cursor on the put method and press F3 (in Eclipse), and we land in the put source. Take a look:

public V put(K key, V value) {
     return putVal(hash(key), key, value, false, true);
}

In other words, calling put actually calls putVal, which takes five parameters:

(1) The first parameter, hash: the key's hash value, computed by calling the hash method

(2) The second parameter, key: the key we passed in; in our example, "张三"

(3) The third parameter, value: the value we passed in; in our example, 20

(4) The fourth parameter, onlyIfAbsent: if true, do not replace the existing value when the key is already present (illustrated in the sketch after this list)

(5) The fifth parameter, evict: if false, the table is in creation mode; for a normal put it is true.
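A small sketch of what onlyIfAbsent means from the caller's side: putIfAbsent is the public method that passes true for it, so an existing value is kept rather than replaced (the class name here is mine, for illustration only):

import java.util.HashMap;

public class OnlyIfAbsentDemo {
	public static void main(String[] args) {
		HashMap<String, Integer> map = new HashMap<>();
		map.put("张三", 20);
		map.put("张三", 21);                  // onlyIfAbsent = false: overwrites
		System.out.println(map.get("张三")); // 21
		map.putIfAbsent("张三", 22);          // onlyIfAbsent = true: keeps 21
		System.out.println(map.get("张三")); // 21
	}
}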

Knowing what these five parameters mean, let's step into putVal:

    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        // Part 1
        if ((tab = table) == null || (n = tab.length) == 0)
            n = (tab = resize()).length;
        // Part 2
        if ((p = tab[i = (n - 1) & hash]) == null)
            tab[i] = newNode(hash, key, value, null);
        // Part 3
        else {
            Node<K,V> e; K k;
            // Part 3a
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            // Part 3b
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            // Part 3c
            else {
                for (int binCount = 0; ; ++binCount) {
                    // Part 3c, step 1
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    // Part 3c, step 2
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    // Part 3c, step 3
                    p = e;
                }
            }
            // Part 3d
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        // Part 4
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

At first glance this code extinguishes any desire to read on; the first time I saw it, it genuinely turned my stomach. But once you draw a flow chart and analyze against it, things get much better. We split the code into four parts overall:

(1) Node<K,V>[] tab: tab is the table array; Node<K,V> p: p is the node at the position currently being inserted into

(2) Part 1:

if ((tab = table) == null || (n = tab.length) == 0)
       n = (tab = resize()).length;

This part means: if the array is empty, create a new one with the resize method. We won't expand on resize here; it comes up in the next section on expansion.

(3) Part 2:

if ((p = tab[i = (n - 1) & hash]) == null)
      tab[i] = newNode(hash, key, value, null);

Here i is the insertion position in the array, computed as (n - 1) & hash. We then check whether that position is already occupied; if there is no conflict, newNode is placed into the array directly. This corresponds to the first decision box in the flow chart.
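Why (n - 1) & hash instead of a modulo? Because the table length n is always a power of two, the bit mask picks out exactly the low bits of the hash and gives the same result as a (slower) modulo. A minimal sketch, with a class name of my own choosing:

public class IndexDemo {
	public static void main(String[] args) {
		int n = 16; // table length, always a power of two
		for (int hash : new int[]{0, 5, 17, 123456}) {
			int byMask = (n - 1) & hash;
			int byMod = Math.floorMod(hash, n); // identical for power-of-two n
			System.out.println(byMask + " == " + byMod);
		}
	}
}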

If the inserted key's hash does conflict with an occupied position, we move on to Part 3, which handles the conflict.

(4) Part 3:

        else {
            Node<K,V> e; K k;
            // Part 3a
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            // Part 3b
            else if (p instanceof TreeNode)
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            // Part 3c
            else {
                for (int binCount = 0; ; ++binCount) {
                    // Part 3c, step 1
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    // Part 3c, step 2
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    // Part 3c, step 3
                    p = e;
                }
            }
            // Part 3d
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }

As you can see, conflicts really are troublesome; fortunately we have split this part up:

a) Part 3a:

if (p.hash == hash 
     &&((k = p.key) == key || (key != null && key.equals(k))))
     e = p;

This checks whether the key of table[i] equals the key being inserted. If it does, the existing node is recorded in e, and later (in Part 3d) the new value directly replaces the old one.

b) Part 3b:

else if (p instanceof TreeNode)
       e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);

This checks whether the bucket's data structure is a red-black tree or a linked list. If it is a red-black tree, putTreeVal inserts into the tree directly. This corresponds to the second decision box in the flow chart.

c) Part 3c:

// Part 3c
else {
     for (int binCount = 0; ; ++binCount) {
         // step 1: reached the tail without finding the key, so append
         if ((e = p.next) == null) {
              p.next = newNode(hash, key, value, null);
              if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                  treeifyBin(tab, hash);
              break;
         }
         // step 2: found an existing node with the same key
         if (e.hash == hash &&
               ((k = e.key) == key || (key != null && key.equals(k))))
              break;
         // step 3: advance to the next node
         p = e;
    }
}

If the data structure is a linked list, we must first traverse it to see whether the key already exists. If it does not, we append newNode(hash, key, value, null) at the tail; if it does, the new value directly replaces the old one.

Note: when the key is not present and the new node is appended at the tail, the code checks binCount >= TREEIFY_THRESHOLD - 1, i.e., whether the list length has reached the threshold of 8. If it has, treeifyBin is called to convert the list into a red-black tree. This corresponds to the third decision box in the flow chart.
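One detail worth seeing here: treeifyBin does not always build a tree. The jdk1.8 source (reproduced below with comments added) prefers resizing when the table is still smaller than MIN_TREEIFY_CAPACITY (64), which is exactly the second condition mentioned in the note back in section 1:

final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    // table still too small: grow it instead of treeifying
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        TreeNode<K,V> hd = null, tl = null;
        // first relink the Node chain as a doubly linked TreeNode chain
        do {
            TreeNode<K,V> p = replacementTreeNode(e, null);
            if (tl == null)
                hd = p;
            else {
                p.prev = tl;
                tl.next = p;
            }
            tl = p;
        } while ((e = e.next) != null);
        // then balance it into an actual red-black tree
        if ((tab[index] = hd) != null)
            hd.treeify(tab);
    }
}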

(5) Part 4:

if (++size > threshold)
        resize();
afterNodeInsertion(evict);
return null;

After a successful insertion, we still have to check whether the actual number of keys, size, exceeds the threshold; if it does, expansion begins.
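A quick sketch of the numbers involved, assuming the default capacity and load factor (both constants are from the jdk1.8 source; the class name is mine):

public class ThresholdDemo {
	public static void main(String[] args) {
		int capacity = 16;        // DEFAULT_INITIAL_CAPACITY
		float loadFactor = 0.75f; // DEFAULT_LOAD_FACTOR
		int threshold = (int) (capacity * loadFactor);
		System.out.println(threshold); // 12: the put that makes size 13 triggers a resize
	}
}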

3. Expansion (resizing)

Why expand at all? Obviously because the current capacity is no longer enough, that is, too many elements have been put in. Let's again give a flow diagram first and then analyze.

Expansion itself is relatively simple: HashMap first computes the new table's capacity and threshold, then initializes a new table and remaps the old keys into it. If the old table contains red-black trees, remapping also involves splitting those trees. The whole thing matches the intuitive notion of growing capacity. We analyze the code together with the flow chart:

    final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table;
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        int oldThr = threshold;
        int newCap, newThr = 0;
        // Part 1: grow the capacity
        if (oldCap > 0) {
            if (oldCap >= MAXIMUM_CAPACITY) {
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                newThr = oldThr << 1; // double threshold
        }
        // Part 2: set the new threshold
        else if (oldThr > 0) // initial capacity was placed in threshold
            newCap = oldThr;
        else {               // zero initial threshold signifies using defaults
            newCap = DEFAULT_INITIAL_CAPACITY;
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        if (newThr == 0) {
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
            Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        // Part 3: carry the old data over into the new array
        if (oldTab != null) {
            for (int j = 0; j < oldCap; ++j) {
                Node<K,V> e;
                if ((e = oldTab[j]) != null) {
                    oldTab[j] = null;
                    // a single node: remap it directly by index
                    if (e.next == null)
                        newTab[e.hash & (newCap - 1)] = e;
                    // a red-black tree: split the tree, then remap
                    else if (e instanceof TreeNode)
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else {
                         // a multi-node linked list: split it into two lists
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do {
                            next = e.next;
                            if ((e.hash & oldCap) == 0) {
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                            else {
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        // list 1 (lo) stays at the original index
                        if (loTail != null) {
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        // list 2 (hi) goes to the original index plus the old capacity
                        if (hiTail != null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }

This code is just as nauseating as putVal, but again we analyze it in segments:

(1) Part 1:

// Part 1: grow the capacity
if (oldCap > 0) {
      if (oldCap >= MAXIMUM_CAPACITY) {
           threshold = Integer.MAX_VALUE;
           return oldTab;
      }
      else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
          oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
}

This code is readable enough: if the array capacity has already reached the maximum, the threshold is simply set to Integer.MAX_VALUE; otherwise both the capacity and the threshold are doubled. Note that the doubling, e.g. newThr = oldThr << 1, is done with a shift operation.

(2) Part 2:

// Part 2: set the new threshold
else if (oldThr > 0) // the threshold was already initialized: use it as the capacity
      newCap = oldThr;
else {    // neither capacity nor threshold initialized: fall back to the defaults
      newCap = DEFAULT_INITIAL_CAPACITY;
      newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
}
if (newThr == 0) {
      float ft = (float)newCap * loadFactor;
      newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
}
// publish the new threshold
threshold = newThr;
@SuppressWarnings({"rawtypes","unchecked"})
Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
table = newTab;

The first else if means: if the threshold was already initialized (a capacity was passed to the constructor and stashed in threshold), use it directly as the new capacity. The final else means nothing was initialized, so the default capacity and default threshold are used. After that, if newThr is still 0, it is computed as newCap * loadFactor, capped at Integer.MAX_VALUE.
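A worked example of that first branch, under the assumption that the map was created as new HashMap<>(20): the constructor stores tableSizeFor(20) = 32 into threshold and leaves the table null; the first put then calls resize, which takes this branch:

import java.util.HashMap;

public class InitDemo {
	public static void main(String[] args) {
		// constructor: threshold = tableSizeFor(20) = 32, table still null
		HashMap<String, Integer> map = new HashMap<>(20);
		// first put calls resize(): newCap = 32 (the stashed threshold),
		// and newThr = (int) (32 * 0.75f) = 24
		map.put("a", 1);
	}
}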

(3) Part 3

Part 3 is also quite involved: it copies the old data into the new array. Note that there are the following cases:

A: after expansion, if the newly participating bit of the hash value is 0, then the element's new position = its original position.

B: after expansion, if the newly participating bit of the hash value is 1, then the element's new position = its original position + the old capacity.

Which bit of the hash value newly participates in the computation? Write the hash in binary: the newly participating bit is the one selected by oldCap, i.e., for an old capacity of 16 (10000 in binary), the fifth bit from the low end.

There is also a very elegant design here: after expansion the table is twice its original length, so the new table can be viewed as two halves, a low half and a high half. Whether a key-value pair from an original list lands in the low half or the high half is decided by whether e.hash & oldCap == 0. What is the advantage of this test?

For example, with n = 16 (10000 in binary), e.hash & oldCap == 0 depends entirely on whether the fifth lowest bit of e.hash is 0 or 1, which is essentially a coin flip: each node has roughly a 50% chance of staying in the low half and a 50% chance of moving to the high half of the new table.
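A minimal sketch of the split arithmetic, with made-up hash values that all land in bucket 5 of the old 16-slot table (class name is mine):

public class SplitDemo {
	public static void main(String[] args) {
		int oldCap = 16;                // old table length, 10000 in binary
		int[] hashes = {5, 21, 37, 53}; // all map to index 5 when the capacity is 16
		for (int h : hashes) {
			int oldIndex = h & (oldCap - 1);
			// the bit selected by oldCap decides: low half or high half
			int newIndex = ((h & oldCap) == 0) ? oldIndex : oldIndex + oldCap;
			System.out.println(h + " -> " + newIndex); // 5->5, 21->21, 37->5, 53->21
		}
	}
}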

OK, that basically finishes the expansion part. One question is still unresolved: we have clarified how storage works and how expansion works, but what happens when an address conflict appears?

4. Resolving address conflicts

An address conflict presupposes that two keys produce the same computed hash value, so first let's see how HashMap computes its hash:

static final int hash(Object key) {
     int h;
     return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

The code is super simple: the hash is just the key's hashCode XORed with itself shifted right by 16 bits. Why use XOR here? A picture makes it clear:

In other words, the XOR folds the high 16 bits of hashCode into the low 16 bits, so the computed hashes are spread more uniformly and conflicts are less likely. But conflicts still happen; how are they resolved when they do?
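A small sketch of this spreading step with an arbitrary key (the class name is mine):

public class HashSpreadDemo {
	public static void main(String[] args) {
		int h = "张三".hashCode();
		int hash = h ^ (h >>> 16);  // fold the high 16 bits into the low 16 bits
		int n = 16;                 // table length
		int index = (n - 1) & hash; // the bucket index only ever uses the low bits
		System.out.println(Integer.toBinaryString(h));
		System.out.println(Integer.toBinaryString(hash));
		System.out.println(index);
	}
}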

In data-structure terms, the common ways to handle hash conflicts are open addressing, rehashing, chaining (the linked-address method), and a public overflow area. HashMap handles hash conflicts with chaining.

The basic idea of chaining is to link all elements whose hash address is i into a singly linked list, called a synonym chain, and store the list's head pointer in slot i of the hash table; lookup, insertion, and deletion then operate mainly within that chain. Chaining suits frequent insertions and deletions well.

In short, when address conflicts arise, the colliding elements are simply strung one after another into a chain, which is exactly the HashMap underlying structure we saw earlier.
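Those chain links are the Node class in the jdk1.8 source; an abridged excerpt showing the fields that form the chain:

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;   // the key's spread hash, cached
    final K key;
    V value;
    Node<K,V> next;   // the pointer that strings colliding nodes into a chain

    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
    // equals, hashCode, toString, getKey, getValue, setValue elided
}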

5. HashMap constructors

Having dealt with the questions above, it is high time we talked about the constructors. Let's go through them:

There are four constructors in total:

The first:

public HashMap() {
     this.loadFactor = DEFAULT_LOAD_FACTOR; 
}

The second:

public HashMap(int initialCapacity) {
     this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

The third:

public HashMap(Map<? extends K, ? extends V> m) {
    this.loadFactor = DEFAULT_LOAD_FACTOR;
    putMapEntries(m, false);
}

The fourth:

   public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);
        this.loadFactor = loadFactor;
        this.threshold = tableSizeFor(initialCapacity);
    }

Of the four constructors, the fourth is clearly the most involved, so let's analyze it; the other three then follow naturally. Two new terms appear in it: loadFactor and initialCapacity. We analyze them one by one:

(1) initialCapacity, the initial capacity

The official recommendation is to pass a power of two: 2, 4, 8, 16, and so on. But what if we carelessly pass, say, 20? It doesn't matter: based on the value you pass, HashMap finds the smallest power of two that is not less than it (for 20 that is 32) and uses that as the initial capacity.
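The rounding is done by tableSizeFor. The jdk1.8 implementation (reproduced below with comments added) smears the highest set bit of cap - 1 down into every lower position and then adds one:

static final int tableSizeFor(int cap) {
    int n = cap - 1;            // e.g. cap = 20: n = 19 (10011 in binary)
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;              // n is now 11111 (31)
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1; // 32
}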

(2) loadFactor, the load factor

The load factor defaults to 0.75. It expresses how full the hash table is allowed to get, via the formula: initialCapacity * loadFactor = threshold, the number of entries that triggers a resize. The larger the load factor, the fuller the table gets and the more elements it holds before resizing; but with more elements the lists grow longer, so lookup efficiency drops. Conversely, the smaller the load factor, the sparser the data in the table: space is wasted, but lookups are fast.

Why is the default 0.75? Here is an excerpt from the jdk documentation:

If your English is shaky it may read like gibberish, but the general meaning comes through. The third line mentions Poisson_distribution, i.e., the Poisson distribution. The key point is this:

By the time a bucket holds 8 elements, the probability of that happening has become vanishingly small; in other words, with 0.75 as the load factor, a chain longer than 8 at any collision position is almost impossible.

6. Why is HashMap not thread-safe?

The answer to this one is simple: the methods in the source are all non-thread-safe; you won't find the keyword synchronized anywhere in them, so thread safety cannot be guaranteed. That is why ConcurrentHashMap exists.
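A quick, deliberately non-deterministic sketch that usually makes the problem visible: two threads put disjoint keys, yet the final size often comes out below 20000 because the unsynchronized internal updates race (on some runs it may even throw). Swapping in a ConcurrentHashMap fixes it. The class name is mine:

import java.util.HashMap;

public class UnsafeDemo {
	public static void main(String[] args) throws InterruptedException {
		HashMap<Integer, Integer> map = new HashMap<>();
		Thread t1 = new Thread(() -> { for (int i = 0; i < 10_000; i++) map.put(i, i); });
		Thread t2 = new Thread(() -> { for (int i = 10_000; i < 20_000; i++) map.put(i, i); });
		t1.start(); t2.start();
		t1.join(); t2.join();
		// frequently prints less than 20000; a ConcurrentHashMap always prints 20000
		System.out.println(map.size());
	}
}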

Having written this far, the core content is finally finished. Of course, HashMap comes up in many more interview questions than can be covered exhaustively here. If anything is missing, I will supplement it in the future. Criticism and corrections are welcome.


Source: juejin.im/post/5d5d54d6f265da03b46bf349