A New Understanding of HashMap

Brief introduction

HashMap is the map implementation Java programmers use most often for key-value data. It stores entries according to the hashCode of the key, so in most cases a value can be located directly, which makes access very fast; the iteration order, however, is not guaranteed. HashMap allows at most one null key and any number of null values, and it is not thread-safe; if thread safety is needed, use ConcurrentHashMap or wrap the map with Collections.synchronizedMap. In JDK 1.8 the underlying implementation was optimized: red-black trees were introduced and the resize (expansion) logic was improved. Let's take a fresh look at the JDK 1.8 HashMap and see exactly what was optimized.
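As a quick illustration of these properties, here is a minimal, runnable sketch (the class name and values are mine, not from the source):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HashMapBasics {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put(null, "value for the single null key"); // at most one null key
        map.put("a", null);                             // multiple null values are fine
        map.put("b", null);
        System.out.println(map.get(null));              // value for the single null key

        // HashMap itself is not thread-safe; wrap it, or use ConcurrentHashMap.
        Map<String, String> syncMap = Collections.synchronizedMap(new HashMap<>());
        Map<String, String> concMap = new ConcurrentHashMap<>(); // note: rejects null keys/values
        syncMap.put("k", "v");
        concMap.put("k", "v");
    }
}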

Storage structure

To understand HashMap we first need to know what it is, that is, its storage structure; then we can work out what it does, that is, how its operations are implemented. We all know that HashMap stores data in a hash table, placing each entry according to the hash value of its key. Different keys can produce the same hash value, which leads to a collision. The classic ways a hash table resolves collisions are open addressing and separate chaining; Java's HashMap uses separate chaining, which in simple terms is an array combined with linked lists. Each array element is the head of a linked list: when a stored entry's hash maps to an array index that is already occupied, the entry is placed on the list at that index. Now consider a problem: however well the hash algorithm is designed, some lists will inevitably grow too long, and once that happens the performance of HashMap degrades badly. The JDK 1.8 version therefore further optimizes the data structure by introducing red-black trees: when a list grows longer than the default threshold of 8 nodes (and the table length is at least MIN_TREEIFY_CAPACITY = 64), the list is converted into a red-black tree, as shown in the figure below.

[Figure: hashMap.png — the HashMap storage structure: a bucket array whose slots hold linked lists or red-black trees]

Important fields

HashMap locates entries by the hash of the key, so its quality and performance depend directly on the hash algorithm: the more uniformly the hash results are dispersed, the lower the probability of collisions and the more efficient insertion and retrieval become. Of course the size of the hash array matters as well: if the array is large, entries will be spread out even with a poor hash algorithm; if it is small, collisions will occur frequently even with a good one. We therefore need to weigh the cost of space against the cost of time and find a balanced value.

JDK 1.8 makes this time/space trade-off with many optimizations over previous versions: not only is the data structure improved, the resize logic is optimized as well, which greatly improves HashMap's performance. Let's walk through the source code to see the concrete implementation.

Let's first look at several of the more important fields of HashMap.

//Default initial capacity; must be a power of 2.
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // 16

//Limit on the number of key-value pairs the map can hold; when the map's size > threshold, a resize occurs. threshold = capacity * load factor
int threshold;

//Maximum capacity of a HashMap
static final int MAXIMUM_CAPACITY = 1 << 30;

//Default load factor; a resize occurs when the size exceeds 0.75 * table.length
static final float DEFAULT_LOAD_FACTOR = 0.75f;

//The load factor of this HashMap, specified in the constructor.
final float loadFactor;

//A list holding more than 8 nodes is converted to a red-black tree
static final int TREEIFY_THRESHOLD = 8;

//Threshold for converting a red-black tree back into a list: 6 nodes
static final int UNTREEIFY_THRESHOLD = 6;

//Elements are stored in a Node array whose length is always a power of 2.
transient Node<K,V>[] table;

// Minimum table length required before lists are converted to red-black trees
static final int MIN_TREEIFY_CAPACITY = 64; 

// Linked-list node, implementing Map.Entry
static class Node<K,V> implements Map.Entry<K,V> {  
    final int hash;
    final K key;
    V value;
    Node<K,V> next;
    // ... ...
}

// Red-black tree node
static final class TreeNode<K,V> extends LinkedHashMap.Entry<K,V> {
    TreeNode<K,V> parent;  // red-black tree links
    TreeNode<K,V> left;
    TreeNode<K,V> right;
    TreeNode<K,V> prev;    // needed to unlink next upon deletion
    boolean red;
   
    // ...
}

These fields of HashMap are quite easy to understand. But there is a hidden question here: why is the default length of the bucket array 16, and why must the length be 2^n?

Let's first talk about why the length of the hash array is 2^n.

In fact, in both JDK 1.7 and JDK 1.8, the index position of a key is calculated as hash & (length - 1).

We should know that hash % length is equivalent to hash & (length - 1) when length is a power of 2.

Suppose a key's hash value has the following binary form (we only look at the low bits here):
hashCode         0010 0011       ——— in decimal ———>         35           
&                                                             %
(length-1)=15:   0000 1111                               length = 16  
-----------------------------------------------------------------------------------------------
             (binary) 0011  = (decimal) 3                     3

Why use hash & (length - 1) instead of hash % length to compute the index? The computer stores and computes everything in binary, and & maps almost directly onto what the hardware does natively, so it runs faster than %.
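A small sketch (the test values are my own) verifying this equivalence, and showing that it breaks down when the length is not a power of two:

public class IndexDemo {
    public static void main(String[] args) {
        int hash = 35;      // binary ...0010 0011, as in the example above
        int pow2 = 16;      // a power of two
        int notPow2 = 14;   // not a power of two
        System.out.println(hash % pow2);          // 3
        System.out.println(hash & (pow2 - 1));    // 3 -> same result
        System.out.println(hash % notPow2);       // 7
        System.out.println(hash & (notPow2 - 1)); // 1 -> differs: the trick only works for 2^n
    }
}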

And why must length be 2^n? Compare the two cases below: in the first, length = 16; in the second, length = 14.

hashCode         0010 0011                      0010 1111    
&                                     
(length-1)=15:   0000 1111    (length-1) = 15:  0000 1111  
----------------------------------------------------------------------------------------------
                      0011                           1111 
hashCode         0010 1110                      1110 1100  
&                                     
(length-1)=13:   0000 1101    (length-1) = 13:  0000 1101  
----------------------------------------------------------------------------------------------
                      1100                           1100 

From this we can see that when the length of the hash array is 2^n, every bit of length - 1 is 1, so the index depends entirely on the low bits of the hash value: two different hash values land in different slots (0011 and 1111 above). When the length is not a power of 2, for example length = 14 with length - 1 = 13 = 1101, some bits of the mask are 0, so different hash values collapse onto the same index (both examples above land on 1100) and the slots whose index needs a 1 in those positions can never be used. As long as the hash values are uniformly distributed, a power-of-two length therefore keeps the collision probability much lower, which is why a length of 2^n works better.

Second, when the length is a power of 2, resizing is also convenient: the resize algorithm in JDK 1.8 is optimized in a very clever way, which will be covered in the resize method.

Implementation of the main functions

Determining the index position

Whether we add, delete, or look up an entry, we first need to locate its position in the hash bucket array. As said before, HashMap's data structure is a combination of an array and linked lists, so we naturally want the elements to be distributed as evenly as possible, ideally with only one element in each slot: then, once the hash algorithm gives us the position, we immediately know the element there is the one we want, without traversing a list, which greatly improves query efficiency.

The tableSizeFor() method guarantees that when a HashMap is initialized, the size of the hash bucket array is always 2^n.

static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
        
  /**
  Suppose the argument cap = 3.
  Then n = 2, which is 10 in binary.
  n = n | n>>>1  ->  10 | 01  ->  11
  ....
  ....
  n = 11 (binary) = 3 (decimal)
  so the method finally returns n + 1 = 4
  */
}
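To see the rounding behavior, here is the same logic in a standalone demo class (the wrapper class is mine; the method body is copied from the JDK source above):

public class TableSizeForDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(3));  // 4
        System.out.println(tableSizeFor(16)); // 16 (already a power of two)
        System.out.println(tableSizeFor(17)); // 32
    }
}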
//JDK 1.8 hash algorithm
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// JDK 1.7 hash algorithm (simplified form)
 static int hash(Object k) {
    int h = k.hashCode();
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
 }

//Index position
index = hash & (length-1);

//JDK 1.7: hashCode() + 4 shift operations + 5 XOR operations (9 perturbations)
//JDK 1.8: simplified hash function = only 2 perturbations = 1 shift + 1 XOR

In the JDK 1.8 implementation, the hash is computed by XORing the high 16 bits of hashCode() into the low 16 bits: (h = key.hashCode()) ^ (h >>> 16). This is mainly a trade-off between speed, efficiency, and quality: compared with JDK 1.7, JDK 1.8 reduces the number of perturbations, which can be seen as an optimization of the hash algorithm. It ensures that even when the HashMap table is small, the high bits of the hash still participate in the index calculation, at very little extra cost.

Suppose a key's hash value has the following binary form:
hashCode               0000 0000 0011 0011 0111 1010 1000 1011

hashCode>>>16          0000 0000 0000 0000 0000 0000 0011 0011
 ———————————————————————————————————————————————————————————————
XOR (^) result         0000 0000 0011 0011 0111 1010 1011 1000
 &
(table.length-1) = 15  0000 0000 0000 0000 0000 0000 0000 1111
 ———————————————————————————————————————————————————————————————
                       0000 0000 0000 0000 0000 0000 0000 1000   -> decimal 8
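A runnable sketch (the key string is arbitrary) reproducing the JDK 1.8 perturbation and the index calculation for a table of length 16:

public class HashDemo {
    // Same logic as the JDK 1.8 hash method shown above.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        String key = "example";
        int h = hash(key);
        int index = h & (16 - 1); // table length 16
        System.out.println("raw hashCode = " + Integer.toBinaryString(key.hashCode()));
        System.out.println("perturbed    = " + Integer.toBinaryString(h));
        System.out.println("index        = " + index);
    }
}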

put method

Here is a flow chart found online that captures the process very well, so I use it directly. Follow the flow chart together with the source code below and the method becomes easy to understand.

[Figure: put method flow chart]

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        
        //Check whether the hash bucket array is empty.
        if ((tab = table) == null || (n = tab.length) == 0)
        
         //If the bucket array is empty, initialize it; the default size is 16.
        n = (tab = resize()).length;
            
        //If the bucket array is not empty, compute the key's index and check whether that slot is already occupied.
        if ((p = tab[i = (n - 1) & hash]) == null)
        
        //If it is not occupied, wrap the entry in a Node and place it into the bucket array.
            tab[i] = newNode(hash, key, value, null);
        else {
            Node<K,V> e; K k;
            //If a node with the same key already exists at the head, remember it so the old value can be replaced.
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                e = p;
            //Otherwise check whether the head node is a tree node and insert via the tree if so.
            else if (p instanceof TreeNode)
            
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
                
            else {
            //For an ordinary node, walk the list of this bucket and insert the node at the end.
                for (int binCount = 0; ; ++binCount) {
                    
                    if ((e = p.next) == null) {
                    
                        p.next = newNode(hash, key, value, null);
                        
                        //If the list has grown past the threshold, convert it to a red-black tree.
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            treeifyBin(tab, hash);
                        break;
                    }
                    //If a node with the same key already exists in the list, stop so the old value can be replaced.
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        //If the size exceeds the resize threshold, resize.
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }
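From the caller's side, the logic above behaves as in this usage sketch (values are mine): put returns the previous value, and the onlyIfAbsent flag corresponds to the public putIfAbsent method.

import java.util.HashMap;
import java.util.Map;

public class PutDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        System.out.println(map.put("a", 1));         // null  (no previous mapping)
        System.out.println(map.put("a", 2));         // 1     (old value returned and replaced)
        System.out.println(map.putIfAbsent("a", 3)); // 2     (kept: the onlyIfAbsent path)
        System.out.println(map.get("a"));            // 2
    }
}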

get method

Compared with put, the get method is simple, and a pass through the source code is enough to understand it. Without further ado, here is the code.

 public V get(Object key) {
        Node<K,V> e;
        return (e = getNode(hash(key), key)) == null ? null : e.value;
}

final Node<K,V> getNode(int hash, Object key) {

        Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
        //The bucket array is non-empty and the node at the index computed from the key is non-null.
        if ((tab = table) != null && (n = tab.length) > 0 &&
            (first = tab[(n - 1) & hash]) != null) {
            //If the first node in the bucket is the node we are looking for, return it directly.
            if (first.hash == hash && // always check first node
                ((k = first.key) == key || (key != null && key.equals(k))))
                return first;
            
            if ((e = first.next) != null) {
                //If it is a tree node, search via the red-black tree.
                if (first instanceof TreeNode)
                    return ((TreeNode<K,V>)first).getTreeNode(hash, key);
                //Otherwise traverse the linked list of this bucket.
                do {
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        return e;
                } while ((e = e.next) != null);
            }
        }
        return null;
  }
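A short usage sketch (values are mine): because HashMap allows null values, a null from get is ambiguous, and containsKey distinguishes a missing key from a null value.

import java.util.HashMap;
import java.util.Map;

public class GetDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("k", "v");
        map.put("n", null);
        System.out.println(map.get("k"));         // v
        System.out.println(map.get("missing"));   // null (no such key)
        System.out.println(map.get("n"));         // null (key present, value is null)
        System.out.println(map.containsKey("n")); // true -> tells the two cases apart
    }
}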

Expansion mechanism (resize)

final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table;
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        int oldThr = threshold;
        int newCap, newThr = 0;
        //If the old capacity is greater than 0
        if (oldCap > 0) {
            //If the old capacity has already reached the maximum capacity
            if (oldCap >= MAXIMUM_CAPACITY) {
            //set the threshold to 2^31 - 1 and stop growing
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }  
            // Otherwise double the capacity
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                newThr = oldThr << 1; // double threshold
        }
        //Old capacity is 0 but old threshold is greater than 0: the initial capacity was stored in the threshold field, so the new capacity is the old threshold
        else if (oldThr > 0) 
            newCap = oldThr;
        else {
        //Both old capacity and old threshold are 0: no initial capacity was given, so use the default capacity and threshold
            newCap = DEFAULT_INITIAL_CAPACITY;
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        // Compute the new resize threshold
        if (newThr == 0) {
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
       // Store the newly computed threshold, allocate a new table with the new capacity, then move the elements from the old buckets into the new array.
        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        // If the old table is not empty, move every bucket into the new table
        if (oldTab != null) {
            for (int j = 0; j < oldCap; ++j) {
                Node<K,V> e;
                
                if ((e = oldTab[j]) != null) {
                // Clear the old slot so the garbage collector can reclaim it
                    oldTab[j] = null;
                    
                    //Only one node in this bucket: rehash it directly into newTab
                    if (e.next == null)
                        newTab[e.hash & (newCap - 1)] = e;
                    else if (e instanceof TreeNode)
                        //A red-black tree node: redistribute the tree
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    else { // preserve order
                        //Ordinary linked-list nodes: redistribute the list
                        Node<K,V> loHead = null, loTail = null;
                        Node<K,V> hiHead = null, hiTail = null;
                        Node<K,V> next;
                        do {
                            next = e.next;
                    //If (hash & oldCap) == 0, the node keeps the same index in the new table as in the old one
                           if ((e.hash & oldCap) == 0) {
                                if (loTail == null)
                                    loHead = e;
                                else
                                    loTail.next = e;
                                loTail = e;
                            }
                    //Otherwise the node's new index is: old index + oldCap
                            else {
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        if (loTail != null) {
                            loTail.next = null;
                            newTab[j] = loHead;
                        }
                        if (hiTail != null) {
                            hiTail.next = null;
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }

The source code contains the expression (e.hash & oldCap) == 0. How should we understand it? Let's look at the following example.

Suppose the array size before the resize is 16,
and there are two keys:
key1 (hash ^ hash>>>16)  0000 0000 0011 0011 0111 1010 1011 1000
key2 (hash ^ hash>>>16)  0000 0000 0011 0011 0111 1010 1010 1000
          &
    length-1 = 15        0000 0000 0000 0000 0000 0000 0000 1111
——————————————————————————————————————————————————————————————————
                                                  key1: 1000  -> decimal 8
                                                  key2: 1000  -> decimal 8

After the table is resized to 32, the two colliding keys become:
key1 (hash ^ hash>>>16)  0000 0000 0011 0011 0111 1010 1011 1000
key2 (hash ^ hash>>>16)  0000 0000 0011 0011 0111 1010 1010 1000
           &
         length-1 = 31   0000 0000 0000 0000 0000 0000 0001 1111
——————————————————————————————————————————————————————————————————
                                                   key1:  1 1000  -> decimal 24 = 16 + 8
                                                   key2:  0 1000  -> decimal 8

From the above we can see that after a resize, two keys that previously occupied the same position end up either at their original index or at the original index + oldCap. This removes the JDK 1.7 need to recompute the hash: we only have to look at the newly significant bit of the hash value. If it is 0, the index is unchanged; if it is 1, the new index is "original index + oldCap". This also explains, once more, why the capacity of a HashMap must be a power of 2.

This JDK 1.8 design is indeed very clever: it saves the time of recomputing hash values, and at the same time, since the new bit can be considered randomly 0 or 1, the resize process evenly disperses the nodes that previously collided into the new buckets.
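A runnable sketch of the split rule, using the two hash values from the example above (class and variable names are mine):

public class ResizeSplitDemo {
    public static void main(String[] args) {
        int oldCap = 16;
        // Low 16 bits of key1 and key2 from the example above.
        int[] hashes = {0b0111_1010_1011_1000, 0b0111_1010_1010_1000};
        for (int h : hashes) {
            int oldIndex = h & (oldCap - 1);
            // (h & oldCap) tests the newly significant bit of the hash.
            int newIndex = ((h & oldCap) == 0) ? oldIndex : oldIndex + oldCap;
            System.out.println("old index = " + oldIndex + ", new index = " + newIndex);
        }
        // prints: old index = 8, new index = 24  (new bit is 1)
        //         old index = 8, new index = 8   (new bit is 0)
    }
}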

[Figure: hashMap_resize.png — how nodes are redistributed during a resize]

Summary

  1. Resizing is a particularly expensive operation, so when using a HashMap, estimate the size of the map and initialize it with a suitable capacity, to avoid frequent resizes (see the sketch after this list).
  2. HashMap is not thread-safe; do not use HashMap in a concurrent environment. Use ConcurrentHashMap or Collections.synchronizedMap instead.
  3. The load factor can be modified, and can even be greater than 1, but it is recommended not to change it casually.
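A sketch of point 1; the sizing formula expectedSize / 0.75 + 1 is a common rule of thumb rather than something stated in this article:

import java.util.HashMap;
import java.util.Map;

public class PreSizeDemo {
    public static void main(String[] args) {
        int expectedSize = 1000;
        // With the default load factor 0.75, this capacity rounds up to 2048,
        // whose threshold (1536) exceeds expectedSize, so no resize occurs while filling.
        int initialCapacity = (int) (expectedSize / 0.75f) + 1; // 1334
        Map<Integer, String> map = new HashMap<>(initialCapacity);
        for (int i = 0; i < expectedSize; i++) {
            map.put(i, "v" + i);
        }
        System.out.println(map.size()); // 1000
    }
}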

Origin: juejin.im/post/5e1709d9f265da3df61ff260