HashMap detailed explanation and source code analysis


Foreword

In the earlier article on Java containers, some readers felt there was too much content and lost patience partway through. After thinking it over, I decided to split HashMap out and explain it on its own, because HashMap really is that important. This article covers the HashMap material from the previous piece separately.

If you want a broader view of the Map hierarchy, see this article: juejin.cn/post/708452…

HashMap

HashMap is an important implementation class of Map. It is a hash table whose contents are key-value mappings (key => value). HashMap is not thread-safe. It allows null keys and null values (at most one null key), and keys are unique.

Before JDK 1.8, HashMap's underlying data structure was purely array + linked list. Arrays read quickly but insert and delete slowly, while linked lists are the opposite, so HashMap combines the two; with no synchronization locks layered on top, its performance is good. The array is the body of HashMap, and linked lists are introduced to resolve hash collisions. The specific collision-resolution strategy is the "zipper method" (separate chaining), which we discuss below.

HashMap's default array length is 16, and each array slot stores the head node of a linked list. Each resize doubles the capacity: HashMap always uses a power of two as the hash table size, and why it must be a power of two is discussed below. The key's hashCode is processed by the hash() perturbation function to produce a hash value, and the element's index is then computed as (n - 1) & hash (where n is the array length). For a power-of-two n this is equivalent to hash % n, and the bit operation is more efficient than a direct modulo. The hash() perturbation function was reworked in JDK 1.8 compared with earlier versions, but the principle is the same.
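To make the index computation concrete, here is a small sketch (the class and method names are illustrative, not HashMap internals) showing that for a power-of-two table length the bit mask and the modulo agree:

```java
// Sketch: when the table length n is a power of two, (n - 1) & h keeps
// exactly the low bits of h, which equals h % n for non-negative h.
public class IndexDemo {
    static int indexFor(int h, int n) {
        return (n - 1) & h; // same result as h % n when n is a power of two
    }

    public static void main(String[] args) {
        int n = 16; // HashMap's default capacity
        for (int h = 0; h < 1000; h++) {
            if (indexFor(h, n) != h % n) throw new AssertionError();
        }
        System.out.println(indexFor(42, 16)); // prints 10
    }
}
```

Note that the mask also works for negative hash values (it simply keeps the low bits), whereas `%` would return a negative remainder; this is one more reason the JDK uses the mask.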

To illustrate the zipper method more intuitively, I drew a diagram of HashMap's array + linked-list storage structure.

The zipper method

Zipper method: combine an array with linked lists. Each array slot heads a linked list, and the array elements (element 1, element 2, etc. in the figure) store the head nodes. As described above, the key's perturbed hash is masked by the array length to determine the slot index, and the entry is stored at that position. When the hash() perturbation function yields the same slot for different keys, the conflicting entries are chained onto that slot's linked list. That is why it is called the "zipper method".
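The idea can be sketched in a few lines. This is a deliberately minimal separate-chaining table, not HashMap's real internals; the names are made up for illustration:

```java
import java.util.LinkedList;

// Minimal "zipper method" (separate chaining) table: an array of buckets,
// where each bucket is a linked list of entries that collided on the index.
public class ChainedTable {
    static class Entry {
        final Object key;
        Object value;
        Entry(Object k, Object v) { key = k; value = v; }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedTable(int capacity) { // capacity assumed to be a power of two
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = new LinkedList<>();
    }

    public void put(Object key, Object value) {
        LinkedList<Entry> bucket = buckets[(buckets.length - 1) & key.hashCode()];
        for (Entry e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite same key
        }
        bucket.add(new Entry(key, value)); // colliding keys chain up in the list
    }

    public Object get(Object key) {
        for (Entry e : buckets[(buckets.length - 1) & key.hashCode()]) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

A real HashMap adds perturbation of the hash, lazy allocation, resizing, and treeification on top of this skeleton.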

But pure array + linked list has a problem. In the extreme case, the different keys added to the HashMap may all hash to the same bucket, so every entry ends up on one linked list; the list grows long and the time complexity can degrade from O(1) to O(n). Since searching a linked list means traversing from the head, an overly long list costs a lot of performance. So starting with JDK 1.8, HashMap's underlying structure adds the red-black tree, which handles hash collisions better and improves the worst-case lookup from O(n) to O(log n). Concretely: when a list's length exceeds a threshold (8 by default), the list is converted into a red-black tree to keep searches efficient. However, while the array length is below the default of 64, HashMap prefers to resize the array rather than treeify.

Below we begin the HashMap source code analysis.

Source code analysis

All code here is from JDK 1.8.

transient Node<K,V>[] table;

transient Set<Map.Entry<K,V>> entrySet;

From the source we can see that HashMap's underlying structure is built around the bucket array Node<K,V>[] table; colliding entries are chained into linked lists through each Node's next pointer (entrySet is merely a cached view of the entries, not the list structure itself). Before JDK 1.8 the array elements were named Entry; JDK 1.8 introduced the tree structure, and since the elements of both trees and lists are uniformly called nodes, renaming the class to Node is more fitting.

final int hash;
final K key;
V value;
Node<K,V> next;

Node(int hash, K key, V value, Node<K,V> next) {
    this.hash = hash;
    this.key = key;
    this.value = value;
    this.next = next;
}

Stepping into Node, we find that a Node consists of the hash value hash, the key-value pair key and value, and the next node next. The array is divided into buckets (see the zipper-method figure); the hash value determines the key-value pair's address in the array, and pairs that land on the same bucket are stored as a linked list.

/**
 * The default initial capacity - MUST be a power of two.
 */
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

Here we see that HashMap's default initial array capacity is 1 << 4, i.e. 16. Writing it as a shift underlines that the capacity is always a power of two.

/**
 * The bin count threshold for using a tree rather than list for a
 * bin.  Bins are converted to trees when adding an element to a
 * bin with at least this many nodes. The value must be greater
 * than 2 and should be at least 8 to mesh with assumptions in
 * tree removal about conversion back to plain bins upon
 * shrinkage.
 */
static final int TREEIFY_THRESHOLD = 8;

/**
 * The bin count threshold for untreeifying a (split) bin during a
 * resize operation. Should be less than TREEIFY_THRESHOLD, and at
 * most 6 to mesh with shrinkage detection under removal.
 */
static final int UNTREEIFY_THRESHOLD = 6;

In JDK 1.8 we see the static constant TREEIFY_THRESHOLD with the value 8: as described above, when a bucket's list grows beyond 8 nodes, HashMap "treeifies" it into a red-black tree. Conversely, when removals shrink a (split) bin to UNTREEIFY_THRESHOLD (6) nodes or fewer during a resize, the red-black tree is converted back into a linked list.

Next, let's explain how HashMap guarantees that the hash table's capacity is a power of two.

/**
 * Returns a power of two size for the given target capacity.
 */
static final int tableSizeFor(int cap) {
    int n = cap - 1;
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;
    return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
}

Reading on, we reach the tableSizeFor() function. It first subtracts one from the given capacity cap, then repeatedly ORs n with itself shifted right: by 1 bit, then 2, then 4, 8, and finally 16. This "smears" the highest set bit of n into every lower bit position, so after the five steps n is a run of ones below (and including) its original highest bit. Finally it checks: if n < 0 (which happens when cap is 0), the result is 1; otherwise, if n has reached MAXIMUM_CAPACITY, the maximum is used; otherwise the result is n + 1. To show this more clearly, the results are tabulated below.

Power of two    Binary    Decimal
2^0             1         (1-1)+1
2^1             10        (2-1)+1
2^2             100       (4-1)+1
2^3             1000      (8-1)+1
2^4             10000     (16-1)+1
2^5             100000    (32-1)+1

From the table we can clearly see that after this computation the returned value is always a power of two. That explains how the hash table's capacity stays a power of two.
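To see the bit-smearing in action, here is a standalone copy of the same logic (mirroring the JDK 8 source above, with MAXIMUM_CAPACITY inlined so the class compiles on its own):

```java
// Standalone copy of JDK 8's HashMap.tableSizeFor: rounds cap up to the
// next power of two by smearing the highest set bit downward.
public class TableSizeDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    static int tableSizeFor(int cap) {
        int n = cap - 1;           // the -1 keeps exact powers of two unchanged
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;             // n is now all ones below its highest bit
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(10)); // 16
        System.out.println(tableSizeFor(16)); // 16, not 32, thanks to cap - 1
        System.out.println(tableSizeFor(17)); // 32
        System.out.println(tableSizeFor(0));  // 1 (the n < 0 branch)
    }
}
```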

Continuing through the source: HashMap provides four constructors, shown below.


/**
 * Constructor with a specified capacity and load factor
 */
public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}

/**
 * Constructor with a specified capacity
 */
public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

/**
 * Default constructor
 */
public HashMap() {
    this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
}

/**
 * Constructor that copies the mappings of another Map
 *
 * @param   m the map whose mappings are to be placed in this map
 * @throws  NullPointerException if the specified map is null
 */
public HashMap(Map<? extends K, ? extends V> m) {
    this.loadFactor = DEFAULT_LOAD_FACTOR;
    putMapEntries(m, false); 
}

In the constructors we see a float field named loadFactor: HashMap's load factor. What is the load factor? It controls how densely the array is packed with data. The closer the load factor is to 1, the more entries the array holds before resizing (denser) and the higher the probability of hash collisions, i.e. the longer the linked lists grow. Conversely, the closer it is to 0, the fewer entries the array holds (sparser).

What is a good value for loadFactor?

/**
 * The load factor used when none specified in constructor.
 */
static final float DEFAULT_LOAD_FACTOR = 0.75f;

Too large a loadFactor degrades search efficiency because the linked lists grow long; too small a value wastes array space and scatters the data. So after testing, the JDK settled on the reasonable default of 0.75f, and in general we should not change it.

Where does loadFactor take effect?

For example, with the default capacity of 16 and a load factor of 0.75, the resize threshold is 16 * 0.75 = 12. That is, once the number of entries exceeds 12, the table is considered full enough, and the current capacity of 16 is grown via resize().
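The arithmetic is easy to check directly (a trivial sketch of threshold = capacity * loadFactor, matching the constants quoted above):

```java
// Resize-trigger arithmetic: threshold = capacity * loadFactor.
public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;         // DEFAULT_INITIAL_CAPACITY
        float loadFactor = 0.75f;  // DEFAULT_LOAD_FACTOR
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold); // 12: the 13th entry triggers resize()
    }
}
```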

Next we look at the put() method.

public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

We can see that put delegates to putVal. putVal exists only to serve put and is not exposed to users. Let's step into putVal.

/**
 * Implements Map.put and related methods.
 *
 * @param hash hash for key
 * @param key the key
 * @param value the value to put
 * @param onlyIfAbsent if true, don't change existing value
 * @param evict if false, the table is in creation mode.
 * @return previous value, or null if none
 */
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // If table has not been initialized or has length 0, resize.
    // resize() handles both growing and initialization, so we can infer
    // that table is lazily initialized: allocated on the first insert.
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // The (n - 1) & hash operation, as described above.
    // If the bucket is empty, the new node goes straight into the array
    // slot and becomes the head node.
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    // The bucket is not empty
    else {
        Node<K,V> e; K k;
        // Compare the bucket's first element (the node in the array):
        // if both its hash and key match, record it: e = p
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // Otherwise, check whether this is a red-black tree node
        else if (p instanceof TreeNode)
            // Yes: store the node in the red-black tree
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        // Otherwise it is a linked list
        else {
            // Walk the list and insert the node at the end
            for (int binCount = 0; ; ++binCount) {
                // Reached the last element
                if ((e = p.next) == null) {
                    // Append the new node at the tail
                    p.next = newNode(hash, key, value, null);
                    // As described above, if the list exceeds the default
                    // treeify threshold of 8, convert it to a red-black tree.
                    // Also as above, treeifyBin only converts when the array
                    // length is at least 64; otherwise it resizes instead.
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    // Done: break out of the loop
                    break;
                }
                // Check whether a node in the list already has the key being
                // inserted; if so break out, otherwise advance: p = e
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        if (e != null) { // the same key already exists
            // Remember the existing old value
            V oldValue = e.value;
            // Replace it with the new value if onlyIfAbsent is false
            // or the old value is null
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            // Post-access callback, mainly for LinkedHashMap
            afterNodeAccess(e);
            return oldValue;
        }
    }
    // Modification count + 1
    ++modCount;
    // Resize if the new size exceeds the threshold
    if (++size > threshold)
        resize();
    // Post-insertion callback, mainly for LinkedHashMap
    afterNodeInsertion(evict);
    return null;
}

Please read the code alongside the comments. I summarize the put logic in the following steps:

  1. If the HashMap has not been initialized, call resize() to initialize it.
  2. Hash the element's key, then compute its array index.
  3. If there is no hash collision, place the element directly in the bucket.
  4. If there is a hash collision, chain the element onto the bucket's linked list.
  5. If the list length exceeds the default threshold of 8 (and the array length has reached 64), convert the list into a red-black tree.
  6. If the node count later falls to the threshold of 6 or below (checked during resize), convert the red-black tree back into a list.
  7. If the key already exists, replace the old value.
  8. If the size exceeds the threshold (capacity 16 * load factor 0.75 = 12), call resize() to grow the table.
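The user-visible side of this logic is easy to observe. A quick demo of put's contract (return value and null-key handling), matching the putVal code above:

```java
import java.util.HashMap;
import java.util.Map;

// Observable behavior of put(): one null key is allowed, and put returns
// the previous value when an existing key is replaced.
public class PutDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        System.out.println(map.put("a", 1)); // null: no previous mapping
        System.out.println(map.put("a", 2)); // 1: old value returned, then replaced
        map.put(null, 0);                    // the null key is hashed to 0
        System.out.println(map.get(null));   // 0
    }
}
```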

Problems caused by HashMap resizing:

  • In a multithreaded environment, resizing is subject to race conditions; in older JDKs (notably 1.7's head insertion) this could even link nodes into a cycle and cause an infinite loop.
  • resize() redistributes hashes and traverses every element in the table, so it is an expensive operation. In real development, try to avoid triggering resize.
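Two common mitigations follow from these problems (a sketch of options, not an exhaustive list; the presize helper here is illustrative): pre-size the map so no resize ever fires, and use ConcurrentHashMap when multiple threads write.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ResizeMitigation {
    // Capacity request that keeps `expected` entries under the 0.75 threshold,
    // so no resize happens while filling the map.
    static int presize(int expected) {
        return (int) (expected / 0.75f) + 1;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<>(presize(1000)); // no resize for 1000 puts
        Map<Integer, Integer> safe = new ConcurrentHashMap<>();   // thread-safe alternative
        for (int i = 0; i < 1000; i++) {
            map.put(i, i);
            safe.put(i, i);
        }
        System.out.println(map.size() + " " + safe.size()); // 1000 1000
    }
}
```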

Let's continue with the resize source.

/**
 * Initializes or doubles table size.  If null, allocates in
 * accord with initial capacity target held in field threshold.
 * Otherwise, because we are using power-of-two expansion, the
 * elements from each bin must either stay at same index, or move
 * with a power of two offset in the new table.
 *
 * @return the table
 */
final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        if (oldCap >= MAXIMUM_CAPACITY) {
            // Past the maximum capacity: pin the threshold at Integer.MAX_VALUE and stop growing.
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        // Double the capacity
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    // Old threshold > 0: the initial capacity was stored in threshold, use it directly as the new capacity
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {               // zero initial threshold signifies using defaults
        // Otherwise use the default initial capacity: 16
        newCap = DEFAULT_INITIAL_CAPACITY;
        // New threshold = load factor * default initial capacity, i.e. 0.75 * 16 = 12
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    // Compute the new resize threshold
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        // Move each bucket's elements into the new table
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}
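The lo/hi lists in the loop above implement the power-of-two rehash rule from the Javadoc: with newCap = 2 * oldCap, a node either stays at its old index j or moves to j + oldCap, decided by the single bit (hash & oldCap). A small sketch of that rule (the class is illustrative):

```java
// Power-of-two rehash rule used by resize(): the bit (hash & oldCap)
// decides whether a node stays at index j or moves to j + oldCap.
public class SplitDemo {
    static int newIndex(int hash, int oldCap, int j) {
        return (hash & oldCap) == 0 ? j : j + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 16;
        int h1 = 5;  // 5 & 16 == 0  -> stays at its old index (lo list)
        int h2 = 21; // 21 & 16 != 0 -> moves up by oldCap (hi list)
        System.out.println(newIndex(h1, oldCap, h1 & (oldCap - 1))); // 5
        System.out.println(newIndex(h2, oldCap, h2 & (oldCap - 1))); // 21
    }
}
```

Both results agree with recomputing `hash & (newCap - 1)` from scratch, which is why resize never has to rehash keys: it only inspects one extra bit.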

Having analyzed put and the resize method, let's move on to get.

public V get(Object key) {
    // Declare a Node
    Node<K,V> e;
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        // The node in the array itself matches
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        // The bucket holds more than one node
        if ((e = first.next) != null) {
            // Check whether it is a TreeNode
            if (first instanceof TreeNode)
                // Look the node up through the tree
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            do {
                // Otherwise traverse the linked list
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    // Return null if nothing was found
    return null;
}
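One consequence of the code above is worth a quick demonstration: because HashMap allows null values, get() returning null is ambiguous, and containsKey() is needed to tell "absent" from "mapped to null".

```java
import java.util.HashMap;
import java.util.Map;

// get() returns null both for a missing key and for a key mapped to null;
// containsKey() distinguishes the two cases.
public class GetDemo {
    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("k", null);
        System.out.println(map.get("k"));         // null: key present, value is null
        System.out.println(map.get("missing"));   // null: key absent
        System.out.println(map.containsKey("k")); // true
        System.out.println(map.containsKey("missing")); // false
    }
}
```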

Summary

Type: HashMap
Data structure: < JDK 1.8: array + linked list; >= JDK 1.8: array + linked list + red-black tree
Thread-safe: No
null support: allows null keys and null values
How thread safety is achieved: none

Origin juejin.im/post/7086737105262477348