HashMap 学习

部分源码解析

变量和常量

// 默认的初始化容量，十进制数为 16
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

// 极限容量，也就是说 table 数组再大也不能超过这个容量
static final int MAXIMUM_CAPACITY = 1 << 30;

// 默认的负载因子：0.75（关系到 HashMap 本身的性能）
static final float DEFAULT_LOAD_FACTOR = 0.75f;

// 一个空的 Entry 数组，在 table 数组没有进行初始化扩容之前一直就是这个空数组
static final Entry<?,?>[] EMPTY_TABLE = {};

// HashMap 的存储结构就是这个 table 数组，长度为 2 的 n 次方数;
// 数组中存储的元素是一个个 Entry 对象，而 Entry 对象又实现了链表结构;
// 所以说 HashMap 的数据结构是 数组 + 链表，具体请看【存储结构】章节
transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

// HashMap 存储的元素多少
transient int size;

// 阀值，一般来说为 实际容量 * 负载因子
int threshold;

// 实际的负载因子
final float loadFactor;

// 修改次数，用于快速失败，具体请看【快速失败】章节
transient int modCount;

构造方法

public HashMap(int initialCapacity, float loadFactor) {
    if (initialCapacity < 0)
        throw new IllegalArgumentException("Illegal initial capacity: " +
                                           initialCapacity);
    if (initialCapacity > MAXIMUM_CAPACITY)
        initialCapacity = MAXIMUM_CAPACITY;
    if (loadFactor <= 0 || Float.isNaN(loadFactor))
        throw new IllegalArgumentException("Illegal load factor: " +
                                           loadFactor);

    this.loadFactor = loadFactor;
    threshold = initialCapacity;
    init();// 此方法是空的
}

public HashMap(int initialCapacity) {
    this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

public HashMap() {
    this(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR);
}

public HashMap(Map<? extends K, ? extends V> m) {
    this(Math.max((int) (m.size() / DEFAULT_LOAD_FACTOR) + 1,
                  DEFAULT_INITIAL_CAPACITY), DEFAULT_LOAD_FACTOR);
    inflateTable(threshold);

    putAllForCreate(m);
}

(1). 前三个构造方法，无非就是设置一下初始化容量和负载因子，不多说。

(2). 而第四个构造方法：

设置初始化容量和负载因子；
inflateTable(...) 初始化数组并扩充容量；
将 map 对象塞进空的 HashMap 中。

inflateTable(...)

// 对 table 数组进行扩容
private void inflateTable(int toSize) {
    // Find a power of 2 >= toSize
    int capacity = roundUpToPowerOf2(toSize);

    threshold = (int) Math.min(capacity * loadFactor, MAXIMUM_CAPACITY + 1);
    table = new Entry[capacity];
    initHashSeedAsNeeded(capacity);
}

private static int roundUpToPowerOf2(int number) {
    // assert number >= 0 : "number must be non-negative";
    return number >= MAXIMUM_CAPACITY
            ? MAXIMUM_CAPACITY
            : (number > 1) ? Integer.highestOneBit((number - 1) << 1) : 1;
}

(1). 一直往前推，很容易得知传进来的 toSize 参数就是初始化容量；

(2). roundUpToPowerOf2(toSize) 方法得到返回值是一个比 toSize 大的最小的 2 的 n 次方数 capacity，capacity 就是实际容量；

若 toSize 为 15，则 capacity 为 16；若 toSize 为 17，则 capacity 为 32。因此可以看出：table 数组的长度一定是 2 的 n 次方数。

(3). 计算得出阀值 = 实际容量 * 负载因子（capacity * loadFactor）。

(4). new 出一个真实容量为 capacity 的 Entry 数组。

hash 函数

final int hash(Object k) {
    int h = hashSeed;
    if (0 != h && k instanceof String) {
        return sun.misc.Hashing.stringHash32((String) k);
    }

    h ^= k.hashCode();

    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

首先我们要明确一点，hash 算法有什么目的，说简单了就是为了快。

hash 算法是为了减少查找过程中的比较次数；
查找的最理想情况就是一次找到，当然这种情况往往是通过牺牲存储空间实现的，实际应用之中并不可取，但是可以看出 hash 算法在查找方面的高效性。

通常情况下，有 hash 函数的地方也能见到 indexFor 函数的身影。

static int indexFor(int h, int length) {
    // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
    return h & (length-1);
}

h & (length -1) 得到这个元素应该存储在数组的哪个索引处。

h & (length -1) 得到值永远在数组的索引值范围内，不会出现数组下标越界的情况，为什么呢 ?

因为数组的长度是 2 的 n 次方数，所以 length -1 位最大的 n 位二进制数，即所有位都是 1：

当 h > length -1 时，h 的高位忽略，得到的数最大也就是 length -1；
当 h <= length -1 时，若有高位则忽略所有高位，得到的数也不可能超过 length -1。

hash 函数并不是简简单单的求下哈希码，而是在这个基础上又做了多次位操作，这样做有何目的呢 ?

如果 hashCode 值都大于 length，而且这些 hashCode 的低位变化不大，就会出现很多冲突。举个例子：

假设数组的初始化容量为 16（10000），则 length -1 位 15（1111）；
有几个对象的 hashCode 分别为 1 1001 0010、1 1101 0010、11 1011 0010，进行位与运算的结果是一致的（都为 0010），即发生了冲突；
而对 hashCode 进行位操作得到 1 1000 1000、1 1100 1100、11 1000 1110，进行位与运算分别得到 1000、1100、1110，减少了冲突。

因此，发生冲突的原因是 16 限制了只能用低位来计算，高位直接被舍弃无法参与计算，所以需要进行位操作使高位对低位造成影响改变低位的值，从而变相地使高位也参与运算。

put 操作（包含 update）

先看一下单个元素的 put 操作：

public V put(K key, V value) {
    if (table == EMPTY_TABLE) {
        inflateTable(threshold);
    }
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key);
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }

    modCount++;
    addEntry(hash, key, value, i);
    return null;
}

(1). 如果数组为空，先对其进行初始化并扩容；

(2). 如果 key==null，使用 putForNullKey 操作（下面有代码）：

遍历 table[0] 位置的链表，若遇到 key==null 的节点，就替换 value；若遇不到，就将这个元素插入链表的链首。

由此可见，如果有key==null 的元素肯定存储在table[0] 位置的链表中。

private V putForNullKey(V value) {
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(0, null, value, 0);
    return null;
}

(3). 如果 key不为 null，hash 之后进行定位得到数组下标 x，对 table[x] 遍历所有进行比较，如果找得到就进行替换，找不到就插入链首

再看下 putAll 操作：

public void putAll(Map<? extends K, ? extends V> m) {
    int numKeysToBeAdded = m.size();
    if (numKeysToBeAdded == 0)
        return;

    if (table == EMPTY_TABLE) {
        inflateTable((int) Math.max(numKeysToBeAdded * loadFactor, threshold));
    }

    if (numKeysToBeAdded > threshold) {
        int targetCapacity = (int)(numKeysToBeAdded / loadFactor + 1);
        if (targetCapacity > MAXIMUM_CAPACITY)
            targetCapacity = MAXIMUM_CAPACITY;
        int newCapacity = table.length;
        while (newCapacity < targetCapacity)
            newCapacity <<= 1;
        if (newCapacity > table.length)
            resize(newCapacity);
    }

    for (Map.Entry<? extends K, ? extends V> e : m.entrySet())
        put(e.getKey(), e.getValue());
}

(1). 若 map 长度为0，到此为止；

(2). 若 table 为空，进行数组初始化并扩容；

(3). 若 map 的元素个数大于阀值，进行扩容，扩充后的容量和目标容量进行比较，若还是如此继续扩容，直到容量满足条件为止（条件就是数组容量不小于目标容量），扩容为向左移一位变为两倍。

(4). 剩下的就是遍历 map 一个一个元素 put，不多说。

resize 扩容

void resize(int newCapacity) {
    Entry[] oldTable = table;
    int oldCapacity = oldTable.length;
    if (oldCapacity == MAXIMUM_CAPACITY) {
        threshold = Integer.MAX_VALUE;
        return;
    }

    Entry[] newTable = new Entry[newCapacity];
    transfer(newTable, initHashSeedAsNeeded(newCapacity));
    table = newTable;
    threshold = (int)Math.min(newCapacity * loadFactor, MAXIMUM_CAPACITY + 1);
}

/**
 * Transfers all entries from current table to newTable.
 */
void transfer(Entry[] newTable, boolean rehash) {
    int newCapacity = newTable.length;
    for (Entry<K,V> e : table) {
        while(null != e) {
            Entry<K,V> next = e.next;
            if (rehash) {
                e.hash = null == e.key ? 0 : hash(e.key);
            }
            int i = indexFor(e.hash, newCapacity);
            e.next = newTable[i];
            newTable[i] = e;
            e = next;
        }
    }
}

resize 会改变 HashMap 的存储结构，对元素进行重新映射，是非常耗性能的操作。所以最好一开始就能确定新容量的取值（这一点在 putAll 中有很好的体现）

疑问：HashMap 使用数组 + 链表的存储结构，理论上可以无限存储数据，为什么要对数组进行扩容呢？

进行扩容能够优化存储结构，提高 HashMap 的性能。

如果不进行扩容，链表长度就会慢慢变得很长，查找的时候很大概率遍历到链表深处，就失去的 hash 查找的优势。

HashMap.Entry

先上代码。

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    int hash;

    /**
     * 构造方法，创建一个 Entry 节点
     */
    Entry(int h, K k, V v, Entry<K,V> n) {
        // key，value，next 指针，hash 码
        value = v;
        next = n;
        key = k;
        hash = h;
    }

    public final K getKey() {
        return key;
    }

    public final V getValue() {
        return value;
    }

    // 设置新的 value，返回旧的 value
    public final V setValue(V newValue) {
        V oldValue = value;
        value = newValue;
        return oldValue;
    }

    // 判断两个节点是否相等
    public final boolean equals(Object o) {
        if (!(o instanceof Map.Entry))
            return false;
        Map.Entry e = (Map.Entry)o;
        Object k1 = getKey();
        Object k2 = e.getKey();
        if (k1 == k2 || (k1 != null && k1.equals(k2))) {
            Object v1 = getValue();
            Object v2 = e.getValue();
            if (v1 == v2 || (v1 != null && v1.equals(v2)))
                return true;
        }
        return false;
    }

    public final int hashCode() {
        return Objects.hashCode(getKey()) ^ Objects.hashCode(getValue());
    }

    public final String toString() {
        return getKey() + "=" + getValue();
    }

    /**
     * This method is invoked whenever the value in an entry is
     * overwritten by an invocation of put(k,v) for a key k that's already
     * in the HashMap.
     */
    void recordAccess(HashMap<K,V> m) {
    }

    /**
     * This method is invoked whenever the entry is
     * removed from the table.
     */
    void recordRemoval(HashMap<K,V> m) {
    }
}

Entry 实现了单向链表的功能。

所以，HashMap 的数据结构就是数组 + 链表：元素本身是存储在 Entry 中的，这些 Entry 构成许多条链表，而所有链表的链首构成了 table 数组。数组的下标就是 indexFor 计算出的索引，但是不同元素会出现计算出索引相同的情况，这就是冲突了，解决这个冲突的办法是具有相同索引的元素会以单向链表的方式存在。

找到一个网站有助于理解 HashMap 的结构以及增删改查操作：

https://www.cs.usfca.edu/~galles/visualization/OpenHash.html

get操作

public V get(Object key) {
    // 若key == null，就会调 getForNullKey()，得到并且返回 value，否则返回 null。
    // 若key != null，就会寻找该 Entry，得到并且返回 value，否则返回 null。
    if (key == null)
        return getForNullKey();
    Entry<K,V> entry = getEntry(key);

    return null == entry ? null : entry.getValue();
}

// 1. 数组长度为0，找不到，返回 null。
// 2. 遍历数组和链表寻找，找到返回 value，否则返回 null。
private V getForNullKey() {
    if (size == 0) {
        return null;
    }
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null)
            return e.value;
    }
    return null;
}

// 根据 key 值寻找 Entry
// 1. 数组长度为0，找不到，返回 null。
// 2. 根据 key 值获得 hash 码，再根据 hash 码得到应该存储在数组的哪个位置。
// 3. 在数组的该位置从链首开始遍历寻找，找到返回 value，否则就返回 null。
final Entry<K,V> getEntry(Object key) {
    if (size == 0) {
        return null;
    }

    int hash = (key == null) ? 0 : hash(key);
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}

remove 操作

public V remove(Object key) {
    Entry<K,V> e = removeEntryForKey(key);
    return (e == null ? null : e.value);
}

// 如果元素个数为0，直接返回 null
// 元素个数不为0，根据 key 获得 hash码，根据 hash 码定位数组下标。
// 得到数组在该位置的 Entry，遍历如果找到这个 key，就删除并且返回该元素，找不到就返回 null。
final Entry<K,V> removeEntryForKey(Object key) {
    if (size == 0) {
        return null;
    }
    int hash = (key == null) ? 0 : hash(key);
    int i = indexFor(hash, table.length);
    Entry<K,V> prev = table[i];
    Entry<K,V> e = prev;

    while (e != null) {
        Entry<K,V> next = e.next;
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k)))) {
            modCount++;
            size--;
            if (prev == e)
                table[i] = next;
            else
                prev.next = next;
            e.recordRemoval(this);
            return e;
        }
        prev = e;
        e = next;
    }

    return e;
}

存储结构

上面部分也有提到，这里再补充一张图吧。

快速失败

与 ArrayList 等集合的快速失败原理一样。

先举个例子：

public class Test {

    HashMap<Integer, Integer> map = new HashMap<>();
    
    public static void main(String[] args) {
        Test test = new Test();
        test.putAndPrint();
        test.remove_1();
    }

    public void remove_2() {
        try {
            // 暂停10秒。目的是保证迭代器正在迭代过程中。
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        Iterator<Integer> iterator = map.keySet().iterator();
        iterator.remove();
    }
    
    public void remove_1() {
        try {
            // 暂停10秒。目的是保证迭代器正在迭代过程中。
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        map.remove(1000);
    }

    public void putAndPrint() {
        T1 t1 = new T1();
        Thread thread = new Thread(t1);
        thread.start();
    }

    class T1 implements Runnable {
        @Override
        public void run() {
            for (int i = 0; i < 1000000; i++) {
                map.put(i, i);
            }

            Iterator<Integer> iterator = map.keySet().iterator();
            for (;iterator.hasNext();) {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                Integer integer = (Integer) iterator.next();
                System.out.println(integer);
            }
        }
    }
}

输出为：

我们可以分析一下，另起了一个线程向 map 中 put 1000000个元素（这个时间可以忽略），生成迭代器，每隔大约 0.1 秒输出一个元素。

同时，主线程在 sleep 一秒钟后对 map 进行了 remove 操作，此时迭代器正在迭代过程中。

map.remove 中有操作 modCount++，因此迭代器在 nextEntry 方法中发现 modCount != expectedModCount 为 false 抛出异常。

同理若把 main 方法中的 test.remove_1()，改为 test.remove_2()，也会抛出这个异常，因为 HashIterator 的 remove 方法同样调的 map 的 remove 方法。