Understanding HashMap

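HashMap is Java's implementation of the hash table. It is the data type Java programmers reach for most often when handling mappings (key-value pairs), and it is also an interviewer favorite. Understanding HashMap matters.

Reading any number of articles is no substitute for walking through the code once.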

Fields and constants

Let's start with the more important fields of HashMap:

/**
* The default initial capacity - MUST be a power of two.
* Note: the default initial capacity is 16, and the value must always be a power of two.
*/
static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16

/**
* The maximum capacity, used if a higher value is implicitly specified
* by either of the constructors with arguments.
* MUST be a power of two <= 1<<30.
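* Note: the maximum capacity. If the constructor HashMap(int initialCapacity,
* float loadFactor) is given a larger value, it is clamped to this one.
* The value must be a power of two and <= 1<<30.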
*/
static final int MAXIMUM_CAPACITY = 1 << 30;

/**
* The load factor used when none specified in constructor.
* Note: the default load factor is 0.75 (3/4).
*/
static final float DEFAULT_LOAD_FACTOR = 0.75f;

/**
* The bin count threshold for using a tree rather than list for a
* bin.  Bins are converted to trees when adding an element to a
* bin with at least this many nodes. The value must be greater
* than 2 and should be at least 8 to mesh with assumptions in
* tree removal about conversion back to plain bins upon
* shrinkage.
* Note: the threshold at which a linked-list bin is promoted to a tree.
*/
static final int TREEIFY_THRESHOLD = 8;

/**
* The bin count threshold for untreeifying a (split) bin during a
* resize operation. Should be less than TREEIFY_THRESHOLD, and at
* most 6 to mesh with shrinkage detection under removal.
* Note: the threshold at which a tree bin degrades back to a linked list.
*/
static final int UNTREEIFY_THRESHOLD = 6;

/**
* The smallest table capacity for which bins may be treeified.
* (Otherwise the table is resized if too many nodes in a bin.)
* Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
* between resizing and treeification thresholds.
* Note: the minimum table capacity required before a bin may be treeified.
* Below this capacity the HashMap is resized instead. The value should be
* at least 4 * TREEIFY_THRESHOLD; it settles whether to resize or treeify.
*/
static final int MIN_TREEIFY_CAPACITY = 64;

/**
* The number of times this HashMap has been structurally modified
* Structural modifications are those that change the number of mappings in
* the HashMap or otherwise modify its internal structure (e.g.,
* rehash).  This field is used to make iterators on Collection-views of
* the HashMap fail-fast.  (See ConcurrentModificationException).
* Note: the counter behind Java's fail-fast behavior.
*/
transient int modCount;

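To make sense of these parameters, start with how HashMap lays out its data:

The length of the array is the number of buckets in the hash table. It defaults to 16 and is the capacity referred to above.

The total number of nodes across all lists and trees is the number of entries the HashMap actually stores.

The capacity multiplied by the load factor is the threshold at which the HashMap resizes.

The initial capacity and the load factor can both be tuned through constructor arguments.

Ideally every array slot would hold exactly one node, which would be optimal in both space and time. In practice that is impossible, so when hash(key) values collide, the colliding entries are chained in a linked list.

TREEIFY_THRESHOLD, UNTREEIFY_THRESHOLD and MIN_TREEIFY_CAPACITY determine when HashMap stores a bin's entries as a linked list and when as a tree.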
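The relationships between these constants can be sketched in a few lines. This is a minimal illustration, not JDK code: the class ThresholdSketch and its method names are hypothetical, only the constant values are taken from the listing above, and the treeify-or-resize rule is a paraphrase of the constants' comments.

```java
// Hypothetical helper for illustration; not part of java.util.HashMap.
final class ThresholdSketch {
    static final int TREEIFY_THRESHOLD = 8;
    static final int MIN_TREEIFY_CAPACITY = 64;

    /** Resize threshold: the capacity multiplied by the load factor. */
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    /**
     * What happens when a node is added to a list bin that already holds
     * nodesAlreadyInBin nodes, in a table with the given capacity.
     */
    static String onAddToBin(int nodesAlreadyInBin, int capacity) {
        if (nodesAlreadyInBin < TREEIFY_THRESHOLD) {
            return "stay a list";
        }
        // Small tables are resized instead of treeified.
        return capacity < MIN_TREEIFY_CAPACITY ? "resize table" : "treeify bin";
    }

    public static void main(String[] args) {
        System.out.println(threshold(16, 0.75f)); // 12
        System.out.println(onAddToBin(8, 16));    // resize table
        System.out.println(onAddToBin(8, 64));    // treeify bin
    }
}
```

With the defaults, a table of 16 buckets resizes once it holds more than 12 entries, long before any single bin is likely to reach 8 nodes.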

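Interview favorites

  1. 16 is a power of two, but so are 8 and 32. Why 16?

    There is no special reason. It is an empirical value: the authors judged 16 to be an initial capacity that suits common usage.

  2. Why is TREEIFY_THRESHOLD set to 8 as the point where a list is promoted to a tree?

    Let the comment in the Java source answer this one: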

    Because TreeNodes are about twice the size of regular nodes, we
    use them only when bins contain enough nodes to warrant use
    (see TREEIFY_THRESHOLD). And when they become too small (due to
    removal or resizing) they are converted back to plain bins.  In
    usages with well-distributed user hashCodes, tree bins are
    rarely used.  Ideally, under random hashCodes, the frequency of
    nodes in bins follows a Poisson distribution
    (http://en.wikipedia.org/wiki/Poisson_distribution) with a
    parameter of about 0.5 on average for the default resizing
    threshold of 0.75, although with a large variance because of
    resizing granularity. Ignoring variance, the expected
    occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
    factorial(k)). The first values are:
       0:    0.60653066
       1:    0.30326533
       2:    0.07581633
       3:    0.01263606
       4:    0.00157952
       5:    0.00015795
       6:    0.00001316
       7:    0.00000094
       8:    0.00000006
       more: less than 1 in ten million
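    In short: a TreeNode is about twice the size of a regular node, so a bin is converted to a tree only once it holds TREEIFY_THRESHOLD nodes, and it is converted back to a plain bin when it becomes too small (through removal or resizing). With well-distributed user hashCodes, trees are rarely used. Ideally, under random hashCodes, the number of nodes per bin follows a Poisson distribution whose parameter averages about 0.5 at the default load factor of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected frequency of a list of size k is exp(-0.5) * pow(0.5, k) / factorial(k). For k = 8 that probability is about 0.00000006, that is, 6 in 100 million.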
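The table in the quoted comment can be reproduced directly from the formula. The sketch below only evaluates exp(-0.5) * pow(0.5, k) / factorial(k); the class name PoissonBinSizes is made up for this example.

```java
// Illustrative only: recomputes the Poisson table from the HashMap source comment.
final class PoissonBinSizes {
    /** Probability that a bin holds exactly k nodes under Poisson(lambda). */
    static double probability(double lambda, int k) {
        double factorial = 1.0;
        for (int i = 2; i <= k; i++) {
            factorial *= i;
        }
        return Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
    }

    public static void main(String[] args) {
        // lambda = 0.5 is the parameter the comment gives for load factor 0.75
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, probability(0.5, k));
        }
    }
}
```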

Constructors

  1. With no special requirements, we usually use
/**
* Constructs an empty <tt>HashMap</tt> with the default initial capacity
* (16) and the default load factor (0.75).
*/
public HashMap() {
  this.loadFactor = DEFAULT_LOAD_FACTOR; // all other fields defaulted
}

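This constructs a HashMap with the default capacity and the default load factor. Note that the table itself has not been allocated at this point.

  2. When the expected number of entries matters, we use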
/**
* Constructs an empty <tt>HashMap</tt> with the specified initial
* capacity and the default load factor (0.75).
*
* @param  initialCapacity the initial capacity.
* @throws IllegalArgumentException if the initial capacity is negative.
*/
public HashMap(int initialCapacity) {
  this(initialCapacity, DEFAULT_LOAD_FACTOR);
}

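This constructs a HashMap with the given capacity and the default load factor. Specifying the capacity avoids the performance cost of repeated resizing, which is discussed below.

  3. For full customization, we can use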
/**
* Constructs an empty <tt>HashMap</tt> with the specified initial
* capacity and load factor.
*
* @param  initialCapacity the initial capacity
* @param  loadFactor      the load factor
* @throws IllegalArgumentException if the initial capacity is negative
*         or the load factor is nonpositive
*/
public HashMap(int initialCapacity, float loadFactor) {
  if (initialCapacity < 0)
    throw new IllegalArgumentException("Illegal initial capacity: " +
                                       initialCapacity);
  if (initialCapacity > MAXIMUM_CAPACITY)
    initialCapacity = MAXIMUM_CAPACITY;
  if (loadFactor <= 0 || Float.isNaN(loadFactor))
    throw new IllegalArgumentException("Illegal load factor: " +
                                       loadFactor);
  this.loadFactor = loadFactor;
  this.threshold = tableSizeFor(initialCapacity);
}

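This constructs a HashMap with the given capacity and the given load factor. Pay particular attention to the following when using it (quoted from the class comment):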

As a general rule, the default load factor (.75) offers a good
tradeoff between time and space costs.  Higher values decrease the
space overhead but increase the lookup cost (reflected in most of
the operations of the <tt>HashMap</tt> class, including
<tt>get</tt> and <tt>put</tt>). The expected number of entries in
the map and its load factor should be taken into account when
setting its initial capacity, so as to minimize the number of
rehash operations.  If the initial capacity is greater than the
maximum number of entries divided by the load factor, no rehash
operations will ever occur.
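In short: the default load factor of 0.75 is a good tradeoff between time and space. A higher load factor saves space but makes lookups more expensive, which affects most HashMap operations, including get and put. Choose the initial capacity with the expected number of entries and the load factor in mind, so that rehashing is kept to a minimum. If the initial capacity is greater than the expected number of entries divided by the load factor, no rehash will ever occur.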
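The sizing rule from the class comment can be turned into a small helper. The sketch below is an assumption-laden illustration: PresizeSketch, capacityFor and roundUpToPowerOfTwo are hypothetical names, and roundUpToPowerOfTwo only mimics the effect of the tableSizeFor call seen in the constructor (rounding up to a power of two); it is not the JDK implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the JDK.
final class PresizeSketch {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /** Smallest power of two >= cap, clamped to MAXIMUM_CAPACITY. */
    static int roundUpToPowerOfTwo(int cap) {
        if (cap <= 1) {
            return 1;
        }
        if (cap >= MAXIMUM_CAPACITY) {
            return MAXIMUM_CAPACITY;
        }
        return Integer.highestOneBit(cap - 1) << 1;
    }

    /** Initial capacity at which expectedEntries entries fit without a resize. */
    static int capacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    public static void main(String[] args) {
        int initial = capacityFor(100, 0.75f);            // 134
        System.out.println(initial);
        System.out.println(roundUpToPowerOfTwo(initial)); // 256
        // Presized map: inserting 100 entries should not trigger a rehash.
        Map<Integer, String> map = new HashMap<>(initial);
        for (int i = 0; i < 100; i++) {
            map.put(i, "v" + i);
        }
        System.out.println(map.size());                   // 100
    }
}
```

Passing the raw expected count (100) instead of 134 would give a table of 128 buckets with a threshold of 96, so the map would still resize once on the way to 100 entries.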

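The hashing process

A hash table relies on the fact that arrays support random access by index to keep lookups efficient. The method that turns a key into an array index is called the hash function, and the value the hash function computes is called the hash value.

Let's look at the hash function in HashMap first: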

/**
* Computes key.hashCode() and spreads (XORs) higher bits of hash
* to lower.  Because the table uses power-of-two masking, sets of
* hashes that vary only in bits above the current mask will
* always collide. (Among known examples are sets of Float keys
* holding consecutive whole numbers in small tables.)  So we
* apply a transform that spreads the impact of higher bits
* downward. There is a tradeoff between speed, utility, and
* quality of bit-spreading. Because many common sets of hashes
* are already reasonably distributed (so don't benefit from
* spreading), and because we use trees to handle large sets of
* collisions in bins, we just XOR some shifted bits in the
* cheapest possible way to reduce systematic lossage, as well as
* to incorporate impact of the highest bits that would otherwise
* never be used in index calculations because of table bounds.
*/
static final int hash(Object key) {
  int h;
  // the key may be null
  return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

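From the comment we can conclude:

  1. The high 16 bits are XORed into the low 16 bits, which carries information from the high bits into the index and reduces collisions.

When the hashCodes of two keys differ only in their high bits, they collide easily. This follows from the way HashMap computes the bucket index.

  2. Because common hashCodes are already reasonably well distributed, and because trees keep lookups efficient when collisions are severe, a deliberately cheap hash computation is used.

The index computation tab[i = (n - 1) & hash] takes the hash modulo the table length; the bitwise AND is clearly used here for speed.

When length = 2^n, X % length = X & (length - 1) for any non-negative X.

Consider the actual hash step first, in which the high bits are XORed into the low bits.

Then consider the AND step of the index computation.

When the HashMap is small, for example the default 16, only the lowest few bits take part in the AND. Hashes that differ only above those bits collide: arithmetic sequences with a common difference that is a multiple of 16, such as (2, 18, 34) or (6, 22, 38), share the same low four bits and land in the same bucket. When the variation between keys is concentrated in the high bits, the same thing happens, which is why hash() XORs the high bits into the low bits to mix the two halves.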
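The effect of the two steps can be checked with a short sketch. spread() repeats the transform from hash() shown above and indexFor() repeats the (n - 1) & hash expression; the class name HashSpreadSketch and the sample hash codes are chosen for this example only.

```java
// Illustrative only: shows how XORing the high bits downward changes the bucket index.
final class HashSpreadSketch {
    /** Same transform as hash() above, applied to a raw hashCode. */
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    /** Bucket index for a table whose length n is a power of two. */
    static int indexFor(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        // These hash codes differ only in bits above the mask of a 16-slot table.
        int[] hashCodes = {0x10000, 0x20000, 0x30000};
        for (int h : hashCodes) {
            // Without spreading every key lands in bucket 0; with spreading they separate.
            System.out.println(indexFor(h, n) + " -> " + indexFor(spread(h), n));
        }
        // 2, 18 and 34 share their low four bits, so all three map to bucket 2.
        System.out.println(indexFor(2, n) + " " + indexFor(18, n) + " " + indexFor(34, n));
        // For non-negative X, the mask and the remainder agree.
        System.out.println((37 % n) == indexFor(37, n));
    }
}
```

Note that spreading does not separate 2, 18 and 34: their high 16 bits are all zero, so only a larger table (or a tree bin) helps there.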

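Interview favorites

  1. What changed in HashMap between Java 7 and Java 8?

    • The hash function: Java 7 applies five XORs, Java 8 only one;

    • Entry storage: Java 7 uses linked lists, Java 8 uses linked lists and red-black trees.

Operations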

put

/**
* Note: maps the key k to the value v.
* Associates the specified value with the specified key in this map.
* If the map previously contained a mapping for the key, the old
* value is replaced.
*
* @param key key with which the specified value is to be associated
* @param value value to be associated with the specified key
* @return the previous value associated with <tt>key</tt>, or
*         <tt>null</tt> if there was no mapping for <tt>key</tt>.
*         (A <tt>null</tt> return can also indicate that the map
*         previously associated <tt>null</tt> with <tt>key</tt>.)
*/
public V put(K key, V value) {
  return putVal(hash(key), key, value, false, true);
}

/**
* Implements Map.put and related methods.
*
* @param hash hash for key
* @param key the key
* @param value the value to put
* @param onlyIfAbsent if true, don't change existing value
* @param evict if false, the table is in creation mode.
* @return previous value, or null if none
*/
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
  Node<K,V>[] tab; Node<K,V> p; int n, i;
  // the table is uninitialized or has length 0: allocate it via resize()
  if ((tab = table) == null || (n = tab.length) == 0)
    n = (tab = resize()).length;
  // the bucket is empty: create a new node and put it there
  if ((p = tab[i = (n - 1) & hash]) == null)
    tab[i] = newNode(hash, key, value, null);
  else { // otherwise the bucket already holds a node p
    Node<K,V> e; K k;
    // the first node has the same hash and an equal key
    if (p.hash == hash &&
        ((k = p.key) == key || (key != null && key.equals(k))))
      e = p; // remember the first node in e
    else if (p instanceof TreeNode)
      // otherwise, if the bin is a tree, insert into the tree
      e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
    else {
      // otherwise the bin is a linked list, not a tree
      for (int binCount = 0; ; ++binCount) {
        // reached the end of the list: append the new node
        if ((e = p.next) == null) {
          p.next = newNode(hash, key, value, null);
          // the node count has reached the threshold; treeifyBin() then
          // decides whether to convert the bin to a red-black tree
          if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
            treeifyBin(tab, hash);
          break;
        }
        // a node in the list has the same hash and an equal key
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
          break;
        p = e;
      }
    }
    if (e != null) { // existing mapping for key
      // a node with the same hash and an equal key already exists
      V oldValue = e.value;
      // update the value unless onlyIfAbsent is set and the old value is non-null
      if (!onlyIfAbsent || oldValue == null)
        e.value = value;
      afterNodeAccess(e); // callback hook
      return oldValue;
    }
  }
  // fail-fast counter
  ++modCount;
  if (++size > threshold)
    // the size has exceeded the threshold: resize
    resize();
  afterNodeInsertion(evict); // callback hook
  return null;
}
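The return-value contract described in the Javadoc of put can be observed through the public API. The following is a small usage sketch; the class name PutReturnValues is made up, and it uses putIfAbsent as the public method that exercises the onlyIfAbsent path.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: collects what put / putIfAbsent return in each situation.
final class PutReturnValues {
    static List<Integer> results() {
        Map<String, Integer> map = new HashMap<>();
        List<Integer> out = new ArrayList<>();
        out.add(map.put("a", 1));         // null: there was no previous mapping
        out.add(map.put("a", 2));         // 1: the old value is replaced and returned
        out.add(map.putIfAbsent("a", 3)); // 2: the existing value is kept
        out.add(map.put(null, 0));        // null: a null key is allowed
        out.add(map.get("a"));            // 2
        out.add(map.size());              // 2: the keys are "a" and null
        return out;
    }

    public static void main(String[] args) {
        System.out.println(results());    // [null, 1, 2, null, 2, 2]
    }
}
```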

resize

/**
* Note: initializes the table or doubles its size.
* Initializes or doubles table size.  If null, allocates in
* accord with initial capacity target held in field threshold.
* Otherwise, because we are using power-of-two expansion, the
* elements from each bin must either stay at same index, or move
* with a power of two offset in the new table.
*
* @return the table
*/
final Node<K,V>[] resize() {
  Node<K,V>[] oldTab = table;
  int oldCap = (oldTab == null) ? 0 : oldTab.length;
  int oldThr = threshold;
  int newCap, newThr = 0;
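  // the old capacity is greater than 0, so the table has already been allocated
  if (oldCap > 0) {
    if (oldCap >= MAXIMUM_CAPACITY) {
      // the old capacity has already reached the maximum of 1<<30: set the
      // threshold to Integer.MAX_VALUE (2^31-1) and stop growing
      threshold = Integer.MAX_VALUE;
      return oldTab;
    }
    /**
    * If twice the old capacity is below the maximum capacity and the old
    * capacity is at least the default of 16, double the threshold as well.
    * Since threshold = loadFactor * capacity, doubling the capacity with an
    * unchanged loadFactor doubles the threshold.
    */
    else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
             oldCap >= DEFAULT_INITIAL_CAPACITY)
      newThr = oldThr << 1; // double threshold
  }
  /**
  * The constructor HashMap(int initialCapacity, float loadFactor) contains the
  * statement this.threshold = tableSizeFor(initialCapacity), so the initial
  * capacity is temporarily stored in threshold. Here that stored value becomes
  * the new capacity. This branch runs when a constructor with an initial
  * capacity, such as HashMap(int initialCapacity), was used and no element has
  * been added yet.
  */
  else if (oldThr > 0) // initial capacity was placed in threshold
    newCap = oldThr;
  /**
  * The default constructor was used and no initial capacity was set, so the
  * default capacity DEFAULT_INITIAL_CAPACITY (16) is used and the threshold
  * is 16 * 0.75.
  */
  else {               // zero initial threshold signifies using defaults
    newCap = DEFAULT_INITIAL_CAPACITY;
    newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
  }
  // make sure the threshold is set: the second case (oldThr > 0) never computes
  // newThr, and the first case skips it when its else-if condition fails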
  if (newThr == 0) {
    float ft = (float)newCap * loadFactor;
    newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
              (int)ft : Integer.MAX_VALUE);
  }
  threshold = newThr;
  @SuppressWarnings({"rawtypes","unchecked"})
  Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
  table = newTab;
  if (oldTab != null) {
    // move every entry from the old table into the new, larger one
    for (int j = 0; j < oldCap; ++j) {
      Node<K,V> e;
      if ((e = oldTab[j]) != null) {
        oldTab[j] = null;
        // the bucket holds a single node
        if (e.next == null)
          newTab[e.hash & (newCap - 1)] = e;
        // more nodes follow and the bin is a tree: rehash the tree
        else if (e instanceof TreeNode)
          ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
        else { // preserve order
          // more nodes follow and the bin is a linked list: rehash the list
          Node<K,V> loHead = null, loTail = null;
          Node<K,V> hiHead = null, hiTail = null;
          Node<K,V> next;
          do {
            next = e.next;
            if ((e.hash & oldCap) == 0) {
              if (loTail == null)
                loHead = e;
              else
                loTail.next = e;
              loTail = e;
            }
            else {
              if (hiTail == null)
                hiHead = e;
              else
                hiTail.next = e;
              hiTail = e;
            }
          } while ((e = next) != null);
          if (loTail != null) {
            loTail.next = null;
            newTab[j] = loHead;
          }
          if (hiTail != null) {
            hiTail.next = null;
            newTab[j + oldCap] = hiHead;
          }
        }
      }
    }
  }
  return newTab;
}

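Note that resize involves rehashing, but not every element has to move.

When the table doubles, the index tab[i = (n - 1) & hash] is computed with n replaced by 2n, so the mask changes from n - 1 to 2n - 1 and gains exactly one extra high bit. Only entries whose hash has that bit set are affected, so HashMap tests the bit before deciding whether to move an entry.

// linked list
if ((e.hash & oldCap) == 0) {}
// tree
if ((e.hash & bit) == 0) {}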
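The bit test can be checked against a full recomputation of the index. In the sketch below, newIndex() applies the same (hash & oldCap) test as the list branch of resize(); the class name ResizeSplitSketch and the sample hashes are assumptions made for the example.

```java
// Illustrative only: an entry either stays at index j or moves to j + oldCap.
final class ResizeSplitSketch {
    /** Index of a hash after the table doubles from oldCap to 2 * oldCap. */
    static int newIndex(int hash, int oldCap) {
        int j = hash & (oldCap - 1);                  // index in the old table
        return (hash & oldCap) == 0 ? j : j + oldCap; // test the one new mask bit
    }

    public static void main(String[] args) {
        int oldCap = 16;
        // All four hashes share bucket 5 while the capacity is 16.
        for (int hash : new int[] {5, 21, 37, 53}) {
            int recomputed = hash & (2 * oldCap - 1); // index computed from scratch
            System.out.println(hash + ": " + newIndex(hash, oldCap) + " == " + recomputed);
        }
    }
}
```

Hashes 5 and 37 stay in bucket 5, while 21 and 53 move to bucket 21 (5 + 16), matching the lo and hi lists built in resize().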

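Interview favorites

  1. Why can inserting into a HashMap end in an infinite loop?

    JDK 8 uses head and tail pointers to keep each list in its original order, so concurrent puts from multiple threads no longer cause an infinite loop;

    in JDK 7, concurrent rehashing during resize could produce a circular linked list.

  2. Is HashMap thread-safe? If not, how can thread safety be achieved?

    The comments in HashMap already state that it is not. Thread safety can be obtained with a lock or by wrapping the map with Collections.synchronizedMap(new HashMap(...)). For performance, ConcurrentHashMap should be used instead.

  3. What is the difference between HashMap and Hashtable?

    HashMap is not thread-safe, and both keys and values may be null. The entry whose key is null always goes into the bin headed by table[0].

    Hashtable is thread-safe (its methods are synchronized). Neither keys nor values may be null: a null key causes a NullPointerException when hashCode is called on it, and a null value is rejected by an explicit check.

    HashMap extends AbstractMap, while Hashtable extends the abstract class Dictionary; both implement the Map interface.
    The initial capacity of HashMap is 16 and that of Hashtable is 11; both default to a load factor of 0.75.

    When HashMap grows, the capacity doubles (capacity * 2); when Hashtable grows, the capacity doubles plus one (capacity * 2 + 1).

    Both HashMap and Hashtable are built on an array plus linked lists.
    The two compute the bucket differently: Hashtable takes the key's hashCode directly modulo the length of the table array.

  4. What is fail-fast?

    It is a mechanism in the Java collections. If a collection is structurally modified (entries added or removed) while it is being traversed with an iterator, a ConcurrentModificationException is thrown.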
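Two of the answers above are easy to verify with a short experiment. The sketch below is only a demonstration under simple single-threaded conditions; the class name FailFastDemo and its method names are made up for this example.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Map;

// Illustrative only: fail-fast iteration and null-key handling.
final class FailFastDemo {
    private static Map<String, Integer> sample() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        return map;
    }

    /** Returns true if adding an entry during iteration trips the fail-fast check. */
    static boolean modifyDuringIteration() {
        Map<String, Integer> map = sample();
        try {
            for (String key : map.keySet()) {
                map.put(key + "x", 0); // structural modification: a new mapping
            }
            return false;
        } catch (ConcurrentModificationException expected) {
            return true;
        }
    }

    /** Removing through the iterator itself is allowed; returns the remaining size. */
    static int removeViaIterator() {
        Map<String, Integer> map = sample();
        Iterator<String> it = map.keySet().iterator();
        while (it.hasNext()) {
            if (it.next().equals("a")) {
                it.remove();
            }
        }
        return map.size();
    }

    /** Returns true if Hashtable rejects a null key with a NullPointerException. */
    static boolean hashtableRejectsNullKey() {
        try {
            new Hashtable<String, Integer>().put(null, 1);
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(modifyDuringIteration());
        System.out.println(removeViaIterator());
        System.out.println(hashtableRejectsNullKey());
    }
}
```

The fail-fast check relies on the modCount field shown earlier: the iterator remembers the count it started with and throws as soon as it sees a different one.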

That is all for now; I need some time to digest this myself.


Reposted from www.cnblogs.com/ranyabu/p/12151694.html