Some details you need to know about HashMap

Regarding HashMap, there are some details you need to know. The official documentation describes it as follows:

Hash table based implementation of the Map interface. This implementation provides all of the optional map operations, and permits null values and the null key. (The HashMap class is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls.) This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.


Two important parameters

HashMap has two very important parameters: the capacity and the load factor.

  • Initial capacity The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created.
  • Load factor The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.

Simply put, capacity is the number of buckets, and load factor is the maximum ratio to which those buckets may be filled. If iteration performance matters, do not set the capacity too high or the load factor too low. When the number of filled entries (that is, the number of elements in the HashMap) exceeds capacity * load factor, the table is resized to twice the current number of buckets.
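
To make this concrete, here is a minimal sketch (plain Java, not JDK internals; the class name CapacityDemo is mine) showing how the two parameters interact, assuming the JDK defaults of capacity 16 and load factor 0.75:

    import java.util.HashMap;
    import java.util.Map;

    public class CapacityDemo {
        public static void main(String[] args) {
            int capacity = 16;          // default initial capacity
            float loadFactor = 0.75f;   // default load factor
            int threshold = (int) (capacity * loadFactor); // 12

            // The 13th insertion pushes size past the threshold,
            // so the table doubles from 16 to 32 buckets.
            Map<Integer, String> map = new HashMap<>(capacity, loadFactor);
            for (int i = 1; i <= 13; i++) {
                map.put(i, "v" + i);
            }
            System.out.println("threshold = " + threshold); // 12
        }
    }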

First, let's look at the internal structure of HashMap. It can be seen as a composite structure of an array (Node<K,V>[] table) and linked lists: the array is divided into buckets, and the hash value of a key determines its address in the array; keys whose hash values collide are stored in the same bucket as a linked list, as shown in the diagram below. Note that if the size of a list exceeds a threshold (TREEIFY_THRESHOLD, 8), the list is transformed into a tree structure.

Implementation of the put function

Next, let's look at the put method:

The general idea of the put function:

  1. Hash the key's hashCode(), then compute the index;
  2. If there is no collision, put the entry directly into the bucket;
  3. If there is a collision, append it to the bucket as a linked list;
  4. If collisions make the chain too long (length >= TREEIFY_THRESHOLD), convert the list into a red-black tree;
  5. If the node already exists, replace the old value (ensuring the uniqueness of the key);
  6. If the table is over-full (more than load factor * current capacity), resize.
public V put(K key, V value) {
    // hash the key's hashCode()
    return putVal(hash(key), key, value, false, true);
}
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // if the table is empty, create it
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // compute the index and handle the empty-bucket case
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        // a node with this key already exists
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // the bin is a tree
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        // the bin is a linked list
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // write the value
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    // exceeds load factor * current capacity: resize
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}

Looking at the first few lines of putVal, we can find a few interesting points:

  • If table is null, the resize method is responsible for initializing it, as can be seen from tab = resize().
  • The resize method therefore has two responsibilities: creating the initial table, and expanding it when the capacity no longer meets demand.
  • While placing a new key-value pair, expansion occurs when the following condition is met:
    if (++size > threshold)
    resize();
  • The exact position of a key in the hash table (the array index) depends on the following bit operation:
    i = (n - 1) & hash

Looking carefully at the source of the hash value computation, we find that it is not the key's hashCode itself, but the result of another hash method internal to HashMap. Note the XOR that shifts the high bits onto the low bits: why is it needed? Because for some data the differences between computed hash values lie mainly in the high bits, while HashMap's hash addressing ignores all bits above the current capacity; this step effectively avoids hash collisions in such situations.

    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }
  • As mentioned earlier, when the size of a list structure (here called a bin) reaches a certain threshold, it is converted into a tree; I will analyze later why HashMap processes bins this way.

As we can see, the logic of the putVal method itself is quite concentrated: initialization, expansion, and treeification are all related to it.

Let's further analyze the multiple roles of the resize method; many readers have reported that interviewers often ask about its design.

    
final Node<K,V>[] resize() {
    // ...
    else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
             oldCap >= DEFAULT_INITIAL_CAPACITY)
        newThr = oldThr << 1; // double threshold
    // ...
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {
        // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    // move the entries into the new array structure
    // ...
}

Based on the resize source code, and setting aside the extreme case (the theoretical maximum capacity is specified by MAXIMUM_CAPACITY, whose value is 1 << 30, i.e., 2 to the 30th power), it can be summarized as follows:

  • The threshold equals (load factor) x (capacity); if neither is specified when the HashMap is constructed, the respective default values are used.
  • The threshold is usually adjusted by doubling (newThr = oldThr << 1); as mentioned earlier, according to the putVal logic, the Map is resized when the number of elements exceeds the threshold (see the short sketch after this list).
  • After expansion, the elements in the old array need to be repositioned into the new array, which is the major cost of expansion.
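
Here is a small sketch of how capacity and threshold evolve across doublings; this is illustrative arithmetic, not JDK code, and it assumes the default load factor of 0.75:

    public class ThresholdDemo {
        public static void main(String[] args) {
            int capacity = 16;       // DEFAULT_INITIAL_CAPACITY
            int threshold = 12;      // 16 * 0.75
            for (int i = 0; i < 4; i++) {
                System.out.println("capacity=" + capacity + ", threshold=" + threshold);
                capacity <<= 1;      // power-of-two expansion
                threshold <<= 1;     // newThr = oldThr << 1
            }
            // Prints 16/12, 32/24, 64/48, 128/96.
        }
    }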

Capacity and load factor

We have now quickly walked through the logic of creating a HashMap and putting key-value pairs into it; let's think about why we need to care about the capacity and the load factor.

This is because capacity and load factor determine the number of usable buckets: too many empty buckets waste space, while buckets that are too full seriously hurt the performance of operations. In the extreme case of a single bucket, the map degenerates into a linked list and cannot provide the promised constant-time performance. Since capacity and load factor are so important, how should we choose them in practice?

If you know how many keys the HashMap will need to store, consider setting an appropriate capacity in advance. The specific value can be estimated from the condition under which expansion occurs; based on the code analysis above, we know it needs to satisfy:

capacity * load factor > number of elements

Therefore, the preset capacity should be greater than (estimated number of elements / load factor) and should be a power of 2; the conclusion is quite clear.
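
For example, here is a hedged sketch of pre-sizing for an estimated element count (the count of 1000 and the class name PresizeDemo are made up for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class PresizeDemo {
        public static void main(String[] args) {
            int expected = 1000;           // hypothetical estimated element count
            float loadFactor = 0.75f;      // JDK default
            // capacity must exceed expected / loadFactor: 1000 / 0.75 = 1333.3
            int initialCapacity = (int) (expected / loadFactor) + 1;   // 1334
            // HashMap rounds this up to the next power of two (2048) internally.
            Map<String, Integer> map = new HashMap<>(initialCapacity, loadFactor);
            System.out.println(initialCapacity);
        }
    }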

For load factor, I recommend:

  • If there is no special need, do not change it lightly, because the JDK's default load factor fits the needs of common scenarios.

  • If you do need to adjust it, it is recommended not to set a value above 0.75, as that would significantly increase collisions and reduce HashMap's performance.

  • If you use a load factor that is too small, also adjust the preset capacity according to the formula above; otherwise you may cause more frequent expansion, adding unnecessary overhead, and access performance itself will suffer.

Tree transformation

As mentioned earlier, the tree transformation is mainly implemented in the corresponding logic of putVal and treeifyBin.

    final void treeifyBin(Node<K,V>[] tab, int hash) {
       int n, index; Node<K,V> e;
       if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
           resize();
       else if ((e = tab[index = (n - 1) & hash]) != null) {
           // treeification logic
       }
   }

The above is a stripped-down schematic of treeifyBin. Combining these two methods, the treeification logic becomes very clear; it can be understood as follows: when the number of entries in a bin exceeds TREEIFY_THRESHOLD:

  • If the capacity is less than MIN_TREEIFY_CAPACITY, only a simple expansion (resize) takes place.
  • If the capacity is greater than or equal to MIN_TREEIFY_CAPACITY, the tree transformation takes place.

So why does HashMap need to treeify bins at all?

Essentially this is a security issue. During insertion, objects whose hashes collide are all placed into the same bucket and form a linked list; as we saw, lookup in a list is linear, which can seriously degrade access performance, and an attacker can exploit this by supplying many deliberately colliding keys. A small demonstration follows.
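
This standalone sketch (my own illustration, not JDK code) uses a key type whose hashCode is constant, so every entry lands in the same bucket:

    import java.util.HashMap;
    import java.util.Map;

    public class CollisionDemo {
        static final class BadKey {
            final int id;
            BadKey(int id) { this.id = id; }
            @Override public int hashCode() { return 42; } // constant: all keys collide
            @Override public boolean equals(Object o) {
                return o instanceof BadKey && ((BadKey) o).id == id;
            }
        }

        public static void main(String[] args) {
            Map<BadKey, Integer> m = new HashMap<>();
            for (int i = 0; i < 100; i++) m.put(new BadKey(i), i);
            // Without treeification, every lookup here would walk a 100-node list.
            System.out.println(m.get(new BadKey(99))); // 99
        }
    }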

Implementing the hash function

Next, let's focus on the implementation of the hash function:

    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

This code implements what is often called a "perturbation function". The Java 8 version is somewhat simplified compared with Java 7's, but the principle is the same.

In the code above, key.hashCode() calls the hash function that comes with the key's own type, which returns an int hash value.

Theoretically, hashCode is an int, whose value ranges from -2147483648 to 2147483647. If this value were used directly as an array index, hash collisions would hardly ever occur under normal circumstances. But the array would then need to be about 4 billion entries long, which certainly would not fit in memory. So the hashCode must instead be reduced modulo the array length, and the remainder used as the array index.

That is why we see code like the following in HashMap's get and put methods:

   tab[(n - 1) & hash]

(n - 1) & hash is equivalent to taking hash modulo n (when n is a power of two), but the & operation is faster.
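
A quick check of this equivalence (a standalone sketch, not JDK code; note that plain % on a negative hash would yield a negative result in Java, which the mask avoids):

    public class ModDemo {
        public static void main(String[] args) {
            int n = 16;                    // table length, a power of two
            int hash = -123456789;         // hash values can be negative ints
            // For power-of-two n, (n - 1) & hash equals Math.floorMod(hash, n).
            System.out.println((n - 1) & hash);            // 11, always in [0, 15]
            System.out.println(Math.floorMod(hash, n));    // 11, same value
            System.out.println(hash % n);                  // -5, may be negative
        }
    }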

To see the masking concretely, take an initial length of 16 as an example: 16 - 1 = 15, which in binary is 00000000 00000000 00001111. ANDing 15 with a hashCode value, as shown below, keeps only the lowest four bits:

        10100101 11000100 00100101
&       00000000 00000000 00001111
----------------------------------
        00000000 00000000 00000101    // high bits all zeroed; only the last four bits remain

The problem then becomes: if we only ever take the low bits, the impact of collisions can be very serious.

This is where the value of the "perturbation function" shows itself; look at the picture below:

A right shift of 16 bits is exactly half of 32 bits; XORing the upper half with the lower half mixes the high and low bits of the original hash code, increasing the randomness of the low bits. After mixing, the low bits are doped with features of the high bits, so the high-bit information is also retained in disguised form.
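
To illustrate (a standalone sketch with made-up hash values, assuming a table length of 16): two hashes that differ only in their high bits collide without the perturbation, but are spread apart with it:

    public class PerturbDemo {
        public static void main(String[] args) {
            int n = 16;                            // table length
            int h1 = 0x00010000, h2 = 0x00020000;  // differ only in high bits
            System.out.println((n - 1) & h1);      // 0
            System.out.println((n - 1) & h2);      // 0 -> same bucket: collision
            System.out.println((n - 1) & (h1 ^ (h1 >>> 16))); // 1
            System.out.println((n - 1) & (h2 ^ (h2 >>> 16))); // 2 -> spread apart
        }
    }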

The distribution of most hashCodes is already quite decent, and even when they do collide, the O(log n) tree handles it. XORing only the high and low halves in this way not only reduces overhead, but also prevents collisions caused by the high bits never participating in the index calculation (when the table length is small).

These are the principles behind the design of HashMap's hash function.

The principle of resize

During put, if the current occupancy of the buckets is found to exceed the proportion set by the load factor, a resize occurs. Resizing simply means doubling the bucket array, recalculating the index, and placing the nodes into the new buckets. The resize comments describe it this way:

Initializes or doubles table size. If null, allocates in accord with initial capacity target held in field threshold. Otherwise, because we are using power-of-two expansion, the elements from each bin must either stay at same index, or move with a power of two offset in the new table.

Roughly, this means: when the limit is exceeded, a resize occurs; and because we are using power-of-two expansion (the length is doubled), each element either stays at its original position or moves to its original position plus a power-of-two offset.

In the figure above, for a table of length n, part (a) shows an example of how the index positions of two keys, key1 and key2, are determined before expansion, and part (b) shows how they are determined after expansion, where hash1 is the hash value corresponding to key1 (i.e., the value computed from key1's hashCode) ANDed with the mask as above.

After expansion, since n has doubled, the mask n - 1 has one more 1 bit in the high position (shown in red), so the new index changes as follows:

When resizing, JDK 1.8 does not recalculate the hash as the JDK 1.7 implementation did; it only needs to check whether the newly significant bit of the original hash value is 1 or 0. If it is 0, the index is unchanged; if it is 1, the index becomes "original index + oldCap".

This is the cleverly designed part of resize: during expansion, there is no need to recompute hash values; each element is either at its original position or moved to its original position plus a power-of-two offset.
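
A tiny sketch of this split rule (illustrative values; oldCap = 16 and the class name ResizeIndexDemo are my assumptions):

    public class ResizeIndexDemo {
        public static void main(String[] args) {
            int oldCap = 16;
            int hash = 0b10101;                        // 21
            int oldIndex = hash & (oldCap - 1);        // 5
            int newIndex = hash & (oldCap * 2 - 1);    // 21
            // (hash & oldCap) != 0, so the entry moves to oldIndex + oldCap.
            System.out.println(oldIndex + " -> " + newIndex); // 5 -> 21 (= 5 + 16)
        }
    }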

This article was previously summarized in my notes on GitHub in KnowledgeSummary, a repo mainly covering Android study material and collected interview questions that is continually updated; you are welcome to star and follow it.


Source: juejin.im/post/5d241fe6e51d4556db694a86