HashMap implementation principles and Java 8 changes

Preface

Most Java developers use Maps, and HashMaps in particular. A HashMap is a simple but powerful way to store and retrieve data. But how many developers know how a HashMap works internally?


1. Internal storage

The Java HashMap class implements the interface Map<K, V>. The main methods of this interface are:

  • V put(K key, V value)
  • V get(Object key)
  • V remove(Object key)
  • boolean containsKey(Object key)

HashMap uses an inner class to store data: Entry<K, V>. An entry is a simple key-value pair with two additional fields:

  • A reference to another Entry, so that a HashMap can store entries as a singly linked list
  • The hash value of the key. Storing it avoids recomputing the hash every time the HashMap needs it.

This is part of the Entry implementation in Java 7:

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    Entry<K,V> next;
    int hash;
}

HashMap stores data in multiple singly linked lists of entries (also called buckets or bins). All the lists are registered in an array of Entry (Entry<K, V>[] array), and the default capacity of this internal array is 16.

The figure below shows the internal storage of a HashMap instance with an array of nullable entries. Each entry can link to another entry to form a linked list.
All keys with the same hash value are placed in the same linked list (bucket). Keys with different hash values can also end up in the same bucket.

When a user calls put(K key, V value) or get(Object key), the function computes the index of the bucket in which the Entry should be. Then the function iterates through the list, looking for the Entry that has the same key (using the key's equals() method).

For get(), the function returns the value associated with the entry (if the entry exists).

For put(K key, V value), if the entry exists, the function replaces its value with the new value; otherwise it creates a new entry (from the key and value passed as arguments) at the head of the singly linked list.
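The lookup and insertion logic described above can be sketched as follows. This is a minimal illustration, not the real java.util.HashMap: the class name TinyChainedMap is hypothetical, the capacity is fixed at 16, and resizing is left out.

```java
// A minimal sketch of Java 7-style chaining; not the real java.util.HashMap.
// TinyChainedMap and its members are hypothetical names used for illustration.
class TinyChainedMap<K, V> {
    // One node of a bucket's singly linked list.
    static final class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;
        final int hash;

        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Fixed capacity of 16 (a power of 2); resizing is omitted in this sketch.
    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    private static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    private int indexFor(int hash) {
        return hash & (table.length - 1);
    }

    // Walks the bucket's list and returns the value of the matching entry, or null.
    V get(Object key) {
        int h = hash(key);
        for (Entry<K, V> e = table[indexFor(h)]; e != null; e = e.next) {
            if (e.hash == h && java.util.Objects.equals(e.key, key)) {
                return e.value;
            }
        }
        return null;
    }

    // Replaces the value if the key exists, otherwise adds a new entry at the head.
    V put(K key, V value) {
        int h = hash(key);
        int i = indexFor(h);
        for (Entry<K, V> e = table[i]; e != null; e = e.next) {
            if (e.hash == h && java.util.Objects.equals(e.key, key)) {
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        table[i] = new Entry<>(h, key, value, table[i]);
        return null;
    }

    public static void main(String[] args) {
        TinyChainedMap<String, Integer> map = new TinyChainedMap<>();
        map.put("a", 1);
        map.put("a", 2);                        // same key: the value is replaced
        System.out.println(map.get("a"));       // prints 2
        System.out.println(map.get("missing")); // prints null
    }
}
```

Comparing the stored hash before calling equals() is a cheap filter: equals() only runs for entries whose hash already matches.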

The map computes the index of the bucket (linked list) in 3 steps:

  • It first gets the hash code of the key.
  • It rehashes the hash code to guard against a poor hash function in the key that would put all the data into the same index (bucket) of the internal array.
  • It takes the rehashed hash code and bit-masks it with the length of the array (minus 1). This operation ensures that the index cannot be greater than the size of the array. You can think of it as a computationally optimized modulo function.

This is the Java 7 and Java 8 source code that handles the index:

// the "rehash" function in Java 7 that takes the hashcode of the key
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

// the "rehash" function in Java 8 that directly takes the key
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// the function that returns the index from the rehashed hash
static int indexFor(int h, int length) {
    return h & (length - 1);
}
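To see why the rehash step matters, here is a small sketch that compares the bucket index with and without the Java 8 style rehash. The class name BucketIndexDemo and the method name spread are made up for this example; the two keys were chosen because their hash codes differ only in the upper 16 bits.

```java
// Illustrative demo; BucketIndexDemo and spread are hypothetical names.
class BucketIndexDemo {
    // Same logic as the Java 8 hash(Object) function shown above.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Bit mask with (length - 1); length is assumed to be a power of 2.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        // Integer.hashCode() is the int value itself, so these raw hashes are 0 and 65536.
        Integer[] keys = {0, 65536};
        for (Integer key : keys) {
            int raw = key.hashCode();
            System.out.printf("key=%d rawIndex=%d spreadIndex=%d%n",
                    key, indexFor(raw, length), indexFor(spread(key), length));
        }
        // prints:
        // key=0 rawIndex=0 spreadIndex=0
        // key=65536 rawIndex=0 spreadIndex=1
    }
}
```

Without the rehash, both keys fall into bucket 0 because the mask only looks at the low 4 bits. XOR-ing in the upper 16 bits lets the high bits influence the index.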

In order to work effectively, the size of the internal array needs to be a power of 2. Let's see why.

Suppose the array size is 17; the mask value is then 16 (size - 1). The binary representation of 16 is 0...010000, so for any hash value H, the index generated by the bitwise formula "H AND 16" is either 16 or 0. This means that an array of size 17 would only use 2 buckets: the one at index 0 and the one at index 16, which is not efficient.
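A quick way to check this claim is to collect every index the mask can produce over a range of hash values. The sketch below does that; the class name MaskDemo and the sample range 0 to 999 are arbitrary choices for illustration.

```java
import java.util.TreeSet;

// Illustrative demo; MaskDemo and reachable are hypothetical names.
class MaskDemo {
    // Collects the distinct indices produced by h & (length - 1) for h in [0, 1000).
    static TreeSet<Integer> reachable(int length) {
        TreeSet<Integer> seen = new TreeSet<>();
        for (int h = 0; h < 1000; h++) {
            seen.add(h & (length - 1));
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(reachable(17));        // prints [0, 16]
        System.out.println(reachable(17).size()); // prints 2
    }
}
```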

However, if you use a size that is a power of 2 (such as 16), the bitwise index formula is "H AND 15". The binary representation of 15 is 0...001111, so the index formula can output values from 0 to 15, and the array of size 16 is fully used. For example:

  • If H = 952, its binary representation is 0...0111011 1000, and the associated index is 0...0 1000 = 8
  • If H = 1576, its binary representation is 0...01100010 1000, and the associated index is 0...0 1000 = 8
  • If H = 12356146, its binary representation is 0...010111100100010100011 0010, and the associated index is 0...0 0010 = 2
  • If H = 59843, its binary representation is 0...0111010011100 0011, and the associated index is 0...0 0011 = 3
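The four examples above can be reproduced with a few lines of Java. The class name IndexExamples is hypothetical; Integer.toBinaryString is used only to display the bits.

```java
// Illustrative check of the worked examples; IndexExamples is a hypothetical name.
class IndexExamples {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int[] hashes = {952, 1576, 12356146, 59843};
        for (int h : hashes) {
            // Only the last 4 bits survive the mask when the array length is 16.
            System.out.println("H=" + h
                    + " binary=" + Integer.toBinaryString(h)
                    + " index=" + indexFor(h, 16));
        }
        // The indexes printed are 8, 8, 2 and 3.
    }
}
```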

This is why the array size is a power of 2. This mechanism is transparent to the developer: if they request a HashMap with a capacity of 37, the map automatically picks the next power of 2 after 37 (64) as the size of its internal array.
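One way to round a requested capacity up to a power of 2 is the bit-smearing trick sketched below. It is modelled on the approach used in Java 8's HashMap, but the class and method names here (CapacityDemo, roundUpToPowerOfTwo) are made up, and the upper capacity limit is left out for brevity.

```java
// Illustrative sketch; CapacityDemo and roundUpToPowerOfTwo are hypothetical names.
class CapacityDemo {
    // Returns the smallest power of 2 that is >= cap (for 1 <= cap <= 2^30).
    static int roundUpToPowerOfTwo(int cap) {
        int n = cap - 1;   // so that an exact power of 2 maps to itself
        n |= n >>> 1;      // copy the highest set bit into every lower position
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(37)); // prints 64
        System.out.println(roundUpToPowerOfTwo(16)); // prints 16
    }
}
```

For 37, n starts at 36 (binary 100100), the shifts turn it into 63 (binary 111111), and adding 1 gives 64.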

To be continued...

Origin blog.csdn.net/plqaxx/article/details/108732084