Java — HashMap — Underlying Principles

Overview

Other URLs

HashMap design principle, data structure, and source code implementation (three thousand words omitted here) — CSDN blog
6 Data Structures — 6.5 HashMap — "Simon's Technical Notes" — BookStack

Comparison of JDK1.7 and JDK1.8

data structure

Arrays and linked lists

Data structures offer both arrays and linked lists for storing data, but each has its own pros and cons.

Item                     Array                                            Linked list
Memory footprint         Heavy (requires a contiguous block of memory)    Light (storage is non-contiguous)
Search speed             Fast (O(1) by index)                             Slow (O(N) traversal)
Insert/delete speed      Slow                                             Fast

Hash table

Hash table: a data structure that combines the strengths of arrays and linked lists: easy to find (address), easy to insert and delete, and with a medium memory footprint.

There are many different ways to implement a hash table; HashMap uses the zipper method, also known as the chain address method (separate chaining).

 

The initial length of the hash table's array is 16, and each element stores the head node of a linked list. By what rule are elements placed into the array? Generally by hash(key) % len: the key's hash value modulo the array length. For example, in the hash table above:

12%16=12,28%16=12,108%16=12,140%16=12

So 12, 28, 108, and 140 are all stored at index 12 of the array.
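The bucket rule above can be sketched as follows. This is illustrative only, not HashMap source; the class and method names (BucketIndexDemo, bucketIndex) are made up for the example:

```java
// Illustrative only: keys whose value modulo the table length is equal
// always land in the same bucket of the hash table.
public class BucketIndexDemo {
    static int bucketIndex(int key, int tableLength) {
        return key % tableLength;
    }

    public static void main(String[] args) {
        int len = 16; // the initial table length used in the example above
        for (int key : new int[]{12, 28, 108, 140}) {
            System.out.println(key + " -> bucket " + bucketIndex(key, len));
        }
        // all four keys map to bucket 12
    }
}
```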

Access mechanism

Key-value pair data

        Each key-value pair is a Node<K,V> object, which implements the Map.Entry<K,V> interface. A Node<K,V> has four fields: hash, key, value, and next (the next node).

public class HashMap<K,V> extends AbstractMap<K,V>
    implements Map<K,V>, Cloneable, Serializable {
    static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;
    }
    // other code omitted
}

Since the backing structure is a linear array, how can HashMap access it randomly by key? HashMap uses a small algorithm, roughly implemented like this:

// On store:
int hash = key.hashCode(); // hashCode() is not detailed here; just note that each key's hash is a fixed int value
int index = hash % Entry[].length;
Entry[index] = value;
 
// On retrieve:
int hash = key.hashCode();
int index = hash % Entry[].length;
return Entry[index];

put

Hash collision

What if two keys get the same index via hash % Entry[].length? HashMap uses a linked list: the Entry class has a next field that points to the next Entry.

For example:
         The first key-value pair A arrives; computing its key's hash gives index = 0, recorded as Entry[0] = A.
         Later, key-value pair B arrives, and its computed index is also 0. HashMap then does: B.next = A; Entry[0] = B.
         If C arrives and its index is also 0, then: C.next = B; Entry[0] = C.
         So at index = 0 the three key-value pairs A, B, and C are stored using head insertion into a singly linked list, chained together through the next field. Each insertion overwrites the array slot, so the slot always holds the most recently inserted entry (the head of the list). At this point the rough implementation of HashMap should be clear.
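The head-insertion behavior described above can be sketched in a few lines. This is a minimal sketch, not HashMap source; HeadInsertDemo and its simplified Entry (key only, no hash or value) are made-up names for illustration:

```java
// A minimal sketch of the head-insertion collision handling described
// above: each insert makes the new entry the head of the bucket's list.
public class HeadInsertDemo {
    static class Entry {
        final String key;
        Entry next;
        Entry(String key, Entry next) { this.key = key; this.next = next; }
    }

    static Entry[] table = new Entry[16];

    static void put(int index, String key) {
        // the new node points at the old head, then replaces it in the slot
        table[index] = new Entry(key, table[index]);
    }

    public static void main(String[] args) {
        put(0, "A");
        put(0, "B"); // B.next = A
        put(0, "C"); // C.next = B
        // Walking the bucket yields C, B, A: the last insert is the head.
        for (Entry e = table[0]; e != null; e = e.next) {
            System.out.print(e.key + " ");
        }
    }
}
```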

public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value); // a null key is always placed in the array's first bucket
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        //traverse the linked list in this bucket
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            //if the key already exists in the list, replace the old value with the new one
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }
        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }
 
void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e); // parameter e becomes the new Entry's next
    //if size exceeds threshold, enlarge the table and rehash
    if (size++ >= threshold)
            resize(2 * table.length);
}

get

public V get(Object key) {
    if (key == null)
        return getForNullKey();
    int hash = hash(key.hashCode());
    //locate the array bucket first, then traverse its linked list
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
            return e.value;
    }
    return null;
}

null key access

The null key is always stored in the first element of the Entry[] array.

private V putForNullKey(V value) {
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    modCount++;
    addEntry(0, null, value, 0);
    return null;
}

private V getForNullKey() {
    for (Entry<K,V> e = table[0]; e != null; e = e.next) {
        if (e.key == null)
            return e.value;
    }
    return null;
}

Determining the array index

index = hashCode % table.length (modulo)

When accessing a HashMap, we must compute which element of the Entry[] array the current key corresponds to, i.e. the array index. The algorithm is as follows:

// Returns index for hash code h.
static int indexFor(int h, int length) {
    return h & (length-1);
}

 The bitwise AND h & (length - 1) is equivalent to the modulo/remainder operation % when length is a power of two. Note that identical array indices do not imply identical hashCodes.
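The equivalence can be checked directly. A small sketch (IndexForDemo is a made-up name; indexFor mirrors the JDK method shown above):

```java
// Sketch: for a power-of-two length, (h & (length - 1)) matches h % length
// for non-negative h, which is why indexFor avoids the slower % operator.
public class IndexForDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        for (int h = 0; h < 1000; h++) {
            if (indexFor(h, length) != h % length)
                throw new AssertionError("mismatch at " + h);
        }
        System.out.println("bitwise AND and % agree for length " + length);
    }
}
```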

initial table size

public HashMap(int initialCapacity, float loadFactor) {
    .....
    // Find a power of 2 >= initialCapacity
    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;
    this.loadFactor = loadFactor;
    threshold = (int)(capacity * loadFactor);
    table = new Entry[capacity];
    init();
}

Note that the initial size of the table is not the initialCapacity passed to the constructor!!

It is the smallest power of 2 that is >= initialCapacity!
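The rounding-up behavior of the constructor loop can be reproduced in isolation. A sketch (CapacityDemo and tableSizeFor are made-up names; the loop body is the same as in the constructor above):

```java
// Sketch of the constructor loop above: the table size is the smallest
// power of two that is >= initialCapacity, not initialCapacity itself.
public class CapacityDemo {
    static int tableSizeFor(int initialCapacity) {
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(17)); // 32, not 17
        System.out.println(tableSizeFor(16)); // 16 (already a power of two)
    }
}
```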

Hash collision

Other URL

Hash conflicts and four solutions — CSDN blog
A preliminary understanding of hash conflicts — CSDN blog
Data Structures and Algorithms: resolving hash conflicts — Zhihu

Introduction

 There are four ways to resolve hash conflicts:

  • Open addressing
  • Re-hashing
  • Chaining (the chain address method)
  • Establishing a public overflow area

1. Open addressing method

When the hash address p = H(key) of the keyword key conflicts, another hash address p1 is generated from p; if p1 still conflicts, another hash address p2 is generated from p, and so on, until a non-conflicting hash address pi is found; the element is stored there.

Hi = (H(key) + di) % m   (i = 1, 2, …, n)

The open addressing method has the following three variants:

  1. Linear probing and rehashing
    1. Check the next cell in order until an empty cell is found or the whole table has been searched
    2. di = 1, 2, 3, …, m-1
  2. Quadratic (square) probing and rehashing
    1. Probe in jumps to the left and right of the original position until an empty cell is found or the whole table has been searched
    2. di = 1², -1², 2², -2², …, k², -k² (k <= m/2)
  3. Pseudo-random probing and rehashing
    1. di = a pseudo-random number sequence. In practice, build a pseudo-random number generator (e.g. i = (i + p) % m) and pick a random number as the starting point.

      For example, given a hash table of length m = 11 and hash function H(key) = key % 11, we have H(47) = 3, H(26) = 4, H(60) = 5. Suppose another keyword is 69; then H(69) = 3, which conflicts with 47.

      Using linear probing to handle the conflict, the next hash address is H1 = (3 + 1) % 11 = 4; still a conflict, so the next is H2 = (3 + 2) % 11 = 5; still a conflict, so the next is H3 = (3 + 3) % 11 = 6. There is no conflict now, so 69 is placed in cell 6.

      Using quadratic probing to handle the conflict, the next hash address is H1 = (3 + 1²) % 11 = 4; still a conflict, so the next is H2 = (3 - 1²) % 11 = 2. There is no conflict now, so 69 is placed in cell 2.

      Using pseudo-random probing with the pseudo-random sequence 2, 5, 9, …, the next hash address is H1 = (3 + 2) % 11 = 5; still a conflict, so the next is H2 = (3 + 5) % 11 = 8. There is no conflict now, so 69 is placed in cell 8.
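The linear-probing case of the worked example can be run directly. A minimal sketch (LinearProbingDemo is a made-up name) reproducing m = 11, H(key) = key % 11, with 47, 26, 60 inserted before 69:

```java
// A minimal linear-probing sketch reproducing the worked example above.
public class LinearProbingDemo {
    static Integer[] table = new Integer[11];

    static int insert(int key) {
        int p = key % 11;
        while (table[p] != null)   // probe the next cell: di = 1, 2, 3, ...
            p = (p + 1) % 11;
        table[p] = key;
        return p;
    }

    public static void main(String[] args) {
        insert(47); // H(47) = 3
        insert(26); // H(26) = 4
        insert(60); // H(60) = 5
        int slot = insert(69); // collides at 3, 4, 5; lands in cell 6
        System.out.println("69 stored at cell " + slot);
    }
}
```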

Advantages

  1. Easy to serialize
  2. If the total amount of data can be predicted, a perfect hash sequence can be created

Disadvantages

  1. Wastes space. (To reduce conflicts, open addressing requires a small fill factor α, so a lot of space is wasted when nodes are large.)
  2. Deleting nodes is troublesome. You cannot simply clear a deleted node's slot, or the search path to synonym nodes stored after it would be cut off, because in the various open addressing methods an empty address cell (i.e. an open address) is the condition for search failure. Therefore, when deleting from a hash table that handles conflicts with open addressing, a node can only be marked as deleted, not actually removed.

2. Re-hashing

Provide multiple hash functions. If the hash value of the key calculated by the first hash function conflicts, the second hash function is used to calculate the hash value of the key.
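A minimal sketch of the idea of trying successive hash functions. The family H_i(key) = (key + i * 7) % 11 and the name RehashDemo are invented for illustration; real implementations would choose independent, well-distributed functions:

```java
// Sketch of re-hashing: hash with the first function; on collision,
// try the next function in a (hypothetical) family of hash functions.
public class RehashDemo {
    static final int M = 11;
    static Integer[] table = new Integer[M];

    // Hypothetical family of hash functions: H_i(key) = (key + i * 7) % M.
    static int hash(int key, int i) {
        return (key + i * 7) % M;
    }

    static int insert(int key) {
        int i = 0;
        int p = hash(key, i);
        while (table[p] != null)   // collision: try the next hash function
            p = hash(key, ++i);
        table[p] = key;
        return p;
    }

    public static void main(String[] args) {
        System.out.println(insert(3));  // H_0(3) = 3, no conflict
        System.out.println(insert(14)); // H_0(14) = 3 conflicts; H_1(14) = 10
    }
}
```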

Advantages

  1. Less prone to clustering

Disadvantages

  1. Increases computation time

3. Chain address method

Entries with the same hash bucket are connected with a linked list. (HashMap uses this method.)

Advantages

  1. Conflict handling is simple and there is no accumulation (clustering): non-synonyms never conflict, so the average search length is shorter.
  2. Suitable when the total number of entries changes often. (Node space in the zipper method is allocated dynamically.)
  3. Space-efficient. The fill factor can be α ≥ 1, and when nodes are large the extra pointer field added by the zipper method is negligible.
  4. Deleting a node is easy to implement: simply remove the corresponding node from the linked list.

Disadvantages

  1. Query efficiency is lower. (Storage is dynamic, so queries spend extra time following pointers.)
  2. When the keys can be predicted and there are no subsequent insertions or modifications, open addressing outperforms chaining.
  3. Not easy to serialize.

4. Establish a public overflow area

The hash table is divided into two parts: the basic table and the overflow table. All elements that conflict with the basic table are filled in the overflow table.

Expansion mechanism

Other URL

HashMap expansion mechanism — resize() — CSDN blog (Pan Jiannan)
HashMap expansion mechanism — Jianshu
HashMap expansion mechanism — resize() — CSDN blog (IM_MESSI)
HashMap expansion mechanism — CSDN blog (mengyue000)

When to expand

        HashMap is lazily loaded. After constructing the HashMap object, if no element is inserted with put, the table is neither initialized nor expanded.

        When put is called for the first time, HashMap finds that the table is empty and calls resize to initialize it.
        On subsequent calls to put, if HashMap finds that size (the number of stored key-value pairs) is greater than threshold (the current array length multiplied by the load factor), it calls resize to expand.

        An array cannot grow automatically, so expansion replaces it with a larger array and re-fills the previous elements along with the new ones to be added.

Overview of resize()

  1. Check whether the old array's capacity has already reached the maximum (2^30) before expanding
    1. If it has, set the threshold to Integer's maximum value (2^31 - 1); no further expansion will occur.
    2. If it has not, double the array size
  2. Create a new array (Node<K, V>[]) with the new size
  3. Move the data into the new array (Node<K, V>[])
    1. Not all nodes necessarily change position.
      1. For example, if the original array size is 16 and becomes 32 after expansion, consider two entries with hash values 1 and 17: both leave remainder 1 modulo 16, so they share a bucket. After expansion, 1 mod 32 is still 1, but 17 mod 32 becomes 17, so that node must move.
        1. The corresponding code is: if ((e.hash & oldCap) == 0) — when true, the node does not need to move.
  4. Return the new Node<K, V>[] array
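The stay-or-move check from the overview can be run on the hash values 1 and 17 used in the example. A sketch (ResizeSplitDemo and newIndex are made-up names; the test expression is the one from resize):

```java
// Sketch of the "does this node move?" check: with oldCap = 16,
// hash 1 stays in bucket 1 while hash 17 moves to bucket 1 + 16.
public class ResizeSplitDemo {
    static int newIndex(int hash, int oldCap, int oldIndex) {
        // (hash & oldCap) == 0 means the newly significant bit is 0,
        // so the node keeps its bucket; otherwise it moves up by oldCap.
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 16;
        System.out.println(newIndex(1, oldCap, 1));  // 1  (stays)
        System.out.println(newIndex(17, oldCap, 1)); // 17 (moves to 1 + 16)
    }
}
```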
Class      Initial capacity   Maximum capacity        Growth on expansion   Load factor   Underlying implementation
HashMap    2^4 (16)           2^30                    n * 2                 0.75          Map.Entry
HashSet    2^4 (16)           2^30                    n * 2                 0.75          HashMap<E,Object>
Hashtable  11                 Integer.MAX_VALUE - 8   n * 2 + 1             0.75          Hashtable.Entry

        In HashMap, the length of the hash bucket array table must be a power of 2 (a non-prime). This is an unconventional design; the conventional design is to make the bucket count a prime number, since, relatively speaking, primes produce fewer collisions than non-primes (for a proof, see http://blog.csdn.net/liuqiyao_01/article/details/14475159). Hashtable's initial bucket size of 11 is an application of the prime-size design (though Hashtable cannot guarantee a prime size after expansion).

        HashMap adopts this unconventional design mainly to optimize the modulo operation and expansion; to reduce collisions, HashMap also mixes the high bits of the hash into the computation when locating the bucket index.

Source code

 HashMap#resize()

final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {               // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode) // red-black tree handling omitted here
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        if ((e.hash & oldCap) == 0) { // key point 1: decide whether this node must move to a different bucket after resize
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    // key point 2: split the bucket's list into two lists: one whose nodes keep their position and one whose nodes move
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}

Traversal method and its performance

Reference URL

Performance analysis and comparison of three HashMap loop traversal methods — jb51 (Script Home)
Java iterators (and the difference between iterators and for loops) — redcoatjk — cnblogs

Traversal methods

Method: for-each over map.entrySet()

Map<String, String> map = new HashMap<String, String>();
for (Entry<String, String> entry : map.entrySet()) {
  entry.getKey();
  entry.getValue();
}

Method: iterator over the set returned by map.entrySet()

Iterator<Map.Entry<String, String>> iterator = map.entrySet().iterator();
while (iterator.hasNext()) {
  Map.Entry<String, String> entry = iterator.next();
  entry.getKey();
  entry.getValue();
}

Method: for-each over map.keySet(), then call get for each value

Map<String, String> map = new HashMap<String, String>();
for (String key : map.keySet()) {
  map.get(key);
}

Comparison of traversal methods

Performance test and comparison of three traversal methods

Test environment: Windows7 32-bit system 3.2G dual-core CPU 4G memory, Java 7, Eclipse -Xms512m -Xmx512m

Test Results:

map size                10,000   100,000   1,000,000   2,000,000
for each entrySet       2ms      6ms       36ms        91ms
for iterator entrySet   0ms      4ms       35ms        89ms
for each keySet         1ms      6ms       48ms        126ms

Analysis of the traversal results (as the table above shows):

  • for each entrySet and for iterator entrySet are equivalent in performance
  • for each keySet is more time-consuming because it must call get(key) to obtain each value (and even more so if the hash algorithm is poor)
  • To delete from the map while looping, only for iterator entrySet can be used (introduced in HashMap Non-Thread Safety).
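The deletion case from the last bullet looks like this in practice. A small example (IteratorRemoveDemo and removeEvenValues are made-up names); removing through the iterator avoids the ConcurrentModificationException that map.remove() inside the loop would trigger:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class IteratorRemoveDemo {
    // Remove all even-valued entries while iterating over the map.
    static void removeEvenValues(Map<String, Integer> map) {
        Iterator<Map.Entry<String, Integer>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Integer> entry = it.next();
            if (entry.getValue() % 2 == 0)
                it.remove(); // safe: removal goes through the iterator, not the map
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        removeEvenValues(map);
        System.out.println(map); // "b" has been removed
    }
}
```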

HashMap entrySet source code

private final class EntryIterator extends HashIterator<Map.Entry<K,V>> {
  public Map.Entry<K,V> next() {
    return nextEntry();
  }
}

HashMap keySet source code

private final class KeyIterator extends HashIterator<K> {
  public K next() {
    return nextEntry().getKey();
  }
}

Known from the source code:

Both keySet() and entrySet() return iterators over a Set. The parent class is the same and only the return value differs, so their performance is similar; keySet() merely adds one extra step of getting the value by key. get's time complexity depends on the number of iterations of its for loop, i.e. on the hash algorithm.

public V get(Object key) {
  if (key == null)
    return getForNullKey();
  Entry<K,V> entry = getEntry(key);
  return null == entry ? null : entry.getValue();
}
/**
 * Returns the entry associated with the specified key in the
 * HashMap. Returns null if the HashMap contains no mapping
 * for the key.
 */
final Entry<K,V> getEntry(Object key) {
  int hash = (key == null) ? 0 : hash(key);
  for (Entry<K,V> e = table[indexFor(hash, table.length)];
     e != null;
     e = e.next) {
    Object k;
    if (e.hash == hash &&
      ((k = e.key) == key || (key != null && key.equals(k))))
      return e;
  }
  return null;
}

 Usage summary

Method                                         Use case
for each map.entrySet()                        The loop needs both key and value, with no deletion from the map
iterator over map.entrySet()                   The loop needs both key and value, and must delete from the map
for each map.keySet()                          The loop needs only the key

hashCode method

Other URLs

Implementation of Java String's hashCode() method — CSDN blog (timothytt)

Source code (String#hashCode)

String#hashCode 

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

Why multiply by 31?

31 is chosen because it is an odd prime, which carries two properties: odd and prime.

1. Why odd — why not even?

    If the multiplier were even, then whenever the multiplication overflows (the number grows too large for an int), it amounts to a pure shift operation and information is lost.

    For example, with only 2 bits of space, binary 10 multiplied by 2 is a left shift by 1: 10(bin) << 1 = 00, and the 1 is lost.

2. Why prime?

    The author essentially said: it's tradition; primes just behave well.

    So the next question is: of all the odd primes, why 31 in particular?

3. Why exactly 31?

    h*31 == (h<<5)-h; modern VMs perform this optimization automatically, so it computes fast.

    Compare the other candidates under this selection standard:

    h*7 == (h<<3)-h;   // too small; hash collisions are more likely

    h*15 == (h<<4)-h;  // 15 is not prime

    h*31 == (h<<5)-h;  // 31 is prime and neither too large nor too small

    h*63 == (h<<6)-h;  // 63 is not prime

    h*127 == (h<<7)-h; // too large; overflows after only a few multiplications

Tracing an example

"abc".hashCode()

hash is 0
value is [a, b, c]

First iteration:  h = 0 + 97 = 97

Second iteration: h = 31 * 97 + 98 = 3105

Third iteration:  h = 3105 * 31 + 99 = 96354
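The trace can be verified against the real String#hashCode. A small sketch (HashCodeTraceDemo and its hash helper are made-up names; the recurrence is the one from the source above):

```java
// Re-running the "abc" trace: h = 31*h + val[i] over 'a' (97), 'b' (98),
// 'c' (99) yields 96354, which matches the real String#hashCode.
public class HashCodeTraceDemo {
    static int hash(String s) {
        int h = 0;
        for (char c : s.toCharArray())
            h = 31 * h + c; // same recurrence as String#hashCode
        return h;
    }

    public static void main(String[] args) {
        int h = hash("abc");
        System.out.println(h);                       // 96354
        System.out.println("abc".hashCode());        // 96354
        System.out.println(31 * h == (h << 5) - h);  // true: the shift identity
    }
}
```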


Origin: blog.csdn.net/feiying0canglang/article/details/115184790