【2023】HashMap detailed source code analysis and interpretation

Preface

Before digging into HashMap, let's first introduce the data structures it uses. Since JDK 1.8, the red-black tree has been added to HashMap's implementation to improve lookup efficiency.

Tree

In computer science, a tree is an abstract data type (ADT), or a data structure implementing that ADT, used to model a collection of data with a hierarchical, tree-like structure. It consists of n (n > 0) finite nodes organized into a hierarchy. It is called a "tree" because it looks like an upside-down tree: the root points up and the leaves point down. It has the following characteristics:

  • Each node has only a limited number of child nodes or no child nodes;
  • A node without a parent node is called a root node;
  • Each non-root node has one and only one parent node;
  • Apart from the root node, the remaining nodes can be partitioned into multiple disjoint subtrees;
  • There are no cycles in the tree

The classification includes binary trees, binary search trees, red-black trees, B-trees, B+ trees, etc.

1. Binary tree

  • Each node contains at most two child nodes, namely the left child node and the right child node.
  • A node is not required to have two children; some nodes have only a left child, others only a right child.
  • The left subtree and right subtree of each node of the binary tree also satisfy the first two definitions respectively.

2. Binary search tree

  • For any node in the tree, every node in its left subtree has a key less than that node's key, and every node in its right subtree has a key greater than that node's key.
  • No two nodes have equal keys.
  • Normally, the time complexity of searching a binary search tree is O(log n).
    Because a plain binary search tree cannot rotate to rebalance itself, a worst case can occur in which the left and right subtrees become extremely unbalanced and the tree degenerates into a linked list, making search O(n).
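
As a minimal illustration (not from the HashMap source; `BstDemo`, `Node`, `insert`, and `search` are hypothetical names), the binary search tree rules above can be sketched like this:

```java
// Minimal binary search tree sketch: insert and search (hypothetical names).
public class BstDemo {
    static final class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    // Insert a key following the BST rule: smaller keys go left, larger go right.
    static Node insert(Node root, int key) {
        if (root == null) return new Node(key);
        if (key < root.key) root.left = insert(root.left, key);
        else if (key > root.key) root.right = insert(root.right, key);
        return root; // equal keys are not stored twice
    }

    // Search is O(log n) on a balanced tree, O(n) when it degenerates into a list.
    static boolean search(Node root, int key) {
        while (root != null) {
            if (key == root.key) return true;
            root = key < root.key ? root.left : root.right;
        }
        return false;
    }

    public static void main(String[] args) {
        Node root = null;
        for (int k : new int[]{8, 3, 10, 1, 6}) root = insert(root, k);
        System.out.println(search(root, 6));  // true
        System.out.println(search(root, 7));  // false
    }
}
```

Inserting keys in sorted order (1, 3, 6, 8, 10) would produce the degenerate, list-like worst case described above.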

3. Red-black trees

  • Every node is either red or black
  • The root node is black
  • Every leaf node (NIL node) is black
  • Both children of a red node are black
  • Every path from any node down to its leaves contains the same number of black nodes
  • After an insertion or deletion, if the five properties above are violated, rotation and recoloring operations restore them
  • The time complexity of search, insert, and delete operations is O(log n)

Hash table

A hash table (also called a hash map) is a data structure that accesses the value stored at a memory location directly based on a key. It evolved from the array and relies on the array's support for random access to data by index. The function that maps a key to an array index is called a hash function, which can be expressed as: hashValue = hash(key)
The basic requirements of the hash function:
The basic requirements of the hash function:

  • The hash value computed by the hash function must be a non-negative integer, because hashValue is used as an array index.
  • If key1 == key2, the hashes must be equal: hash(key1) == hash(key2)
  • If key1 != key2, the hashes should be different: hash(key1) != hash(key2)
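
A minimal sketch (hypothetical helper, not HashMap's actual code) of a function that satisfies the first two requirements by mapping a key's hashCode to a non-negative array index:

```java
public class HashFnDemo {
    // Map a key to a non-negative index in [0, capacity): same key -> same index.
    static int indexFor(Object key, int capacity) {
        int h = key.hashCode();
        // Math.floorMod guarantees a non-negative result even for negative hashCodes
        return Math.floorMod(h, capacity);
    }

    public static void main(String[] args) {
        System.out.println(indexFor("apple", 16) == indexFor("apple", 16)); // true: deterministic
        System.out.println(indexFor("apple", 16) >= 0);                     // true: non-negative
    }
}
```

The third requirement (distinct keys always hashing differently) cannot be fully met in practice, which leads directly to the next topic.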

Hash collision

In practice, it is almost impossible to find a hash function that computes a distinct hash value for every distinct key. As a result, multiple keys will be mapped to the same array index after hashing. This situation is called a hash collision (also known as a hash conflict).

Chaining (zipper method)

To resolve hash collisions, a method called chaining (the "zipper method") is generally used.
In a hash table, each index position of the array can be called a bucket (or slot). Each bucket corresponds to a linked list, and all elements whose hashes map to the same slot are placed in that slot's linked list. This approach is called chaining.

  • On insertion, the slot is computed via the hash function and the element is inserted into the corresponding linked list; the time complexity of insertion is O(1)
  • On search or deletion, the slot is likewise computed via the hash function, and then the linked list is traversed to find or delete the element
    • On average, the time complexity of a lookup with chaining is O(1)
    • In the worst case the hash table degenerates into a linked list, and lookup degrades from O(1) to O(n)
    • Replacing the linked list with a more efficient dynamic data structure, such as a red-black tree, keeps lookup at O(log n)
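
The chaining scheme above can be sketched in a few lines; this is a simplified toy (hypothetical `ChainingDemo` class, fixed 16 buckets, no resizing), not HashMap's real implementation:

```java
import java.util.LinkedList;

// Minimal chaining ("zipper method") hash table sketch; names are hypothetical.
public class ChainingDemo {
    static class Entry {
        final String key;
        int value;
        Entry(String key, int value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry>[] buckets = new LinkedList[16];

    private int slot(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    // Insert: compute the slot, then update in (or append to) that slot's list.
    public void put(String key, int value) {
        int i = slot(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (Entry e : buckets[i]) {
            if (e.key.equals(key)) { e.value = value; return; } // update existing key
        }
        buckets[i].add(new Entry(key, value));
    }

    // Lookup: compute the slot, then traverse that slot's list.
    public Integer get(String key) {
        int i = slot(key);
        if (buckets[i] == null) return null;
        for (Entry e : buckets[i]) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ChainingDemo map = new ChainingDemo();
        map.put("a", 1);
        map.put("a", 2); // same key: value is overwritten
        System.out.println(map.get("a")); // 2
        System.out.println(map.get("b")); // null
    }
}
```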

Using red-black trees also effectively mitigates hash-collision denial-of-service (DoS) attacks, in which an attacker deliberately sends many keys that hash to the same bucket.

1. Introduction

HashMap is an important implementation class of Map. It is a hash table whose stored content is key-value (key => value) mappings. HashMap is not thread-safe. HashMap allows storing null keys and null values, and keys are unique.

Before JDK 1.8, HashMap's underlying data structure was purely array + linked list. Arrays offer fast random reads but slow insertion and deletion, while linked lists offer the reverse, and HashMap combines the two; since it does not use synchronization locks, its single-threaded performance is good. The array is the main body of HashMap, and linked lists are introduced to resolve hash collisions, using the chaining (zipper) method described above.
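
The points above (null key, null values, unique keys) can be verified directly with java.util.HashMap:

```java
import java.util.HashMap;

public class NullKeyDemo {
    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<>();
        map.put(null, "null key is allowed"); // the single null key is permitted
        map.put("k", null);                   // null values are allowed too
        map.put("k", "v");                    // keys are unique: this overwrites the null value
        System.out.println(map.get(null));    // null key is allowed
        System.out.println(map.get("k"));     // v
        System.out.println(map.size());       // 2
    }
}
```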

2. Source code analysis

1. put method

1.1. Common attributes

Resize threshold = array capacity * load factor

    // default initial capacity
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16
    // default load factor
    static final float DEFAULT_LOAD_FACTOR = 0.75f;
    // the array that stores the data
    transient Node<K,V>[] table;
    // the number of key-value pairs stored (not the capacity)
    transient int size;
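
A quick sketch of the relationship above (resize threshold = capacity × load factor), reusing the default constants; `ThresholdDemo` and `threshold` are hypothetical names:

```java
public class ThresholdDemo {
    static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // 16
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

    // threshold = capacity * load factor: a resize happens once size exceeds this
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        System.out.println(threshold(DEFAULT_INITIAL_CAPACITY, DEFAULT_LOAD_FACTOR)); // 12
        System.out.println(threshold(32, DEFAULT_LOAD_FACTOR));                       // 24
    }
}
```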
    

1.2. Constructor

    // default no-argument constructor
    public HashMap() {
        this.loadFactor = DEFAULT_LOAD_FACTOR; // set the load factor to the default 0.75
    }
  • HashMap creates its array lazily; the array is not initialized when the object is created
  • The no-argument constructor only sets the default load factor

1.3. put method

  • flow chart
  • Specific source code
    
    public V put(K key, V value) {
        return putVal(hash(key), key, value, false, true);
    }

    /**
     * Computes the hash value for a key
     */
    static final int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    /**
     * Performs the actual put operation
     */
    final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        // check whether the array has been initialized (it is initialized on the first put)
        if ((tab = table) == null || (n = tab.length) == 0)
            // if not initialized, call resize() to initialize it
            n = (tab = resize()).length;
        // compute the key's array index via the & operator and check whether that slot already holds data
        if ((p = tab[i = (n - 1) & hash]) == null)
            // if the slot is empty, place the new node there directly
            tab[i] = newNode(hash, key, value, null);
        else {
            // the slot already holds data
            Node<K,V> e; K k;
            // check whether the first node in the slot has the same key as the one being put
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
                // if so, remember it so that its value is overwritten below
                e = p;
            // check whether the bin is a red-black tree
            else if (p instanceof TreeNode)
                // if so, perform the red-black tree insertion
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            // otherwise the bin is a linked list
            else {
                // traverse the linked list
                for (int binCount = 0; ; ++binCount) {
                    // if the next node is null, we have reached the tail of the list
                    if ((e = p.next) == null) {
                        // append the new node at the tail
                        p.next = newNode(hash, key, value, null);
                        // after inserting, check whether the list length has reached 8
                        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                            // if so, convert the list to a red-black tree
                            treeifyBin(tab, hash);
                        break; // exit the loop
                    }
                    // if a node with the same key is found in the list, break to update it
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    // move to the next node
                    p = e;
                }
            }
            // e is non-null when an existing mapping for the key was found
            if (e != null) { // existing mapping for key
                // this is an update: fetch the old value
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    // assign the new value to the node
                    e.value = value;
                afterNodeAccess(e);
                // return the old value
                return oldValue;
            }
        }
        // record the structural modification
        ++modCount;
        // check whether the number of entries now exceeds the resize threshold
        if (++size > threshold)
            // if so, resize
            resize();
        afterNodeInsertion(evict);
        return null;
    }
  • specific process
  1. Check whether the key-value pair array table is null; if so, execute resize() to initialize it
  2. Compute the hash value from the key to get the array index
  3. Check whether table[i] at that index is null
  4. If table[i] == null, directly create a new node and add it; otherwise:
    i. Check whether the first element of table[i] has the same key; if so, directly overwrite the value
    ii. Check whether table[i] is a TreeNode, i.e. whether table[i] is a red-black tree; if so, insert the key-value pair directly into the tree
    iii. Otherwise traverse table[i] and insert the data at the tail of the linked list, then check whether the list length is greater than 8; if so, convert the list to a red-black tree. If the key is found to already exist during the traversal, overwrite its value.
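
The flow above can be observed from the caller's side: put returns the previous value for the key (or null if the key was absent) before overwriting it, matching the update branch in putVal:

```java
import java.util.HashMap;

public class PutDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        System.out.println(map.put("a", 1)); // null: the key was absent
        System.out.println(map.put("a", 2)); // 1: the old value is returned, then overwritten
        System.out.println(map.get("a"));    // 2
    }
}
```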

1.4 resize method (capacity expansion)

  • flow chart
  • Specific source code
    final Node<K,V>[] resize() {
        Node<K,V>[] oldTab = table;
        // if the current array is null, set oldCap (the old capacity) to 0
        int oldCap = (oldTab == null) ? 0 : oldTab.length;
        // the old resize threshold
        int oldThr = threshold;
        int newCap, newThr = 0;
        // capacity > 0 means the array has already been initialized
        if (oldCap > 0) {
            // check whether the current capacity has reached the maximum capacity
            if (oldCap >= MAXIMUM_CAPACITY) {
                // if so, set the threshold to Integer.MAX_VALUE and return directly
                threshold = Integer.MAX_VALUE;
                return oldTab;
            }
            // otherwise double the capacity: oldCap << 1 == oldCap * 2,
            // and check that the old capacity is at least 16
            else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                     oldCap >= DEFAULT_INITIAL_CAPACITY)
                newThr = oldThr << 1; // double threshold, equivalent to oldThr * 2
        }
        else if (oldThr > 0) // initial capacity was placed in threshold
            newCap = oldThr;
        // first initialization: use the default capacity and threshold
        else {               // zero initial threshold signifies using defaults
            newCap = DEFAULT_INITIAL_CAPACITY;
            newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
        }
        // newThr is still 0 when the initial capacity was below 16
        if (newThr == 0) {
            // compute the threshold
            float ft = (float)newCap * loadFactor;
            newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                      (int)ft : Integer.MAX_VALUE);
        }
        // store the computed threshold
        threshold = newThr;
        @SuppressWarnings({"rawtypes","unchecked"})
        // create the new array with the capacity computed above
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
        table = newTab;
        // if oldTab is not null, this is a real resize rather than an initialization
        if (oldTab != null) {
            // traverse the old array
            for (int j = 0; j < oldCap; ++j) {
                Node<K,V> e;
                // if the bucket at index j is not null, assign it to e
                if ((e = oldTab[j]) != null) {
                    // clear the old slot
                    oldTab[j] = null;
                    // check whether there is a next node
                    if (e.next == null)
                        // if not, compute the index in the new array and place the node there
                        newTab[e.hash & (newCap - 1)] = e;
                    // there is a next node: check whether the bin has been treeified
                    else if (e instanceof TreeNode)
                        // perform the red-black tree split
                        ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                    // there is a next node and the bin is still a linked list
                    else { // preserve order
                        Node<K,V> loHead = null, loTail = null;  // low-position list
                        Node<K,V> hiHead = null, hiTail = null;  // high-position list
                        Node<K,V> next;
                        // traverse the list
                        do {
                            // grab the next node
                            next = e.next;
                            // the & operation result is 0: the node stays at the low position
                            if ((e.hash & oldCap) == 0) {
                                // if the low tail is null, e becomes the head of the low list
                                if (loTail == null)
                                    loHead = e;
                                // the low tail is not null
                                else
                                    // append e after the tail
                                    loTail.next = e;
                                loTail = e;
                            }
                            else {
                                if (hiTail == null)
                                    hiHead = e;
                                else
                                    hiTail.next = e;
                                hiTail = e;
                            }
                        } while ((e = next) != null);
                        // if the low list has data, it stays at the original index
                        if (loTail != null) {
                            // terminate the list
                            loTail.next = null;
                            // put the low head into the new array at the original index
                            newTab[j] = loHead;
                        }
                        // if the high list has data
                        if (hiTail != null) {
                            // terminate the list
                            hiTail.next = null;
                            // put the high head at (original index + old capacity) in the new array
                            newTab[j + oldCap] = hiHead;
                        }
                    }
                }
            }
        }
        return newTab;
    }
  • Execution principle
  1. resize is called both for initialization and for expansion. On the first put, the array is initialized with length 16; after that, an expansion is triggered whenever the number of entries exceeds the resize threshold (array length * 0.75).
  2. Each expansion doubles the capacity;
  3. After expanding, a new array is created and the data in the old array must be moved to the new array
    i. For a bucket holding a single node (no chain), e.hash & (newCap - 1) directly computes the index in the new array
    ii. If the bucket is a red-black tree, the tree is split between the two target buckets
    iii. If the bucket is a linked list, the list is traversed and possibly split by checking whether (e.hash & oldCap) is 0: each element either stays at the original index or moves to (original index + old capacity).

How is an element's position in the array re-determined during a resize? As we saw, it is decided by if ((e.hash & oldCap) == 0):

hash   (97)  = 0110 0001
oldCap (16)  = 0001 0000
--------------------------
result       = 0000 0000
# e.hash & oldCap == 0: the element stays at its pre-resize position

hash   (17)  = 0001 0001
oldCap (16)  = 0001 0000
--------------------------
result       = 0001 0000
# e.hash & oldCap != 0: the new position is the pre-resize position + the old capacity
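
The two calculations above can be checked in a few lines of Java (97 and 17 are the example hashes, 16 the old capacity; `SplitDemo` and `newIndex` are hypothetical names):

```java
public class SplitDemo {
    // During a resize, (hash & oldCap) == 0 means the node keeps its index;
    // otherwise it moves to (old index + oldCap).
    static int newIndex(int hash, int oldCap) {
        int oldIndex = hash & (oldCap - 1);
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 16;
        System.out.println(97 & oldCap);          // 0  -> stays at the old index
        System.out.println(newIndex(97, oldCap)); // 1  == 97 & 15
        System.out.println(17 & oldCap);          // 16 -> moves
        System.out.println(newIndex(17, oldCap)); // 17 == (17 & 15) + 16
        // sanity check: the result equals the index in the doubled array, hash & (2*oldCap - 1)
        System.out.println(newIndex(97, oldCap) == (97 & (2 * oldCap - 1))); // true
    }
}
```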

2. get method

public V get(Object key) {
    // define a Node
    Node<K,V> e;
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        // the first node in the bucket matches
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        // the bucket holds more than one node
        if ((e = first.next) != null) {
            // check whether it is a TreeNode
            if (first instanceof TreeNode)
                // look the node up via the tree
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            do {
                // look the node up by traversing the linked list
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    // return null when not found
    return null;
}
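
From the caller's side, note that get returns null both when the key is absent and when the key maps to a null value, so containsKey is needed to distinguish the two cases:

```java
import java.util.HashMap;

public class GetDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("present", 1);
        map.put("nullValue", null);
        System.out.println(map.get("present"));           // 1
        System.out.println(map.get("absent"));            // null
        System.out.println(map.get("nullValue"));         // null, same as an absent key
        System.out.println(map.containsKey("nullValue")); // true
        System.out.println(map.containsKey("absent"));    // false
    }
}
```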

Common problems

1. How is the index calculated? hashCode() is available, so why use the hash() method? Why must the array capacity be a power of 2?

  • First compute the key's hashCode(), then call the hash() method, which perturbs it with an XOR to produce a secondary hash; finally obtain the index via the AND operation (n - 1) & hash. (For a power-of-two capacity n, this is equivalent to the modulo operation hash % n.)

  • The secondary hash() mixes the high-order bits into the low-order bits of the binary value, making the hash distribution more uniform and reducing the probability of collisions. The formula is: (h = key.hashCode()) ^ (h >>> 16)

  • For index calculation, if the capacity is a power of 2, the bitwise AND can replace the modulo, which is more efficient; and during a resize, elements with hash & oldCap == 0 stay at their original position, while the rest move to a new position: new position = old position + oldCap

    • Calculation: hash & oldCap — AND the secondary hash with the old capacity. If the result is 0 the position does not change; otherwise the element moves to a new position.
    • New position: old capacity + old index = new position
  • Using powers of 2 is mainly to enable these optimizations and make the index distribution more uniform
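
A quick check (for non-negative hashes) that with a power-of-two capacity n, (n - 1) & hash equals hash % n, and that the trick breaks for other capacities:

```java
public class PowerOfTwoDemo {
    public static void main(String[] args) {
        int n = 16; // power-of-two capacity
        for (int hash : new int[]{0, 1, 15, 16, 17, 97, 12345}) {
            // low-bits mask vs. modulo: identical for non-negative hash and power-of-two n
            System.out.println(((n - 1) & hash) == (hash % n)); // true each time
        }
        // with a non-power-of-two capacity the equivalence does not hold:
        System.out.println(((10 - 1) & 12) == (12 % 10)); // 9 & 12 = 8, 12 % 10 = 2 -> false
    }
}
```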

2. What are the differences between 1.7 and 1.8 in HashMap's put process?

  1. HashMap creates its array lazily; the array is created only on first use
  2. Compute the index (bucket subscript)
    1. First get the key's hash value, then compute a secondary hash via the hash() method. The formula is: (h = key.hashCode()) ^ (h >>> 16) — shift the hash right by 16 bits (unsigned), then XOR with the original hash. This mixes the high bits into the low 16 bits that actually participate in the index calculation, effectively perturbing the hash and reducing the probability of collisions.
    2. Then combine the secondary hash with the array capacity to take the remainder; the result is the final bucket index. The calculation is (n - 1) & hash: AND the hash with (array length - 1), which for a power-of-two length n is equivalent to the modulo operation hash % n. Using & is mainly to improve the efficiency of the operation (1.7 does not have this optimization)
  3. If the bucket index is unoccupied, create a Node and return
  4. If the bucket index is already occupied, the new entry is compared with each node one by one to see whether the hash values and equals() match. If they match, the key is the same, so the value is overwritten; if no match is found, the entry is added.
    1. If the node is a TreeNode, the red-black tree add-or-update logic is used
    2. If it is an ordinary Node, the linked-list add-or-update logic is used. If the list length exceeds the treeify threshold of 8, the list is treeified (provided the array length has reached 64)
  5. Before returning, it also checks whether the number of entries exceeds the resize threshold (array length * load factor), and resizes if so
  6. Differences:
    1. When inserting into a linked list, 1.7 uses head insertion (insert at the head of the list), while 1.8 uses tail insertion (insert at the tail)
    2. 1.7 expands when size is greater than or equal to the threshold and the target bucket is already occupied (if the bucket is free it does not expand and simply places the entry at the computed index), while 1.8 expands whenever size exceeds the threshold
    3. 1.8 optimizes the recomputation of node positions during a resize (a node either stays put or moves by oldCap)

Origin blog.csdn.net/weixin_52315708/article/details/131918897