Implementation of the hash algorithm and HashMap in Java

In an interview I was asked how the hash algorithm is implemented in Java, and I was also asked to write the code by hand.

I am recording it here for later reference.

/**
 * Hash algorithm for strings in JAVA
*/
public static int java(String str){
	int h = 0;
	for (char c : str.toCharArray()) h = 31 * h + c;
	return h;
}
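As a quick check, the hand-written loop above can be compared with the built-in String.hashCode(), which is documented to use the same base-31 polynomial. This is only a small sketch; the class name StringHashCheck and the method name hash31 are made up for the example.

```java
class StringHashCheck {
    // Same loop as above: h = 31 * h + c for each char; int overflow simply wraps around.
    static int hash31(String str) {
        int h = 0;
        for (char c : str.toCharArray()) {
            h = 31 * h + c;
        }
        return h;
    }

    public static void main(String[] args) {
        String[] samples = {"", "a", "ab", "HashMap", "Language"};
        for (String s : samples) {
            // Both numbers on a line should be identical.
            System.out.println("\"" + s + "\" -> " + hash31(s) + " / " + s.hashCode());
        }
    }
}
```

For example, "ab" gives 31 * 97 + 98 = 3105 with both methods.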

 

 

 Reprinted from: http://alex09.iteye.com/blog/539545/

 

 

HashMap and HashSet are two important members of the Java Collections Framework. HashMap is a commonly used implementation of the Map interface, and HashSet is a commonly used implementation of the Set interface. Although the two implement different interface specifications, their underlying hash storage mechanism is exactly the same; HashSet is in fact implemented on top of HashMap.

Analyze the Hash storage mechanism through the source code of HashMap and HashSet


In fact, HashSet and HashMap have a lot in common. For a HashSet, the system uses the Hash algorithm to determine where each element is stored, which makes storing and retrieving elements fast. For a HashMap, the system treats each key-value pair as a whole and always calculates its storage location with the Hash algorithm, so that the key-value pairs of the Map can also be stored and retrieved quickly.

Before introducing the collection storage, it is necessary to point out that although the collection claims to store Java objects, it does not actually put the Java objects into the Set collection, but only keeps the references of these objects in the Set collection. That is: a Java collection is actually a collection of reference variables that point to actual Java objects. 

Collections and references

A collection is just like an array of a reference type. When we put Java objects into an array, we do not really put the objects themselves into the array; we only put references to the objects into it. Each array element is a reference variable.
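The point about references can be checked with a few lines of code. The sketch below is only an illustration (the class name ReferenceDemo and the method name sameObject are made up): it puts one StringBuilder into both an array and a HashSet, changes the object afterwards, and confirms that both containers still refer to that very same object.

```java
import java.util.HashSet;
import java.util.Set;

class ReferenceDemo {
    // Returns true if the array and the set both hold the same object as 'sb',
    // and a change made after insertion is visible through the array.
    static boolean sameObject(StringBuilder sb) {
        StringBuilder[] array = { sb };
        Set<StringBuilder> set = new HashSet<>();
        set.add(sb);

        // StringBuilder uses identity-based hashCode(), so mutating it inside the set is safe.
        sb.append("!");

        return array[0] == sb
                && set.iterator().next() == sb
                && array[0].toString().endsWith("!");
    }

    public static void main(String[] args) {
        System.out.println(sameObject(new StringBuilder("abc")));   // true
    }
}
```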

Storage implementation of HashMap


When the program tries to put several key-value pairs into a HashMap, take the following code snippet as an example:

 

HashMap<String, Double> map = new HashMap<String, Double>();
map.put("Language", 80.0);
map.put("Math", 89.0);
map.put("English", 78.2);
 

 

HashMap uses a so-called "Hash algorithm" to decide where to store each element. 

When the program executes map.put("Language", 80.0);, the system calls the hashCode() method of "Language" to obtain its hashCode value. Every Java object has a hashCode() method, through which its hashCode value can be obtained. Once it has the hashCode value of the key, the system uses that value to determine where the element is stored.

 


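We can look at the source code of the put(K key, V value) method of the HashMap class (the listing corresponds to the older, JDK 6 era implementation; later JDK releases reworked these internals):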

 

 

public V put(K key, V value)
{
    // If the key is null, let putForNullKey handle it
    if (key == null)
        return putForNullKey(value);
    // Calculate the hash value from the hashCode of the key
    int hash = hash(key.hashCode());
    // Find the index in the table that corresponds to this hash value
    int i = indexFor(hash, table.length);
    // If the Entry at index i is not null, walk the chain through each e.next
    for (Entry<K,V> e = table[i]; e != null; e = e.next)
    {
        Object k;
        // An existing key equals the key being put (same hash value
        // and equals() returns true): overwrite its value
        if (e.hash == hash && ((k = e.key) == key
            || key.equals(k)))
        {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }
    // No equal key was found at index i (empty bucket or no match in the chain)
    modCount++;
    // Add the key and value at index i
    addEntry(hash, key, value, i);
    return null;
}
 

The program above uses an important internal interface, Map.Entry; each Map.Entry is in fact a key-value pair. As the code shows, when the system decides where to store a key-value pair in the HashMap, it does not consider the value in the Entry at all; it calculates the storage location of each Entry from the key alone. This also supports the earlier conclusion: we can regard the value in a Map as an appendage of the key. Once the system has determined where the key is stored, the value is simply stored with it.

The put method above relies on hash(), a method that calculates a hash code from the return value of hashCode(). It is a purely mathematical calculation, as follows:
  

static int hash(int h)   
{   
    h ^= (h >>> 20) ^ (h >>> 12);   
    return h ^ (h >>> 7) ^ (h >>> 4);   
}   
 

 



For any given object, as long as its hashCode() returns the same value, the hash code value calculated by the program calling the hash(int h) method is always the same. Next, the program calls the indexFor(int h, int length) method to calculate the index of the table array where the object should be stored. The code for the indexFor(int h, int length) method is as follows:

 

static int indexFor(int h, int length) {   
    return h & (length-1);   
}
 

 


This method is very clever: it always obtains the storage location of the object through h & (table.length - 1), and the length of the underlying array of a HashMap is always a power of 2 (2 to the nth power), as the later discussion of the HashMap constructors shows.

When length is always a power of 2, h & (length - 1) is a very clever design: suppose h=5, length=16, then h & (length - 1) gives 5; if h=6, length=16, then h & (length - 1) gives 6 ... if h=15, length=16, then h & (length - 1) gives 15; but when h=16, length=16, then h & (length - 1) gives 0; when h=17, length=16, then h & (length - 1) gives 1... This ensures that the calculated index always falls within the bounds of the table array.
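The arithmetic above can be reproduced with a small program. The sketch below copies hash() and indexFor() from the listings in this article into a made-up class, BucketIndexDemo. It also shows two things the text does not spell out: the mask keeps the index in range even for a negative h, and hash() mixes higher bits into the lower bits, so two hashCode values that differ only in their high bits need not land in the same bucket.

```java
class BucketIndexDemo {
    // Copied from the hash(int h) listing above.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Copied from the indexFor listing above; length must be a power of 2.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        for (int h : new int[] {5, 6, 15, 16, 17}) {
            System.out.println("h=" + h + " -> index " + indexFor(h, length));
        }

        // A negative h still yields a valid index, unlike h % length.
        System.out.println(indexFor(-1, length));   // 15
        System.out.println(-1 % length);            // -1

        // 0x10000 and 0x20000 differ only above the low 4 bits.
        System.out.println(indexFor(0x10000, length) + " " + indexFor(0x20000, length));             // 0 0
        System.out.println(indexFor(hash(0x10000), length) + " " + indexFor(hash(0x20000), length)); // 1 2
    }
}
```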

According to the source code of the put method above, when the program tries to put a key-value pair into the HashMap, it first determines the storage location of the Entry from the hashCode() return value of the key: if the keys of two Entry objects return the same hashCode() value, they are stored in the same location. If the keys of the two Entry objects then compare as equal through equals(), the value of the newly added Entry overwrites the value of the original Entry in the collection, but the key is not replaced. If the keys compare as unequal through equals(), the newly added Entry forms an Entry chain with the original Entry, and the newly added Entry sits at the head of the chain; see the description of the addEntry() method for details.

When a key-value pair is added to the HashMap, the storage location of the pair (that is, of the Entry object) is determined by the hashCode() return value of its key. When the keys of two Entry objects return the same hashCode() value, the keys are compared with equals() to decide whether to overwrite (equals() returns true) or to form an Entry chain (equals() returns false).
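The overwrite-or-chain rule can be observed from the outside with java.util.HashMap. In the sketch below, Key is a hypothetical class written only for this example: every instance returns the same hashCode(), so all keys end up in the same bucket and only equals() decides what happens.

```java
import java.util.HashMap;
import java.util.Map;

class CollisionDemo {
    // Hypothetical key type: identical hashCode() for every instance forces collisions.
    static final class Key {
        final String name;

        Key(String name) {
            this.name = name;
        }

        @Override
        public int hashCode() {
            return 42;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(name);
        }
    }

    public static void main(String[] args) {
        Map<Key, Double> map = new HashMap<>();
        map.put(new Key("a"), 1.0);
        Double old = map.put(new Key("a"), 2.0);   // equals() is true: the value is overwritten
        map.put(new Key("b"), 3.0);                // equals() is false: stored next to the first entry

        System.out.println(old);                     // 1.0
        System.out.println(map.size());              // 2
        System.out.println(map.get(new Key("a")));   // 2.0
        System.out.println(map.get(new Key("b")));   // 3.0
    }
}
```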

The program above also calls addEntry(hash, key, value, i);. addEntry is a package-access method of HashMap that is used only to add a key-value pair. Here is the code for the method:

void addEntry(int hash, K key, V value, int bucketIndex) {   
    // Get the Entry at the specified bucketIndex   
    Entry<K,V> e = table[bucketIndex];  // ①  
    // Put the newly created Entry at the bucketIndex index, and let the new Entry point to the original Entry   
    table[bucketIndex] = new Entry<K,V>(hash, key, value, e);   
    // if the number of key-value pairs in the Map exceeds the limit  
    if (size++ >= threshold)   
        // Extend the length of the table object to 2 times.  
        resize(2 * table.length);    // ②  
}   
 

 



The code of the above method is very simple, but it contains a very elegant design: the system always places the newly added Entry object at the bucketIndex index of the table array. If there is already an Entry object at that index, the newly added Entry points to the original Entry (forming an Entry chain). If there is no Entry object at that index, that is, the variable e at code ① is null, the newly added Entry points to null and no Entry chain is formed.
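Head insertion is easy to see in isolation. The following sketch is not the JDK code: Node is a simplified stand-in for the Entry class, and ChainDemo, addFirst and keys are names invented for the example. It models a single bucket and shows that each new node becomes the head and points to the previous head.

```java
class ChainDemo {
    // Simplified stand-in for HashMap's Entry: a key, a value and a link to the next node.
    static final class Node {
        final String key;
        final double value;
        final Node next;

        Node(String key, double value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Head insertion as in addEntry: the new node points to the old head (which may be null).
    static Node addFirst(Node head, String key, double value) {
        return new Node(key, value, head);
    }

    // Lists the keys of one bucket from head to tail.
    static String keys(Node head) {
        StringBuilder sb = new StringBuilder();
        for (Node e = head; e != null; e = e.next) {
            if (sb.length() > 0) {
                sb.append(" -> ");
            }
            sb.append(e.key);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Node bucket = null;                        // empty bucket: no chain yet
        bucket = addFirst(bucket, "first", 1.0);
        bucket = addFirst(bucket, "second", 2.0);
        bucket = addFirst(bucket, "third", 3.0);
        System.out.println(keys(bucket));          // third -> second -> first
    }
}
```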

JDK source code 

You can find a src.zip archive in the JDK installation directory; it contains the source files of the Java core class library. Any interested reader can open this archive at any time to read the source code of the Java class library, which is very helpful for improving one's programming ability. Note that the source code in src.zip does not contain comments like those shown above; those comments were added by the author.

The performance options of the Hash algorithm 

As can be seen from the code above, when a bucket stores an Entry chain, the most recently added Entry always sits in the bucket itself (at the head of the chain), and the Entry that was placed in the bucket earliest sits at the very end of the chain.

There are also two variables in the above program: 

    * size: This variable holds the number of key-value pairs contained in the HashMap. 
    * threshold: This variable contains the limit of key-value pairs that the HashMap can hold, and its value is equal to the capacity of the HashMap multiplied by the load factor. 

As can be seen from the code ② in the above program, when size++ >= threshold, HashMap will automatically call the resize method to expand the capacity of HashMap. Each time it is expanded, the capacity of the HashMap is doubled. 
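To make the interplay of size, threshold and capacity concrete, here is a small simulation of the "size++ >= threshold" rule from the addEntry listing. It is a sketch under simplifying assumptions: it only counts entries, it ignores the maximum capacity limit, and the names ResizeDemo and capacityAfter are invented for the example.

```java
class ResizeDemo {
    // Returns the table capacity after adding 'entries' distinct keys,
    // doubling the capacity whenever the addEntry rule triggers a resize.
    static int capacityAfter(int entries, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        int threshold = (int) (capacity * loadFactor);
        int size = 0;
        for (int i = 0; i < entries; i++) {
            if (size++ >= threshold) {
                capacity *= 2;                               // resize(2 * table.length)
                threshold = (int) (capacity * loadFactor);   // new limit for the larger table
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        // Capacity 16 and load factor 0.75 give a threshold of 12.
        System.out.println(capacityAfter(12, 16, 0.75f));   // 16
        System.out.println(capacityAfter(13, 16, 0.75f));   // 32
        System.out.println(capacityAfter(25, 16, 0.75f));   // 64
    }
}
```

With these rules the 13th entry triggers the first doubling (threshold 12), and the 25th entry triggers the second one (threshold 24).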

The table used in the above program is actually an ordinary array, each array has a fixed length, and the length of this array is the capacity of the HashMap. HashMap includes the following constructors: 

    * HashMap(): Construct a HashMap with an initial capacity of 16 and a load factor of 0.75. 
    * HashMap(int initialCapacity): Build a HashMap with an initial capacity of initialCapacity and a load factor of 0.75. 
    * HashMap(int initialCapacity, float loadFactor): Create a HashMap with the specified initial capacity and specified load factor. 

When a HashMap is created, the system automatically creates a table array to hold the Entry objects of the HashMap. The constructor contains a concise piece of code for this: it finds the smallest power of 2 that is not less than initialCapacity and uses it as the actual capacity of the HashMap (stored in the capacity variable). For example, given an initialCapacity of 10, the actual capacity of the HashMap is 16.
The constructor then allocates the table with that length: in essence, table is simply an array whose length is capacity.
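The rounding step described above can be sketched as a shift loop. This is an illustration of the idea rather than a quotation of the constructor; the names CapacityDemo and roundUpToPowerOfTwo are invented, and the cap at 1 << 30 is included so that the loop cannot overflow for very large arguments.

```java
class CapacityDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;   // largest power of 2 that fits in a positive int

    // Returns the smallest power of 2 that is not less than initialCapacity.
    static int roundUpToPowerOfTwo(int initialCapacity) {
        if (initialCapacity > MAXIMUM_CAPACITY) {
            initialCapacity = MAXIMUM_CAPACITY;
        }
        int capacity = 1;
        while (capacity < initialCapacity) {
            capacity <<= 1;   // 1, 2, 4, 8, ...
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(10));   // 16
        System.out.println(roundUpToPowerOfTwo(16));   // 16
        System.out.println(roundUpToPowerOfTwo(17));   // 32
    }
}
```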

HashMap and its subclasses use the Hash algorithm to determine where elements are stored in the collection. When the system initializes a HashMap, it creates an Entry array of length capacity. Each position in this array where an element can be stored is called a "bucket". Every bucket has its own index, and the system can use that index to quickly access the elements stored in the bucket.

At any time, each "bucket" of a HashMap stores only one element (that is, one Entry). Because an Entry object can contain a reference variable (the last parameter of the Entry constructor) that points to the next Entry, it can happen that a bucket holds only one Entry but that Entry points to another Entry, which forms an Entry chain. As shown in Figure 1:

 

Figure 1. HashMap storage schematic 

HashMap reading implementation 

When each bucket of the HashMap holds only a single Entry, that is, when no Entry chain has been formed through the next reference, the HashMap performs best: to retrieve the value for a key, the system only needs to calculate the hashCode() return value of the key, find the index in the table array from that value, take out the Entry at the index, and return the value corresponding to the key. Look at the code of the get(Object key) method of the HashMap class:

public V get(Object key)   
{   
 // If the key is null, call getForNullKey to retrieve the corresponding value   
 if (key == null)   
     return getForNullKey();   
 // Calculate the hash code of the key based on its hashCode value  
 int hash = hash(key.hashCode());   
 // Directly fetch the value at the specified index in the table array,  
 for (Entry<K,V> e = table[indexFor(hash, table.length)];   
     e != null;   
     // Move on to the next Entry in the Entry chain
     e = e.next)         // ①  
 {   
     Object k;   
     // if the key of the Entry is the same as the searched key  
     if (e.hash == hash && ((k = e.key) == key   
         || key.equals(k)))   
         return e.value;   
 }   
 return null;   
}   

 



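As the code above shows, if each bucket of the HashMap holds only one Entry, the HashMap can quickly fetch the Entry in that bucket by its index. When a "hash collision" occurs, a single bucket stores not one Entry but an Entry chain, and the system has to traverse the Entry objects in order until it finds the one it is looking for. If that Entry happens to be at the very end of the chain (it was the earliest one placed in the bucket), the system must loop all the way to the end to find it.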
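The cost of walking a chain can be illustrated by counting the nodes visited during a lookup. The sketch below is a simplified model of a single bucket, not the JDK code; ChainLookupDemo, Node, chainOf and stepsToFind are names invented for the example, and the chain is built by head insertion as in addEntry.

```java
class ChainLookupDemo {
    // Simplified stand-in for an Entry: only the key and the link to the next node.
    static final class Node {
        final String key;
        final Node next;

        Node(String key, Node next) {
            this.key = key;
            this.next = next;
        }
    }

    // Builds one bucket by head insertion, so the last key added ends up first in the chain.
    static Node chainOf(String... keysInInsertionOrder) {
        Node head = null;
        for (String key : keysInInsertionOrder) {
            head = new Node(key, head);
        }
        return head;
    }

    // Returns how many nodes are visited until 'key' is found, or -1 if it is absent.
    static int stepsToFind(Node head, String key) {
        int steps = 0;
        for (Node e = head; e != null; e = e.next) {
            steps++;
            if (e.key.equals(key)) {
                return steps;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Node bucket = chainOf("first", "second", "third");
        System.out.println(stepsToFind(bucket, "third"));     // 1: newest entry, at the head
        System.out.println(stepsToFind(bucket, "first"));     // 3: oldest entry, at the end
        System.out.println(stepsToFind(bucket, "missing"));   // -1: whole chain walked, not found
    }
}
```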

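In short, HashMap treats each key-value pair as a whole at the bottom layer, and that whole is an Entry object. HashMap uses an Entry[] array to hold all key-value pairs. When an Entry needs to be stored, the Hash algorithm decides its storage location; when an Entry needs to be retrieved, the Hash algorithm finds its storage location and the Entry is fetched directly. So the reason HashMap can store and retrieve its Entry objects quickly is just what our mothers taught us as children: put different things in different places, so that you can find them quickly when you need them.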

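When a HashMap is created, it has a default load factor of 0.75. This is a compromise between time and space costs: increasing the load factor reduces the memory used by the Hash table (that is, the Entry array) but increases the time cost of lookups, and lookup is the most frequent operation (both the get() and put() methods of HashMap perform a lookup); decreasing the load factor improves lookup performance but increases the memory used by the Hash table.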

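With this knowledge, we can adjust the load factor as needed when creating a HashMap: if the program cares more about space and memory is tight, the load factor can be increased; if the program cares more about time and memory is plentiful, the load factor can be decreased. In most cases, programmers do not need to change the load factor.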

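If you know from the start that the HashMap will hold many key-value pairs, you can give it a larger initial capacity when creating it. If the number of Entry objects in the HashMap never exceeds the threshold (capacity * load factor), the HashMap never needs to call resize() to reallocate the table array, which ensures better performance. Of course, setting the initial capacity too high from the start may waste space (the system has to create an Entry array of length capacity), so the initial capacity also needs to be chosen with care.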
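One way to apply this advice is to derive the initial capacity from the expected number of entries, so that capacity * load factor is not below that number. The helper below is a sketch, not a JDK API; PresizeDemo and initialCapacityFor are names invented for the example. It passes the result to the HashMap(int initialCapacity, float loadFactor) constructor listed earlier.

```java
import java.util.HashMap;
import java.util.Map;

class PresizeDemo {
    // Smallest initial capacity c such that c * loadFactor >= expectedEntries.
    // HashMap will still round c up to a power of 2 internally.
    static int initialCapacityFor(int expectedEntries, float loadFactor) {
        return (int) Math.ceil(expectedEntries / (double) loadFactor);
    }

    public static void main(String[] args) {
        int expected = 1000;
        float loadFactor = 0.75f;
        int capacity = initialCapacityFor(expected, loadFactor);
        System.out.println(capacity);   // 1334

        Map<String, Integer> map = new HashMap<>(capacity, loadFactor);
        for (int i = 0; i < expected; i++) {
            map.put("key" + i, i);
        }
        System.out.println(map.size());   // 1000
    }
}
```

The trade-off mentioned above still applies: a request for 1334 is rounded up to a table of 2048 slots, which costs memory even if the map never fills up.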
