Interview: HashMap and Hashtable

HashMap is implemented on top of a hash table. Each element is a key-value pair, and conflicts are resolved internally with singly linked lists. When the number of elements exceeds the threshold, the table grows automatically.

      HashMap is not thread-safe and should only be used in single-threaded environments. In multi-threaded environments, ConcurrentHashMap from the java.util.concurrent package can be used instead.

      HashMap implements the Serializable interface, so it supports serialization, and implements the Cloneable interface, so it can be cloned.

      The process by which HashMap stores data is as follows:

      HashMap internally maintains an Entry array to store data and uses linked lists to resolve conflicts: each array slot holds the head of a singly linked list of Entry nodes. When a key-value pair is to be added, the hash(key) method first computes a hash value, and indexFor(hash, length) then finds the storage slot. Because the table length is always a power of two, indexFor simply computes hash & (length - 1), which is equivalent to hash % length and guarantees a valid slot for every key-value pair. When two keys map to the same slot, the new key-value pair is inserted at the head of that slot's linked list.

      Both keys and values in HashMap are allowed to be null. Key-value pairs with a null key are always placed in the linked list headed at table[0].

      Once you understand how the data is stored, how it is read follows naturally.

      The storage structure of HashMap is as shown in the figure below:

 In the figure, the purple part represents the hash table, also called the hash array. Each element of the array is the head node of a singly linked list, and the linked lists are used to resolve conflicts: if different keys are mapped to the same position in the array, they are placed in the same singly linked list.

      The Entry array that stores HashMap's data has a default size of 16. If there were no expansion mechanism, then as more data was stored the linked lists inside the Entry array would grow very long, defeating the purpose of a HashMap. HashMap therefore has an internal expansion mechanism, built on the following fields:

      The field size, which records the number of key-value pairs currently stored in the HashMap;

      The field threshold, the HashMap's resize threshold, used to decide whether the capacity needs to be adjusted (threshold = capacity * load factor);

      The constant DEFAULT_LOAD_FACTOR = 0.75f, the default load factor of 0.75.

      The condition for HashMap expansion is: when size reaches threshold (the code checks size++ >= threshold on insertion), the HashMap is expanded.

      Expansion creates a new, larger underlying array and then calls the transfer method to move all of the HashMap's elements into it (the index of every element must be recalculated for the new array). Expansion is clearly a time-consuming operation, because it recalculates each element's position and copies it. Therefore, when using a HashMap it is best to estimate the number of elements in advance; pre-sizing helps improve HashMap's performance.
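      For example, a common pre-sizing idiom (a minimal sketch; the class name PresizeDemo and the figure of 1000 expected entries are illustrative):

    import java.util.HashMap;
    import java.util.Map;

    public class PresizeDemo {
        public static void main(String[] args) {
            // Choose an initial capacity so that expectedSize <= capacity * 0.75,
            // ensuring no resize happens while the map is being filled.
            int expectedSize = 1000;
            Map<Integer, Integer> map = new HashMap<>((int) (expectedSize / 0.75f) + 1);
            for (int i = 0; i < expectedSize; i++) map.put(i, i);
            System.out.println(map.size());
        }
    }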

      HashMap has four constructors, and two very important parameters appear in them: initial capacity and load factor. These two parameters strongly affect HashMap's performance. The capacity is the number of slots in the hash table (that is, the length of the hash array), and the initial capacity is the capacity when the table is created (as the constructors show, it defaults to 16 if not specified). The load factor is a measure of how full the hash table may get before its capacity is automatically increased: when the number of entries exceeds the product of the load factor and the current capacity, the table is resized (that is, expanded).

      A word about the load factor. The larger the load factor, the more fully the space is used, but search efficiency drops (the linked lists grow longer and longer). If the load factor is too small, the data in the table will be too sparse (the table is expanded before much of its space is used), which wastes space badly. If we do not specify it in the constructor, the default load factor is 0.75; this is a good compromise and generally does not need to be changed.

       In addition, whatever capacity we specify, the constructor sets the actual capacity to the smallest power of two that is not less than the specified capacity, and this value cannot exceed 2 to the power of 30.
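      JDK 8 performs this rounding with the tableSizeFor helper shown below (lightly annotated from the JDK 8 source, where MAXIMUM_CAPACITY is 1 << 30):

    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Returns the smallest power of two >= cap, capped at MAXIMUM_CAPACITY.
    static int tableSizeFor(int cap) {
        int n = cap - 1;   // subtract 1 so an exact power of two maps to itself
        n |= n >>> 1;      // smear the highest set bit downward...
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;     // ...until all lower bits are 1
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }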

Hashtable is likewise implemented on top of a hash table. Each element is also a key-value pair, conflicts are resolved internally through singly linked lists, and when the number of elements exceeds the threshold the table also grows automatically.

      Hashtable is an old class, introduced back in JDK 1.0. It is thread-safe and can be used in multi-threaded environments.

      Hashtable also implements the Serializable interface, so it supports serialization, and the Cloneable interface, so it can be cloned.

 

The differences between Hashtable and HashMap

      1. Different inherited parent classes

      Hashtable inherits from the Dictionary class, while HashMap inherits from the AbstractMap class. Both implement the Map interface.

      2. Thread safety is different

      The HashMap Javadoc describes this as follows: this implementation is not synchronized. If multiple threads access a hash map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally.

      The methods of Hashtable are synchronized, while the methods of HashMap are not. In a multi-threaded environment you can therefore use Hashtable directly, without synchronizing its methods yourself; with HashMap you must add the synchronization yourself. (A structural modification is any operation that adds or removes one or more mappings; merely changing the value associated with a key the map already contains is not a structural modification.) External synchronization is typically achieved by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be "wrapped" using the Collections.synchronizedMap method, best done at creation time to prevent accidental unsynchronized access, like this:

      Map m = Collections.synchronizedMap(new HashMap(...));
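      Note that even the wrapped map must be synchronized manually while iterating over its collection views, as the Collections.synchronizedMap Javadoc requires; a brief sketch (the class name is illustrative):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class SynchronizedMapDemo {
        public static void main(String[] args) {
            Map<String, Integer> m = Collections.synchronizedMap(new HashMap<>());
            m.put("a", 1);
            // Individual calls such as put/get are synchronized automatically,
            // but iteration must be guarded manually on the wrapper itself:
            synchronized (m) {
                for (Map.Entry<String, Integer> e : m.entrySet()) {
                    System.out.println(e.getKey() + "=" + e.getValue());
                }
            }
        }
    }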

      It is easy to see why Hashtable is thread-safe: synchronized is added to every method. Here we analyze why HashMap is not thread-safe:

      The bottom layer of HashMap is an Entry array. When a hash conflict occurs, HashMap resolves it with a linked list, storing the head node of the list at the corresponding array position. New nodes are added at the head of the list.

Let's analyze multi-threaded access:

      (1) When HashMap performs a put operation, the following method is called:

    // Add a new Entry. Inserts "key-value" at the given slot; bucketIndex is the slot index.
    void addEntry(int hash, K key, V value, int bucketIndex) {
        // Save the current head of the "bucketIndex" slot in "e"
        Entry<K,V> e = table[bucketIndex];
        // Set the element at "bucketIndex" to the new Entry,
        // with "e" as the new Entry's next node
        table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
        // If the actual size of the HashMap is not less than the threshold, resize it
        if (size++ >= threshold)
            resize(2 * table.length);
    }

      The method above is called whenever HashMap performs a put. Now suppose thread A and thread B call addEntry on the same array position at the same time: both threads read the current head node simultaneously; A writes its new head node, and then B writes its new head node. B's write overwrites A's write, so A's insertion is lost.
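      A toy demonstration of this lost-update race (a sketch, and deliberately non-deterministic: on a plain HashMap the printed size usually falls short of the expected 20000, and on JDK 7 and earlier a concurrent resize may even hang the program):

    import java.util.HashMap;
    import java.util.Map;

    public class LostUpdateDemo {
        public static void main(String[] args) throws InterruptedException {
            Map<Integer, Integer> map = new HashMap<>();
            Thread a = new Thread(() -> { for (int i = 0;     i < 10000; i++) map.put(i, i); });
            Thread b = new Thread(() -> { for (int i = 10000; i < 20000; i++) map.put(i, i); });
            a.start(); b.start();
            a.join();  b.join();
            // Lost head-insertions show up as missing entries.
            System.out.println("size = " + map.size() + " (expected 20000)");
        }
    }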

        (2) The code that deletes a key-value pair:

    // Delete the element with the given "key"
    final Entry<K,V> removeEntryForKey(Object key) {
        // Compute the hash value: 0 if key is null, otherwise call hash()
        int hash = (key == null) ? 0 : hash(key.hashCode());
        int i = indexFor(hash, table.length);
        Entry<K,V> prev = table[i];
        Entry<K,V> e = prev;

        // Remove the element with "key" from the linked list;
        // essentially, "delete a node from a singly linked list"
        while (e != null) {
            Entry<K,V> next = e.next;
            Object k;
            if (e.hash == hash &&
                ((k = e.key) == key || (key != null && key.equals(k)))) {
                modCount++;
                size--;
                if (prev == e)
                    table[i] = next;
                else
                    prev.next = next;
                e.recordRemoval(this);
                return e;
            }
            prev = e;
            e = next;
        }

        return e;
    }

      When multiple threads operate on the same array position at the same time, each first reads the head node currently stored there, performs its own computation, and then writes its result back to that position. But by the time a thread writes back, other threads may already have modified that position, and their modifications are overwritten.

      (3) In addEntry, when adding a new key-value pair pushes the total number of key-value pairs past the threshold, the resize operation is called. The code is as follows:

    // Resize the HashMap; newCapacity is the adjusted capacity
    void resize(int newCapacity) {
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        // If the capacity has already reached the maximum, don't expand; return directly
        if (oldCapacity == MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return;
        }

        // Create a new table, transfer all elements of the old table into it,
        // then install the new table as the HashMap's underlying array
        Entry[] newTable = new Entry[newCapacity];
        transfer(newTable);
        table = newTable;
        threshold = (int)(newCapacity * loadFactor);
    }

      This operation creates a new array with the new capacity, recalculates the position of every key-value pair from the original array and writes it into the new array, and then points table to the new array.

      When multiple threads detect at the same time that the total exceeds the threshold, they all call resize: each creates its own new array, rehashes into it, and assigns it to the map's underlying table field. The result is that only the array produced by the last thread to complete the assignment survives; everything the other threads wrote is lost. Moreover, if some threads have completed the assignment while others are just starting, the freshly assigned table is used as those threads' source array, which also causes problems (in JDK 7 and earlier, concurrent transfers could even link entries into a cycle, so a later get() would spin in an infinite loop).

      3. Whether a contains method is provided

      HashMap removed Hashtable's contains method and replaced it with containsValue and containsKey, because a plain contains method is easily misleading.

      Hashtable retains all three methods: contains, containsValue, and containsKey, of which contains and containsValue have the same function.

Let's take a look at the source code of Hashtable's containsKey and containsValue methods:

    public boolean containsValue(Object value) {
        return contains(value);
    }

    // Determine whether the Hashtable contains "value"
    public synchronized boolean contains(Object value) {
        // Note that values in a Hashtable cannot be null;
        // if value is null, throw an exception!
        if (value == null) {
            throw new NullPointerException();
        }

        // Traverse the table array from back to front; for each Entry
        // (a singly linked list), check node by node whether its value equals "value"
        Entry tab[] = table;
        for (int i = tab.length ; i-- > 0 ;) {
            for (Entry<K,V> e = tab[i] ; e != null ; e = e.next) {
                if (e.value.equals(value)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Determine whether the Hashtable contains "key"
    public synchronized boolean containsKey(Object key) {
        Entry tab[] = table;
        // Compute the hash value: the key's hashCode is used directly
        int hash = key.hashCode();
        // Compute the index in the array
        int index = (hash & 0x7FFFFFFF) % tab.length;
        // Find the linked list for the key, then look for a node
        // whose hash and key are both equal to the given key
        for (Entry<K,V> e = tab[index] ; e != null ; e = e.next) {
            if ((e.hash == hash) && e.key.equals(key)) {
                return true;
            }
        }
        return false;
    }

      Now let's take a look at the source code of HashMap's containsKey and containsValue methods:

    // Whether the HashMap contains "key"
    public boolean containsKey(Object key) {
        return getEntry(key) != null;
    }

    // Return the key-value pair with the given "key"
    final Entry<K,V> getEntry(Object key) {
        // Compute the hash value:
        // HashMap stores elements with a null key in table[0];
        // for a non-null key it calls hash() to compute the hash value
        int hash = (key == null) ? 0 : hash(key.hashCode());
        // On the linked list for this hash value, find the element whose key equals "key"
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash &&
                ((k = e.key) == key || (key != null && key.equals(k))))
                return e;
        }
        return null;
    }

    // Whether the HashMap contains an element with the given "value"
    public boolean containsValue(Object value) {
        // If value is null, delegate to containsNullValue()
        if (value == null)
            return containsNullValue();

        // Otherwise, search the HashMap for a node holding "value"
        Entry[] tab = table;
        for (int i = 0; i < tab.length ; i++)
            for (Entry e = tab[i] ; e != null ; e = e.next)
                if (value.equals(e.value))
                    return true;
        return false;
    }

Comparing the source code above, we arrive at the fourth difference.

      4. Whether null keys and values are allowed

      In both classes, keys and values are objects; a map cannot contain duplicate keys, but it may contain duplicate values.

      We can see this clearly from the containsKey and containsValue source code above:

      In Hashtable, neither keys nor values may be null. A call such as put(null, null) on a Hashtable still compiles, because both key and value are of type Object, but it throws a NullPointerException at runtime; this behavior is stipulated by the JDK specification.
In HashMap, null may be used as a key, and there can be at most one such key; any number of keys may map to a null value. When the get() method returns null, the key may be absent from the HashMap, or the key's value may itself be null. Therefore get() cannot be used to decide whether a key exists in a HashMap; the containsKey() method should be used instead.
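      A small demonstration of these rules (a sketch; the commented-out Hashtable calls would throw NullPointerException if uncommented):

    import java.util.HashMap;
    import java.util.Hashtable;
    import java.util.Map;

    public class NullKeyDemo {
        public static void main(String[] args) {
            Map<String, String> hm = new HashMap<>();
            hm.put(null, "a");                       // OK: one null key, stored via table[0]
            hm.put("k", null);                       // OK: null values allowed
            System.out.println(hm.get("missing"));   // null -- but so is hm.get("k")
            System.out.println(hm.containsKey("k")); // true: the reliable existence test

            Map<String, String> ht = new Hashtable<>();
            // ht.put(null, "a");  // would throw NullPointerException
            // ht.put("k", null);  // would throw NullPointerException
            System.out.println(ht.isEmpty());
        }
    }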

      5. The traversal methods supported by the two differ.

      Both Hashtable and HashMap support Iterator; for historical reasons, Hashtable also supports Enumeration.
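      A brief illustration of the two traversal styles (a sketch):

    import java.util.Enumeration;
    import java.util.Hashtable;
    import java.util.Map;

    public class TraversalDemo {
        public static void main(String[] args) {
            Hashtable<String, Integer> ht = new Hashtable<>();
            ht.put("a", 1);
            ht.put("b", 2);

            // Legacy Enumeration traversal (Hashtable only)
            for (Enumeration<String> keys = ht.keys(); keys.hasMoreElements(); ) {
                System.out.println(keys.nextElement());
            }

            // Iterator-based traversal via the Map interface (works for both classes)
            for (Map.Entry<String, Integer> e : ht.entrySet()) {
                System.out.println(e.getKey() + "=" + e.getValue());
            }
        }
    }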

      6. Different hash values

      The hash values are used differently: Hashtable uses the object's hashCode directly, while HashMap recomputes a hash value from it.

      hashCode() is an int value computed by the JDK: by default it is derived from the object's identity (typically its address), while classes such as String and the numeric wrappers compute it from their contents.

      Hashtable computes the hash by calling the key's hashCode() directly, while HashMap recomputes the key's hash. To turn the hash into an array index, Hashtable uses a modulo operation: it first applies hash & 0x7FFFFFFF and then takes the remainder modulo the table length. The purpose of & 0x7FFFFFFF is to turn a negative hash value into a positive one: the hash may be negative, and the mask clears only the sign bit, leaving the remaining bits unchanged. HashMap instead computes the index with a bitwise AND, hash & (length - 1), which is equivalent to a modulo (and cheaper) because its table length is always a power of two.
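      Side by side, the two index computations look like this (a sketch; supplementalHash stands in for the re-mixing done by HashMap's internal hash() in JDK 7, and the key and table lengths are illustrative):

    public class IndexDemo {
        // Stand-in for HashMap's internal hash() re-mixing (JDK 7 style)
        static int supplementalHash(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        public static void main(String[] args) {
            int h = "someKey".hashCode();
            int hashtableLength = 11;  // Hashtable: any length works with %
            int hashMapLength = 16;    // HashMap: length must be a power of two

            // Hashtable: clear the sign bit, then take the remainder
            int hashtableIndex = (h & 0x7FFFFFFF) % hashtableLength;

            // HashMap: re-mix the hash, then mask with (length - 1)
            int hashMapIndex = supplementalHash(h) & (hashMapLength - 1);

            System.out.println(hashtableIndex + " " + hashMapIndex);
        }
    }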

      7. The internal implementation uses different array initialization and expansion methods.

      The default capacity of Hashtable, when none is specified, is 11, while HashMap's is 16. Hashtable does not require the capacity of the underlying array to be an integer power of 2, while HashMap does.
      When Hashtable expands, the capacity becomes twice the old capacity plus 1; when HashMap expands, the capacity simply doubles.
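      To make the two growth rules concrete, a tiny sketch printing the first few capacities:

    public class GrowthDemo {
        public static void main(String[] args) {
            int ht = 11, hm = 16; // the default capacities
            for (int i = 0; i < 4; i++) {
                System.out.println("Hashtable: " + ht + "   HashMap: " + hm);
                ht = ht * 2 + 1; // Hashtable rule: 11, 23, 47, 95, ...
                hm = hm * 2;     // HashMap rule:  16, 32, 64, 128, ...
            }
        }
    }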


There are four common methods of resolving hash conflicts (collisions):

The chain address method
The rehash method
Establishing a public overflow area
The open addressing method

Method 1: The chain address method
For keys with the same hash value, connect their entries with a linked list (this is also called the zipper method; HashMap uses it).

Advantages:

Conflict handling is simple and there is no clustering: non-synonymous keys never conflict with each other, so the average search length is shorter.
It suits situations where the total number of entries changes frequently, because the nodes on each linked list are allocated dynamically.
It takes up little space: the fill factor may be α ≥ 1, and when nodes are large the extra pointer field added by chaining is negligible.
Deleting a node is easy to implement: simply remove the corresponding node from its linked list.

Disadvantages:

Query efficiency is lower, because storage is dynamic and following the list pointers costs extra time.
When the keys can be predicted in advance and there are no later insertions or modifications, open addressing performs better than chaining.
It is not easy to serialize.
Method 2: The rehash method
Provide multiple hash functions: if the hash value computed for the key by the first hash function conflicts, the second hash function is used to compute a new hash value, and so on (a sketch follows the list below).

Advantages:

Not prone to clustering.

Disadvantages:

It increases computation time.
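A toy sketch of the rehash method (the particular hash functions are illustrative assumptions):

    import java.util.function.IntUnaryOperator;

    public class RehashDemo {
        public static void main(String[] args) {
            final int m = 11;
            Integer[] table = new Integer[m];
            table[3] = 47; // pretend slot 3 is already occupied
            // A family of hash functions, tried in order on conflict
            IntUnaryOperator[] hashes = {
                k -> k % m,
                k -> (k / m) % m,
                k -> (k * 7 + 3) % m,
            };
            int key = 69;
            for (IntUnaryOperator h : hashes) {
                int i = h.applyAsInt(key);
                if (table[i] == null) {   // first non-conflicting slot wins
                    table[i] = key;
                    System.out.println(key + " -> slot " + i);
                    break;
                }
            }
        }
    }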
Method 3: Establish a public overflow area
Divide the hash table into two parts, a basic table and an overflow table; any element that conflicts in the basic table is placed in the overflow table.

Method 4: The open addressing method
When the hash address p = H(key) of keyword key conflicts, another hash address p1 is generated based on p; if p1 still conflicts, another hash address p2 is generated based on p, and so on, until a non-conflicting hash address pi is found and the element is stored there.

That is: Hi = (H(key) + di) % m (i = 1, 2, ..., n)

The open addressing method has three common probing variants:

Linear probing: check the next unit sequentially until an empty unit is found or the entire table has been searched; di = 1, 2, 3, ..., m-1.

Quadratic probing: probe alternately to the left and right of the original position, jumping by squares, until an empty unit is found or the entire table has been searched; di = 1², -1², 2², -2², ..., k², -k² (k ≤ m/2).

Pseudo-random probing: build a pseudo-random number generator (for example, i = (i + p) % m) and give a random number as the starting point; di = the pseudo-random number sequence.

For example, suppose the hash table length is m = 11 and the hash function is H(key) = key % 11. Then H(47) = 3, H(26) = 4, H(60) = 5. Now suppose the next keyword is 69: H(69) = 3, which conflicts with 47.

If linear probing is used to handle the conflict, the next hash address is H1 = (3 + 1) % 11 = 4; there is still a conflict, so the next is H2 = (3 + 2) % 11 = 5; still a conflict, so continue to H3 = (3 + 3) % 11 = 6. There is no conflict now, so 69 is placed in unit 6.

If quadratic probing is used to handle the conflict, the next hash address is H1 = (3 + 1²) % 11 = 4; there is still a conflict, so the next is H2 = (3 - 1²) % 11 = 2. There is no conflict now, so 69 is placed in unit 2.

If pseudo-random probing is used to handle the conflict, and the pseudo-random number sequence is 2, 5, 9, ..., then the next hash address is H1 = (3 + 2) % 11 = 5; there is still a conflict, so the next is H2 = (3 + 5) % 11 = 8. There is no conflict now, so 69 is placed in unit 8.
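A minimal sketch that reproduces the linear probing example above (m = 11, H(key) = key % 11):

    public class LinearProbingDemo {
        public static void main(String[] args) {
            Integer[] table = new Integer[11];
            // Insert 47, 26, 60, then 69; 69 collides at slot 3
            // and probes linearly until slot 6 is free.
            for (int key : new int[]{47, 26, 60, 69}) {
                int i = key % 11;
                while (table[i] != null) i = (i + 1) % 11; // probe the next slot
                table[i] = key;
                System.out.println(key + " -> slot " + i);
            }
        }
    }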

Advantages:

Easy to serialize.
If the total amount of data can be predicted, a perfect hash sequence can be created.

Disadvantages:

It takes up more space: to reduce conflicts, open addressing requires a small fill factor α, so when nodes are large a great deal of space is wasted.
Deleting nodes is troublesome. The slot of a deleted node cannot simply be left empty, or the search path to synonym nodes stored after it would be cut off, because in the various open addressing methods an empty address unit is the condition for search failure. Therefore, when deleting from a hash table that resolves conflicts by open addressing, a node can only be marked as deleted; it cannot actually be removed.


Origin blog.csdn.net/a154555/article/details/127468428