Java Core Technology Interview Essentials (Lecture 9) | What is the difference between Hashtable, HashMap, and TreeMap?

Map is another part of the Java collection framework in the broad sense. As one of the most frequently used types in the framework, HashMap and its related types are naturally also interview hotspots.

The question I want to ask you today is: what are the differences between Hashtable, HashMap, and TreeMap? And talk about how well you have mastered HashMap.


Typical answer

Hashtable, HashMap, and TreeMap are among the most common Map implementations; they are container types that store and manipulate data in the form of key-value pairs.

Hashtable is a hash table implementation provided by the early Java class library. It is synchronized and does not support null keys or values. Because of the performance overhead caused by synchronization, it is rarely recommended today.

HashMap is a more widely used hash table implementation. Its behavior is roughly equivalent to Hashtable; the main differences are that HashMap is not synchronized and supports null keys and values. Under normal circumstances, HashMap performs put and get operations in constant time, so it is the first choice for most key-value access scenarios, for example, implementing a runtime storage structure that maps user IDs to user information.

TreeMap is a Map based on a red-black tree that provides ordered access. Unlike HashMap, its get, put, and remove operations have O(log(n)) time complexity. The specific order can be specified by a Comparator, or determined by the natural ordering of the keys.
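As a quick illustration of this difference in ordering (a sketch of mine, not part of the original answer), a HashMap makes no promise about iteration order, while a TreeMap always iterates in key order:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapOrderSample {
    public static void main(String[] args) {
        Map<String, Integer> hashMap = new HashMap<>();
        Map<String, Integer> treeMap = new TreeMap<>(); // natural ordering of the String keys
        for (String key : new String[]{"banana", "apple", "cherry"}) {
            hashMap.put(key, key.length());
            treeMap.put(key, key.length());
        }
        System.out.println(hashMap); // order depends on hash values; no guarantee
        System.out.println(treeMap); // {apple=5, banana=6, cherry=6}, sorted by key
    }
}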

Key point analysis

The above answer is just a brief summary of some basic features. There are many issues around Map that can be extended further, from the various data structures and typical application scenarios to technical considerations in program design and implementation. In particular, HashMap itself underwent very significant changes in Java 8, and these are all aspects that are frequently examined.

Many friends gave me feedback that interviewers seem to like probing the design and implementation details of HashMap, so today I will add the corresponding source code interpretation, focusing mainly on the following aspects:

  • Understand the overall structure of the Map-related types, especially a few key points about the ordered data structures.
  • Analyze the design and implementation of HashMap from the source code: understand why parameters such as capacity and load factor are needed, how they affect the performance of the Map, and how to choose them in practice.
  • Understand the principle behind treeification and the reasons for this improvement.

Besides typical code analysis, some interesting concurrency-related issues are also often brought up. For example, in a concurrent environment, HashMap may exhibit strange problems such as an infinite loop that hogs the CPU, or an inaccurate size.

I think this is a typical usage error, because HashMap clearly documents that it is not a thread-safe data structure. If you ignore this and simply use it in a multi-threaded scenario, problems are bound to occur.

Understanding the cause of this error is also a good way to understand how concurrent programs run. For what actually happens, you can refer to an analysis published long ago, which even provides schematic diagrams; I won't repeat what others have already written.
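If you genuinely need a thread-safe map, a minimal sketch of the two usual alternatives (my own illustration, not from the original text) looks like this:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeMapSample {
    public static void main(String[] args) {
        // Option 1: wrap a HashMap; every call goes through a synchronized wrapper
        Map<String, String> syncMap = Collections.synchronizedMap(new HashMap<>());
        // Option 2: ConcurrentHashMap, designed for concurrent access
        Map<String, String> concurrentMap = new ConcurrentHashMap<>();
        syncMap.put("key", "value");
        concurrentMap.put("key", "value");
        System.out.println(syncMap.get("key") + " " + concurrentMap.get("key"));
    }
}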

Knowledge expansion

1. Map overall structure

First, let's get an overall picture of the Map-related types. Although Map is usually counted as part of the Java collection framework, it is not a collection type in the narrow sense. The class hierarchy can be summarized roughly as follows.

Hashtable is special: as an early collection-related type, like Vector and Stack, it extends the Dictionary class, so its class hierarchy is clearly different from HashMap and the others.

Other Map implementations such as HashMap all extend AbstractMap, which contains general-purpose method abstractions. The purposes of the different Maps can be seen from the class structure; their design intent is already reflected in the different interfaces they implement.

In most scenarios that use a Map, we simply put, access, or remove entries, with no particular requirement on ordering; in that case HashMap is basically the best choice. HashMap's performance depends very heavily on the effectiveness of the hash code, so be sure to master some basic conventions of hashCode and equals, such as:

  • If two objects are equal according to equals, their hashCode values must be equal.
  • If you override equals, you must also override hashCode.
  • hashCode must be consistent: as long as the state used in equals comparisons does not change, repeated calls must keep returning the same hash value.
  • equals must be reflexive, symmetric, and transitive.

There are a lot of information about this content on the Internet, so I won’t elaborate on it here.
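Still, here is a minimal sketch of a key type that follows these conventions (my own illustration; the class name and fields are made up for the example):

import java.util.Objects;

public class UserId {
    private final long id;
    private final String region;

    public UserId(long id, String region) {
        this.id = id;
        this.region = region;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;                // reflexive
        if (!(o instanceof UserId)) return false;
        UserId other = (UserId) o;
        return id == other.id && Objects.equals(region, other.region); // symmetric and transitive
    }

    @Override
    public int hashCode() {
        // derived from exactly the fields used in equals(), so equal objects
        // always produce equal hash codes, and the value stays stable
        // as long as those fields do not change
        return Objects.hash(id, region);
    }
}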

Ordered Maps tend to get relatively little coverage in analyses like this, so let me add a bit more here. Although both LinkedHashMap and TreeMap can guarantee a certain order, they are still very different.

  • LinkedHashMap usually provides traversal in insertion order, which it implements by maintaining a doubly linked list of entries (key-value pairs). Note that, via a specific constructor, we can create an instance that reflects access order instead; in that mode, put, get, compute, and so on all count as an "access."

This behavior fits some specific application scenarios. For example, suppose we are building a space-sensitive resource pool and want to automatically release the least recently accessed objects; this can be achieved with the mechanism LinkedHashMap provides. Refer to the following example:

import java.util.LinkedHashMap;
import java.util.Map;  
public class LinkedHashMapSample {
    public static void main(String[] args) {
        LinkedHashMap<String, String> accessOrderedMap = new LinkedHashMap<String, String>(16, 0.75F, true){
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) { // implement a custom removal policy; otherwise it behaves like an ordinary Map
                return size() > 3;
            }
        };
        accessOrderedMap.put("Project1", "Valhalla");
        accessOrderedMap.put("Project2", "Panama");
        accessOrderedMap.put("Project3", "Loom");
        accessOrderedMap.forEach( (k,v) -> {
            System.out.println(k +":" + v);
        });
        // simulate accesses
        accessOrderedMap.get("Project2");
        accessOrderedMap.get("Project2");
        accessOrderedMap.get("Project3");
        System.out.println("Iterate over should be not affected:");
        accessOrderedMap.forEach( (k,v) -> {
            System.out.println(k +":" + v);
        });
        // trigger removal of the eldest entry
        accessOrderedMap.put("Project4", "Mission Control");
        System.out.println("Oldest entry should be removed:");
        accessOrderedMap.forEach( (k,v) -> {// traverse again; Project1, the least recently accessed entry, has been evicted
            System.out.println(k +":" + v);
        });
    }
}
  • For TreeMap, the overall order is determined by the ordering relationship of the keys, which is specified either by a Comparator or by Comparable (natural ordering).

The problem I left you at the end of the last lecture, building a priority scheduling system, is essentially a typical priority queue scenario. The Java standard library provides PriorityQueue, which is based on a binary heap; these types all rely on the same ordering mechanism, and of course so does TreeSet, which is essentially a thin wrapper around TreeMap.
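Here is a minimal sketch of such a priority queue (my own illustration; the Task class and its fields are invented for the example):

import java.util.Comparator;
import java.util.PriorityQueue;

public class SchedulerSample {
    static class Task {
        final String name;
        final int priority;
        Task(String name, int priority) { this.name = name; this.priority = priority; }
    }

    public static void main(String[] args) {
        // the head of the queue is always the task with the smallest priority value
        PriorityQueue<Task> queue = new PriorityQueue<>(Comparator.comparingInt((Task t) -> t.priority));
        queue.offer(new Task("compact", 5));
        queue.offer(new Task("flush", 1));
        queue.offer(new Task("report", 3));
        while (!queue.isEmpty()) {
            System.out.println(queue.poll().name); // prints flush, report, compact
        }
    }
}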

Similar to the hashCode and equals convention, in order to avoid ambiguity the natural ordering also has to follow a convention: the result of compareTo needs to be consistent with equals, otherwise ambiguous behavior can appear.

We can analyze the implementation of the put method of TreeMap:

public V put(K key, V value) {
    Entry<K,V> t = root;
    // ...
    // when no Comparator is supplied, keys are compared via their natural ordering
    Comparable<? super K> k = (Comparable<? super K>) key;
    do {
        parent = t;
        cmp = k.compareTo(t.key);
        if (cmp < 0)
            t = t.left;
        else if (cmp > 0)
            t = t.right;
        else
            return t.setValue(value); // compareTo() == 0 is treated as "the same key"
    } while (t != null);
    // ...
}

What can you see from the code? When the convention is not followed, two objects that are not equal according to equals are treated as the same key (because compareTo returns 0), which leads to ambiguous behavior.
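A classic demonstration of this kind of ambiguity (my own sketch, not from the original text) uses BigDecimal, whose compareTo ignores scale while equals does not:

import java.math.BigDecimal;
import java.util.HashSet;
import java.util.TreeSet;

public class CompareToVsEqualsSample {
    public static void main(String[] args) {
        BigDecimal a = new BigDecimal("1.0");
        BigDecimal b = new BigDecimal("1.00");
        System.out.println(a.equals(b));    // false: different scale
        System.out.println(a.compareTo(b)); // 0: numerically equal

        HashSet<BigDecimal> hashSet = new HashSet<>();
        hashSet.add(a);
        hashSet.add(b);
        System.out.println(hashSet.size()); // 2: HashSet relies on equals/hashCode

        TreeSet<BigDecimal> treeSet = new TreeSet<>();
        treeSet.add(a);
        treeSet.add(b);
        System.out.println(treeSet.size()); // 1: TreeSet relies on compareTo
    }
}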

2. HashMap source code analysis

As mentioned earlier, the design and implementation of HashMap is a very high-frequency interview topic, so here I will give a relatively detailed source code interpretation, mainly around:

  • An analysis of the basic points of HashMap's internal implementation.
  • Capacity and load factor.
  • Treeification.

First, let's take a look at the internal structure of HashMap. It can be viewed as a composite structure made up of an array (Node<K,V>[] table) and linked lists. The array is divided into buckets, and the hash value of the key determines where a key-value pair is addressed within this array; key-value pairs with the same hash value are stored as a linked list. Note that if the length of a linked list exceeds the threshold (TREEIFY_THRESHOLD, 8), the linked list in that bucket is transformed into a tree structure.

Judging from the implementation of the non-copy constructors, this table (array) is not initialized at construction time; only some initial values are set.

public HashMap(int initialCapacity, float loadFactor){  
    // ... 
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}

So we have good reason to suspect that HashMap follows the lazy-load principle and is only initialized when it is first used (leaving the copy constructor aside; I will only cover the most common scenarios here). With that in mind, let's look at the implementation of the put method; it appears to be nothing more than a single call to putVal:

public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

It seems the real secret is hidden in putVal. What is it? To save space, I have only kept the key parts of putVal.

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        // ...
        if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
           treeifyBin(tab, hash);
        //  ... 
     }
}

From the first few lines of putVal, we can spot several interesting things:

  • If the table is null, the resize method will be responsible for initializing it, as can be seen from tab = resize().
  • The resize method shoulders two responsibilities: creating the initial storage table, and expanding (resizing) it when the capacity no longer meets demand.
  • While placing a new key-value pair, an expansion occurs when the following condition is met:
if (++size > threshold)
    resize();
  • The position of a specific key-value pair in the hash table (its array index) depends on the following bit operation:
i = (n - 1) & hash

Looking carefully at where the hash value comes from, we find that it is not the key's own hashCode, but is produced by another hash method inside HashMap. Why does it shift the high-order bits down and XOR them into the low-order bits? Because for some data the differences between computed hash values lie mainly in the high bits, while hash addressing in HashMap ignores all bits above the capacity; this processing effectively reduces hash collisions in such cases. (Netizen's note: in other words, even if two hash values differ only above the capacity, this internal hash folds that difference back into the addressable range.) A small demonstration follows after this list.

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
  • The linked list structure I mentioned earlier (called a bin here) is treeified once it reaches a certain threshold. I will analyze later why HashMap needs to treeify its bins.
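As promised above, here is a small demonstration of why the high bits are folded into the low bits (my own sketch; the two sample hash codes are chosen only to make the effect visible):

public class HashSpreadSample {
    // the same perturbation HashMap.hash() applies to a key's hashCode()
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16;          // table length: only the low 4 bits take part in addressing
        int h1 = 0x10000;    // two hash codes that differ only in their high bits
        int h2 = 0x20000;
        // without spreading, both land in bucket 0 and collide
        System.out.println(((n - 1) & h1) + " and " + ((n - 1) & h2));
        // after XOR-ing the high bits down, they land in buckets 1 and 2
        System.out.println(((n - 1) & spread(h1)) + " and " + ((n - 1) & spread(h2)));
    }
}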

As you can see, the logic of the putVal method itself is quite concentrated: everything from initialization and expansion to treeification is tied to it. I recommend keeping the main flow above in mind when reading the source.

Let me further analyze resize, the method with multiple jobs; many friends have reported that interviewers often ask about the design of its source code.

final Node<K,V>[] resize() {
    // ...
    else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                oldCap >= DEFAULT_INITIAL_CAPACITY)
        newThr = oldThr << 1; // double threshold
       // ... 
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {  
        // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ? (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    // move the existing entries into the new array
   }

According to the resize source code, leaving aside extreme cases (the theoretical maximum capacity is specified by MAXIMUM_CAPACITY, whose value is 1 << 30, i.e. 2 to the 30th power), it can be summarized as follows:

  • The threshold is equal to (load factor) x (capacity). If these are not specified when constructing the HashMap, the corresponding default constants are used.
  • The threshold is usually adjusted by doubling (newThr = oldThr << 1). As I mentioned earlier, according to the logic in putVal, when the number of elements exceeds the threshold, the Map is resized.
  • After expansion, the elements in the old array need to be relocated to the new array, which is a major source of overhead for expansion.

3. Capacity, load factor and treeing

Earlier we quickly went through HashMap's logic from creation to putting key-value pairs. Now think about it: why do we need to care about capacity and load factor?

This is because capacity and load factor determine the number of available buckets. Too many empty buckets waste space, and buckets that are too full seriously hurt operation performance. In the extreme case, if there were only one bucket, the map would degenerate into a linked list and could not provide the promised constant-time performance at all.

Since capacity and load factor are so important, how should we choose in practice?

If you can know in advance the number of key-value pairs the HashMap will hold, you can consider setting an appropriate size up front. The specific value can be estimated simply from the condition under which expansion occurs. From the earlier code analysis, we know that resizing is avoided as long as:

 load factor * capacity > number of elements

Therefore, the preset capacity needs to be greater than "estimated number of elements / load factor" (internally, HashMap always rounds the capacity up to a power of 2). The conclusion is quite clear; see the sketch below.
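A minimal sketch of pre-sizing a HashMap along these lines (my own illustration; the expected size of 1000 is an arbitrary example):

import java.util.HashMap;
import java.util.Map;

public class PresizeSample {
    public static void main(String[] args) {
        int expectedSize = 1000;
        float loadFactor = 0.75f;
        // the capacity must exceed expectedSize / loadFactor to avoid resizing;
        // HashMap rounds the requested capacity up to the next power of two (here 1334 -> 2048)
        int initialCapacity = (int) (expectedSize / loadFactor) + 1;
        Map<Integer, String> users = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedSize; i++) {
            users.put(i, "user-" + i); // no resize happens while filling the map
        }
        System.out.println(users.size());
    }
}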

For the load factor, I suggest:

  • If there is no special requirement, do not change it lightly, because the default load factor of the JDK itself is very much in line with the requirements of general scenarios.
  • If you really need to adjust, it is recommended not to set a value higher than 0.75, because it will significantly increase conflicts and reduce the performance of HashMap.
  • If you use a load factor that is too small, adjust the preset capacity value according to the above formula, otherwise it may cause more frequent expansion, increase unnecessary overhead, and your own access performance will also be affected.

We mentioned treeification earlier; the corresponding logic is mainly in putVal and treeifyBin.

final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        // treeification logic
    }
}

The above is a simplified illustration of treeifyBin. Combining these two methods, the logic of treeification is clear. It can be understood as: when the number of elements in a bin reaches TREEIFY_THRESHOLD,

  • If the table capacity is less than MIN_TREEIFY_CAPACITY, only a simple expansion (resize) is performed.
  • If the capacity is greater than or equal to MIN_TREEIFY_CAPACITY, the bin is treeified.

So, why does HashMap need treeification at all?

Essentially, this is a security issue. During element placement, if some objects collide on the hash and end up in the same bucket, they form a linked list; and lookup in a linked list is linear, which seriously degrades access performance.

In the real world, constructing hash-colliding data is not particularly complicated. Malicious code can send such data to a server in large volumes and consume large amounts of server-side CPU; this constitutes a hash-collision denial-of-service attack, and similar attacks have actually hit first-tier Internet companies.
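A minimal sketch of how such degenerate buckets arise (my own illustration; the BadKey class is invented, with a deliberately constant hashCode):

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class CollisionSample {
    static final class BadKey {
        final String name;
        BadKey(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && Objects.equals(name, ((BadKey) o).name);
        }
        @Override public int hashCode() { return 42; } // constant: every key lands in the same bucket
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadKey("key-" + i), i); // all entries pile up in one bin
        }
        // lookups now degrade toward O(n), or O(log n) once the bin is treeified in Java 8+
        System.out.println(map.size());
    }
}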

Today, I compared several Map-related implementations, analyzed the ordered collection types to clear up common confusion, and examined the basic structure of HashMap at the source code level. I hope it is helpful to you.

Practice for this lesson

Are you now clear about the topic we discussed today? Here is a question for you: what are the typical methods for resolving hash collisions? (Netizen's note: open addressing, re-hashing, chaining, and a public overflow area.)


Other classic answers 

The following is from netizen Tianliang Haoqiu's answer to this lesson's exercise:

Common methods for resolving hash collisions are:

Open addressing: the basic idea is that when the hash address p = H(key) of a key conflicts, another hash address p1 is generated based on p; if p1 still conflicts, yet another hash address p2 is generated based on p, and so on, until a non-conflicting hash address pi is found, and the element is stored there.

Re-hashing: this method constructs several different hash functions, Hi = RHi(key), i = 1, 2, ..., k. When the hash address H1 = RH1(key) conflicts, H2 = RH2(key) is computed, and so on, until no more conflict occurs. This method is not prone to clustering, but it increases computation time.

Chaining (separate chaining): the basic idea is that all elements whose hash address is i are linked into a singly linked list, called a synonym chain, whose head pointer is stored in the i-th slot of the hash table; search, insertion, and deletion are then carried out mainly within the synonym chain. Chaining is well suited to frequent insertions and deletions.

Public overflow area: the basic idea is to divide the hash table into two parts, a basic table and an overflow table; all elements that conflict with the basic table are placed in the overflow table.

The following comes from netizen Mr. Sankou's answer to this lesson's exercise:

The most common method is linear re-hashing (linear probing). When an element is inserted and there is no conflict, it is placed in the slot given by the original rule; when a conflict occurs, the hash table is simply scanned to find the next empty slot, and the element is inserted there. When looking up an element, check the slot computed for it; if it does not match, keep scanning the table.
Then there is non-linear re-hashing: when a conflict occurs, hash again. The core idea is that if a conflict occurs, a new hash value is generated for addressing, and if there is still a conflict, the process continues.
The main drawback of the methods above is that elements cannot simply be deleted from the table.

And then there is the separate-chaining ("external zipper") idea used by our HashMap.

The following is the answer from the netizen whose official account is "Technology Sleeplessly":

Notes on Hashtable, HashMap, and TreeMap

All three implement the Map interface; what they store is a mapping of key-value pairs. A map cannot contain duplicate keys, and each key can map to at most one value.

(1) Element characteristics
In Hashtable, neither keys nor values can be null; in HashMap, both keys and values can be null. Obviously there can be at most one entry with a null key, while multiple entries with null values are allowed. In TreeMap, when no Comparator is supplied, keys cannot be null; when a Comparator is supplied, whether null keys are allowed depends on whether the Comparator handles null.

(2) Ordering characteristics
Hashtable and HashMap are unordered. TreeMap is implemented with a red-black tree (the key of each node is greater than or equal to all keys in its left subtree and less than or equal to all keys in its right subtree) and implements the SortedMap interface, so the stored entries can be sorted by key. TreeMap is therefore the usual choice when sorting is required; the default is ascending order (produced by an in-order traversal of the tree), and a custom Comparator can be supplied to change the ordering.

(3) Initialization and growth
Initialization: when no capacity is specified, Hashtable's default capacity is 11, and the capacity of its underlying array is not required to be a power of 2; HashMap's default capacity is 16, and the capacity is required to be a power of 2.
Expansion: Hashtable grows to twice the original capacity plus 1; HashMap grows to twice the original capacity.

(4) Thread safety
Hashtable's methods are all synchronized (declared with the synchronized modifier), so two threads never operate on the data at the same time, and thread safety is guaranteed. For that very reason, its performance in multi-threaded environments is very poor: while one thread is inside a synchronized method of Hashtable, any other thread calling a synchronized method is blocked. For example, while one thread is adding data, another thread is blocked even if it only wants to read other data, which greatly reduces throughput. It is effectively obsolete in newer versions and is not recommended.
HashMap has no built-in synchronization, i.e. multiple threads may write to a HashMap at the same time, which can lead to inconsistent data. If you need synchronization, you can (1) use Collections.synchronizedMap, or (2) use ConcurrentHashMap. Compared with Hashtable, which locks the entire object, ConcurrentHashMap is based on lock striping (segmentation): the data stored in the Map is split into segments, and each segment has its own lock, so while one thread holds the lock for one segment, data in the other segments can still be accessed by other threads. ConcurrentHashMap not only guarantees safe data access in multi-threaded environments, but also brings a significant performance improvement.

(5) HashMap in one paragraph
HashMap reads and writes data based on hashing. When we pass a key-value pair to put(), it calls the key object's hashCode() method to compute a hash code and then finds a bucket location in which to store the value object. When retrieving an object, it locates the correct key-value pair via the key object's equals() method and returns the value object. HashMap uses linked lists to resolve collisions: when a collision occurs, the new entry is stored in the next node of the list. HashMap stores a key-value pair in each list node; when two different key objects have the same hash code, their entries are stored in the linked list at the same bucket location, and the right entry is found via the key object's equals() method. If the length of the linked list exceeds the threshold (TREEIFY_THRESHOLD, 8), the list is converted into a tree structure.

 

 

 

 

Origin blog.csdn.net/qq_39331713/article/details/114132830