The Beauty of Data Structures and Algorithms (Hash Tables)

1. The origin of the hash table

1. The hash table is derived from the array: it extends the array with the help of a hash function, exploiting the array's support for O(1) random access to elements by index.
2. The value used to identify the data to be stored is called the key; the method that converts a key into an array index is called the hash function; and the result computed by the hash function is called the hash value.
3. The data is stored at the array index given by the hash value.

2. What are the basic requirements for a hash function?

  • The hash value computed by the hash function must be a non-negative integer;
  • If key1 = key2, then hash(key1) == hash(key2);
  • If key1 ≠ key2, then hash(key1) ≠ hash(key2). (Note: this third requirement cannot be fully guaranteed in practice; when it is violated, we have a hash collision.)
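
As a concrete illustration (my own sketch, not from the original article), here is a hash function for string keys that meets the first two requirements: it is deterministic and always returns a non-negative index. The third requirement cannot hold in general, since there are far more possible keys than array slots:

```java
public class SimpleHash {
    // Deterministic, non-negative hash for String keys: equal keys always
    // yield the same index; distinct keys may still collide.
    public static int hash(String key, int capacity) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);       // classic polynomial rolling hash
        }
        return (h & 0x7fffffff) % capacity;   // clear the sign bit, map into [0, capacity)
    }

    public static void main(String[] args) {
        String[] table = new String[10];
        String key = "apple";
        table[hash(key, table.length)] = key; // store the data at the hashed index
    }
}
```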

3. How to resolve hash collisions?

No matter how good a hash function is, hash collisions cannot be avoided entirely. So how do we resolve them? Two classes of methods are commonly used: open addressing and chaining (the linked list method).

Open addressing

① Core idea

When a hash collision occurs, we probe for another free slot and insert the element there.

② Linear probing

Inserting data: when we insert an item into the hash table and the slot computed by the hash function is already occupied, we start from that position and probe the following slots one by one until we find a free slot.

Finding data: we use the hash function to compute the hash value of the key we are looking for, then compare the element stored at that index with the target. If they are equal, we have found it; otherwise we keep probing sequentially. If we reach a free slot without finding a match, the element is not in the hash table.

Deleting data: to avoid breaking the search procedure above, deleted elements are only marked as deleted rather than cleared. During a linear probe, a slot marked as deleted does not stop the search; we continue probing past it.

Conclusion: the worst-case time complexity of linear probing is O(n). (A code sketch of this scheme appears at the end of this subsection.)
③ Quadratic probing: linear probing uses a step of 1, probing hash(key)+0, hash(key)+1, hash(key)+2, …; quadratic probing squares the step, probing hash(key)+0, hash(key)+1², hash(key)+2², ….
④ Double hashing: use a sequence of hash functions; if the slot computed by the first is occupied, try the next, and so on, until a free slot is found.
⑤ Performance of open addressing:
We use the load factor to measure how full the table is: load factor = number of elements in the table / length of the table. For example, a table of length 10 holding 7 elements has a load factor of 0.7.
The larger the load factor, the fewer the free slots, the more collisions occur, and the worse the hash table performs.
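
Below is a minimal sketch of the linear probing scheme from ② above (my own illustrative code; the class name `LinearProbingTable` is made up). Note the DELETED tombstone, which keeps searches from stopping too early, and the probe loops, which in the worst case scan all n slots:

```java
/** Illustrative open-addressing hash table with linear probing (String keys). */
class LinearProbingTable {
    private static final String DELETED = new String("__deleted__"); // tombstone sentinel
    private final String[] slots;

    LinearProbingTable(int capacity) { slots = new String[capacity]; }

    private int hash(String key) { return (key.hashCode() & 0x7fffffff) % slots.length; }

    /** Probe forward from hash(key) until a free (or deleted) slot is found. */
    boolean insert(String key) {
        int i = hash(key);
        for (int step = 0; step < slots.length; step++) {
            int j = (i + step) % slots.length;
            if (slots[j] == null || slots[j] == DELETED) { slots[j] = key; return true; }
            if (slots[j].equals(key)) return true; // key already present
        }
        return false; // table full: we probed all n slots, hence the O(n) worst case
    }

    /** Keep probing past DELETED marks; stop only at a truly empty slot. */
    boolean contains(String key) {
        int i = hash(key);
        for (int step = 0; step < slots.length; step++) {
            int j = (i + step) % slots.length;
            if (slots[j] == null) return false; // empty slot: key cannot be further on
            if (slots[j] != DELETED && slots[j].equals(key)) return true;
        }
        return false;
    }

    /** Mark as deleted instead of clearing, so later searches still work. */
    void delete(String key) {
        int i = hash(key);
        for (int step = 0; step < slots.length; step++) {
            int j = (i + step) % slots.length;
            if (slots[j] == null) return;
            if (slots[j] != DELETED && slots[j].equals(key)) { slots[j] = DELETED; return; }
        }
    }
}
```

For simplicity, insert reuses the first free or deleted slot without checking whether the key already exists further along the probe sequence; a production table would guard against such duplicates.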

Chaining / linked list method (more commonly used)

Inserting data: we compute the corresponding slot with the hash function and insert the element into that slot's linked list, so insertion takes O(1) time.
Finding or deleting data: we compute the slot with the hash function and then traverse that slot's linked list to find or delete the element. For a hash function that distributes keys fairly uniformly, each list holds about k = n/m nodes, where n is the number of elements in the hash table and m is the number of slots, so the time complexity is O(k).
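
A chained table can be sketched as an array of linked lists, often called "zippers" (again my own illustrative code, not a specific library's API):

```java
import java.util.LinkedList;

/** Illustrative chained hash table: an array of linked lists ("zippers"). */
class ChainedTable<K, V> {
    private static class Entry<K, V> {
        final K key; V value;
        Entry(K k, V v) { key = k; value = v; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int m) {
        buckets = new LinkedList[m];
        for (int i = 0; i < m; i++) buckets[i] = new LinkedList<>();
    }

    private int slot(K key) { return (key.hashCode() & 0x7fffffff) % buckets.length; }

    /** O(1): compute the slot and prepend to its chain.
        (A real table would also check for an existing key, costing O(k).) */
    void put(K key, V value) {
        buckets[slot(key)].addFirst(new Entry<>(key, value));
    }

    /** O(k) on average, where k = n / m. */
    V get(K key) {
        for (Entry<K, V> e : buckets[slot(key)])
            if (e.key.equals(key)) return e.value;
        return null;
    }

    /** O(k): traverse the chain and unlink the matching node. */
    boolean remove(K key) {
        return buckets[slot(key)].removeIf(e -> e.key.equals(key));
    }
}
```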


4. How to design a hash function?

1. The hash values should be as random and evenly distributed as possible, both to reduce collisions and to keep the data assigned to each slot reasonably balanced even when collisions do occur.
2. The hash function should not be too complex; an expensive hash function takes too long to compute and drags down the hash table's performance.
3. Common hash function design methods include the direct addressing method, the mid-square method, the folding method, the random-number method, and so on.
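
As a rough illustration of two of these (my own sketch; the exact recipes vary by textbook), the folding method chops an integer key into digit groups and sums them, while the mid-square method squares the key and keeps its middle digits:

```java
public class DesignMethods {
    /** Folding method: split the key's digits into groups and sum them. */
    public static int foldHash(long key, int capacity) {
        int h = 0;
        while (key > 0) {
            h += (int) (key % 1000); // take three digits at a time
            key /= 1000;
        }
        return h % capacity;
    }

    /** Mid-square method: square the key and take digits from the middle. */
    public static int midSquareHash(int key, int capacity) {
        long sq = (long) key * key;
        int middle = (int) ((sq / 100) % 10000); // keep four middle digits
        return middle % capacity;
    }
}
```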

5. How to dynamically resize based on the load factor?

1. How to set the load factor threshold?
① Whether the table expands or shrinks can be controlled by setting a load-factor threshold. For a hash table that supports dynamic expansion, the time complexity of inserting data should be analyzed with amortized analysis: most inserts are O(1), and the occasional O(n) rehash is spread out across them.
② Choosing the threshold is a trade-off between time and space. If memory is plentiful and execution efficiency matters most, lower the threshold; conversely, if memory is tight and efficiency demands are modest, raise the threshold.
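
A hedged sketch of threshold-driven expansion (illustrative only; 0.75 happens to be the default load factor of java.util.HashMap): when an insert would push the load factor over the threshold, allocate a larger array and rehash every element into it.

```java
/** Illustrative all-at-once expansion, triggered by a load-factor threshold. */
class ResizingTable {
    private static final double LOAD_FACTOR_THRESHOLD = 0.75;
    private String[] slots = new String[8];
    private int size = 0;

    void insert(String key) {
        if ((double) (size + 1) / slots.length > LOAD_FACTOR_THRESHOLD) {
            resize(slots.length * 2); // one expensive O(n) rehash...
        }
        place(slots, key);
        size++;
    }

    private void resize(int newCapacity) {
        String[] bigger = new String[newCapacity];
        for (String k : slots)            // ...amortized over many O(1) inserts
            if (k != null) place(bigger, k);
        slots = bigger;
    }

    /** Linear probing, as in the open-addressing sketch above (duplicates ignored). */
    private static void place(String[] table, String key) {
        int i = (key.hashCode() & 0x7fffffff) % table.length;
        while (table[i] != null) i = (i + 1) % table.length;
        table[i] = key;
    }
}
```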

2. How can we avoid an inefficient, one-shot expansion? Expand in batches.

① Insertion during batch expansion: when new data arrives, we insert it into the new hash table and also move one entry from the old hash table into the new one. Repeating this on every insertion gradually migrates all of the old data, so no single insert ever has to rehash everything, and insert operations stay fast.
② Queries during batch expansion: look in the new hash table first, and fall back to the old one if the key is not found there.
③ With batch expansion, the time complexity of inserting one piece of data is O(1) in every case, not merely amortized.
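
Here is a sketch of batch expansion in that spirit (my own illustration, loosely modeled on Redis's incremental rehash; real implementations manage their own slot arrays, while this sketch borrows java.util.HashMap to stay short):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Illustrative batch expansion: migrate one old entry per insertion. */
class IncrementalTable<K, V> {
    private final Map<K, V> oldTable = new HashMap<>();
    private final Map<K, V> newTable = new HashMap<>();

    void put(K key, V value) {
        newTable.put(key, value);               // new data always goes to the new table
        Iterator<Map.Entry<K, V>> it = oldTable.entrySet().iterator();
        if (it.hasNext()) {                     // move exactly one old entry per insert
            Map.Entry<K, V> e = it.next();
            newTable.put(e.getKey(), e.getValue());
            it.remove();
        }
    }

    V get(K key) {
        V v = newTable.get(key);                // check the new table first...
        return (v != null) ? v : oldTable.get(key); // ...then fall back to the old one
    }
}
```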

6. How to choose a hash collision resolution method?

① There are two common methods: the open addressing method and the linked list method (chaining).
② In most cases, the linked list method is the more general choice. Moreover, the linked list in each slot can be replaced with another dynamic data structure, such as a red-black tree or a skip list, which keeps the hash table's lookup time from degenerating to O(n) and helps resist hash-collision attacks.

③ However, for small-scale data and a hash table with a low load factor, the open addressing method is the better fit.

7. Why are hash tables and linked lists often used together?

1. Advantage of hash tables: they support efficient insertion, deletion, and lookup.
2. Disadvantage of hash tables: they do not support fast traversal of the data in sorted order.
3. How can we traverse the data in a hash table in sorted order? Only by copying the data into an array, sorting it, and then traversing the array.
4. A hash table is a dynamic data structure with frequent insertions and deletions, so we would have to re-sort before every ordered traversal, which is very inefficient.
5. How do we solve this? By combining the hash table with a linked list (or a skip list).

8. How to combine a hash table and a linked list?

1. The LRU (Least Recently Used) cache eviction algorithm

1.1. What are the main operations of the LRU cache eviction algorithm? There are three:
① add a piece of data to the cache;
② delete a piece of data from the cache;
③ look up a piece of data in the cache.
Summary: all three operations involve a lookup.
1.2. How can the LRU cache eviction algorithm be implemented with a linked list alone?
① Maintain a linked list ordered by access time, with the most recently accessed data at the tail.
② Cache space is limited. When it runs out and a piece of data must be evicted, delete the node at the head of the list.
③ To cache a piece of data, first look it up in the list. If it is not found, insert it at the tail; if it is found, move it to the tail.
④ As noted above, all three main LRU operations involve a lookup. With a plain linked list, that lookup costs O(n). Combining the linked list with a hash table brings the lookup down to O(1).
1.3. How can the LRU cache eviction algorithm be implemented with a hash table plus a linked list?
① Use a doubly linked list to store the data. Each node holds the data itself, a predecessor pointer (prev), a successor pointer (next), and an hnext pointer (used to chain nodes that collide in the hash table).
② The hash table resolves collisions by chaining, so each node sits on two chains at once: the doubly linked list, and the collision chain ("zipper") of its hash slot. The prev and next pointers string nodes together in the doubly linked list; the hnext pointer strings them together in the slot's zipper.
③ How do the three main LRU operations reach O(1) time complexity?
First, note that inserting or deleting a node in a linked list is itself O(1), since only a few pointers need to change.
Next, consider lookup: the hash table locates the node for a key in O(1) time, and the subsequent insertion or deletion in the doubly linked list is also O(1), so the total cost of each operation is O(1).
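
Putting the pieces together, here is a compact sketch (my own code, with made-up names like `LruCache`). For brevity it uses java.util.HashMap as the hash table, so the hnext pointer described above lives inside HashMap's own collision chains; the prev/next pointers of the doubly linked list are explicit:

```java
import java.util.HashMap;
import java.util.Map;

/** LRU cache: HashMap for O(1) lookup + doubly linked list for O(1) recency order. */
class LruCache<K, V> {
    private static class Node<K, V> {
        K key; V value; Node<K, V> prev, next;
        Node(K k, V v) { key = k; value = v; }
    }

    private final int capacity;
    private final Map<K, Node<K, V>> index = new HashMap<>(); // key -> list node
    private final Node<K, V> head = new Node<>(null, null);   // sentinel: least recent side
    private final Node<K, V> tail = new Node<>(null, null);   // sentinel: most recent side

    LruCache(int capacity) {
        this.capacity = capacity;
        head.next = tail; tail.prev = head;
    }

    V get(K key) {
        Node<K, V> n = index.get(key);     // O(1) hash lookup
        if (n == null) return null;
        unlink(n); linkAtTail(n);          // O(1) pointer surgery: mark as most recent
        return n.value;
    }

    void put(K key, V value) {
        Node<K, V> n = index.get(key);
        if (n != null) { n.value = value; unlink(n); linkAtTail(n); return; }
        if (index.size() == capacity) {    // evict the least recently used node
            Node<K, V> lru = head.next;
            unlink(lru); index.remove(lru.key);
        }
        n = new Node<>(key, value);
        index.put(key, n); linkAtTail(n);
    }

    private void unlink(Node<K, V> n) { n.prev.next = n.next; n.next.prev = n.prev; }

    private void linkAtTail(Node<K, V> n) {
        n.prev = tail.prev; n.next = tail;
        tail.prev.next = n; tail.prev = n;
    }
}
```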

2. Redis sorted sets

2.1. What is a sorted set?
① In a sorted set, each member object has two important attributes: a key and a score.
② The data is looked up not only by score but also by key.
2.2. What operations does a sorted set support?
Take a user points leaderboard as an example: you can look up a user's points by user ID, and you can also look up user IDs by points range. Here the user ID is the key, and the points are the score. The sorted set therefore supports the following operations:
① add a member object;
② delete a member object by key;
③ look up a member object by key;
④ look up member objects by score range;
⑤ iterate over member objects sorted by score, from smallest to largest.
To support all of this, the member objects can be organized into a skip list ordered by score, with a hash table built over the keys. With this double index, every operation above is efficient.
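
As a rough sketch of this double index (my own illustration; Redis itself is implemented in C), the JDK's ConcurrentSkipListMap, which really is a skip list, can serve as the score-ordered index, with a HashMap as the key index:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListMap;

/** Illustrative sorted set: hash index by key + skip-list index by score. */
class MiniSortedSet {
    private final Map<String, Double> byKey = new HashMap<>();
    private final NavigableMap<Double, TreeSet<String>> byScore = new ConcurrentSkipListMap<>();

    void add(String key, double score) {
        remove(key);                         // replace any existing score for this key
        byKey.put(key, score);
        byScore.computeIfAbsent(score, s -> new TreeSet<>()).add(key);
    }

    void remove(String key) {                // delete by key
        Double old = byKey.remove(key);
        if (old != null) {
            TreeSet<String> members = byScore.get(old);
            members.remove(key);
            if (members.isEmpty()) byScore.remove(old);
        }
    }

    Double score(String key) { return byKey.get(key); } // look up by key

    /** Look up members whose score lies in [lo, hi], in ascending score order. */
    List<String> range(double lo, double hi) {
        List<String> out = new ArrayList<>();
        for (TreeSet<String> members : byScore.subMap(lo, true, hi, true).values())
            out.addAll(members);
        return out;
    }
}
```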

3.Java LinkedHashMap

LinkedHashMap combines a hash table with a doubly linked list in exactly the same way as the LRU cache implementation above. It supports traversing the data in insertion order, and it also supports traversing it in access order.
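
For example, constructing a LinkedHashMap with accessOrder set to true and overriding removeEldestEntry, both part of the JDK API, yields an LRU cache in a few lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LinkedHashMap as an LRU cache: accessOrder=true keeps the internal list in
// access order, and removeEldestEntry evicts the least recently used entry
// once the map grows past the chosen capacity.
class LruMap<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruMap(int capacity) {
        super(16, 0.75f, true);    // third argument: order by access, not insertion
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict the head (least recently used) entry
    }
}
```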

If you want to dig deeper, take a look at the underlying source code; I have written an article about it before.

Source: blog.csdn.net/qq_54729417/article/details/123462705