Data Structures and Algorithms: Lookup Algorithms - Hash Table

A hash table is another structure used to find a specified element.

A hash table is a data structure that is accessed directly by key. A hash function maps each key to a storage address, which establishes a direct mapping between keys and storage addresses. The storage address can be an array subscript, an index, a memory address, and so on.

Using a hash table to find an element requires solving two problems: designing the hash function and handling collisions.
[Figure 8-75: hash table lookup example]
In Figure 8-75, if you want to find 48, you can get its storage address through the hash function, and find the key directly. The time complexity of a hash table lookup is independent of the number of elements in the table. Ideally, the time complexity of a hash table lookup is O(1).

However, the hash function may map two or more keys to the same address, producing a "collision"; distinct keys that collide are called synonyms. For example, if the hash function also maps 13 to address 3, the same address as 48, then 13 and 48 are synonyms. Hash functions should therefore be designed to minimize collisions, and because collisions cannot be avoided entirely, a method for handling them is also needed.
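The collision described above can be reproduced in a few lines. This is only a sketch: the text does not say which hash function sends both 48 and 13 to address 3, so `key % 5` is an assumption chosen because it gives that result.

```python
def toy_hash(key: int, p: int = 5) -> int:
    """Hypothetical hash function (assumed, not from the article): key % p."""
    return key % p


addr_48 = toy_hash(48)
addr_13 = toy_hash(13)

# Two different keys with the same address are synonyms: a collision.
is_collision = addr_48 == addr_13
print(addr_48, addr_13, is_collision)
```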

hash function

A hash function is a function that maps keys to storage addresses, written as hash(key)=Addr. The following 2 principles should be followed when designing a hash function.

  • 1) The hash function should be as simple as possible, so that the hash address of any key can be calculated quickly.
  • 2) The addresses produced by the hash function should be evenly distributed over the whole address space, to avoid clustering and reduce collisions.

The design principles for a hash function can be summed up in two words: simple and uniform.

Common hash functions are as follows

(1) Direct addressing method

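The direct addressing method takes a linear function of the key as the hash address: hash(key)=a×key+b, where a and b are constants. It produces no collisions, but it wastes space unless the keys are densely distributed.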
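Direct addressing is commonly defined as hash(key) = a*key + b. The sketch below assumes that definition; the year keys and the constants a=1, b=-2000 are made up for illustration.

```python
def direct_hash(key: int, a: int = 1, b: int = -2000) -> int:
    """Direct addressing: a linear function of the key (a, b are assumed constants)."""
    return a * key + b


# Hypothetical keys: years in the range 2000..2009, so 10 slots are enough.
years = [2003, 2000, 2009, 2005]
table = [None] * 10
for year in years:
    table[direct_hash(year)] = year  # distinct keys never share a slot

print(table)
```

Each key gets its own slot, so there are no collisions, but the slots for years that never occur stay empty.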

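(2) Division remainder method

The division remainder method takes the remainder of the key divided by p as the hash address: hash(key)=key%p, where p is a prime number no larger than the table length.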
Why choose p to be prime?
The reason for choosing p to be prime is to avoid conflicts. Because in practical applications, the data often have a certain periodicity. If the period and p have a common prime factor, the probability of conflict will increase sharply. For example, for the gears in a watch, the number of teeth of the two meshing gears should preferably be co-prime, otherwise the probability of gear wear and tear is very high. Therefore, the probability of conflict increases rapidly with the increase of prime factors contained in p, and the more prime factors, the more conflicts.
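The effect of periodic data on the choice of p can be checked with a small experiment. This is a sketch with made-up data: 100 keys with period 4, hashed once with p=12 (which shares the factor 4 with the period) and once with the prime p=13.

```python
from collections import Counter


def bucket_counts(keys, p):
    """Count how many keys land on each address under hash(key) = key % p."""
    return Counter(key % p for key in keys)


keys = range(0, 400, 4)              # hypothetical periodic data: 0, 4, 8, ..., 396
composite = bucket_counts(keys, 12)  # 12 and the period 4 share a prime factor
prime = bucket_counts(keys, 13)      # 13 is prime

# Number of addresses actually used, and the size of the fullest address.
print(len(composite), max(composite.values()))
print(len(prime), max(prime.values()))
```

With p=12 only addresses 0, 4, and 8 are ever used, so every address holds more than 30 keys; with p=13 the same keys spread over all 13 addresses.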

(3) Random number method

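The random number method takes a pseudo-random function of the key as the hash address: hash(key)=random(key). The function must return the same address every time it is given the same key, otherwise a stored key could not be found again.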

ways to deal with conflicts

No matter how the hash function is designed, collisions cannot be avoided entirely, so a method for resolving them is required. There are three kinds of conflict handling methods: the open address method, the chain address method, and the common overflow area.

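1. open address method

The open address method stores every record in the hash table itself. When the address hash(key) is already occupied, further addresses are probed in turn: Hi=(hash(key)+di)%m, i=1, 2, ..., where m is the table length and di is the increment sequence. The choice of di distinguishes linear, secondary, and random detection.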

(1) Linear detection method

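The linear detection method uses the increment sequence di=1, 2, ..., m-1: when a conflict occurs, the following addresses are checked one by one, wrapping around to the start of the table, until an empty address is found.
· Average lookup length for successful lookups: the total number of comparisons needed to find each key in the table, divided by the number of keys.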
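Linear detection can be sketched as follows. The table size, modulus, and keys are invented for illustration (they are not the article's lost example), and a probe of an empty slot is counted as one comparison.

```python
from typing import List, Optional, Tuple


def lp_insert(table: List[Optional[int]], key: int, p: int) -> int:
    """Insert key using linear detection; return the slot it ends up in."""
    m = len(table)
    home = key % p
    for d in range(m):                     # increments 0, 1, ..., m-1
        slot = (home + d) % m              # wrap around the end of the table
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def lp_search(table: List[Optional[int]], key: int, p: int) -> Tuple[int, int]:
    """Return (slot, comparisons); slot is -1 when the key is absent."""
    m = len(table)
    home = key % p
    for d in range(m):
        slot = (home + d) % m
        if table[slot] is None:            # reaching an empty slot ends the search
            return -1, d + 1
        if table[slot] == key:
            return slot, d + 1
    return -1, m


table: List[Optional[int]] = [None] * 7
keys = [14, 21, 8, 15]                     # hypothetical keys that collide under key % 7
for k in keys:
    lp_insert(table, k, 7)

# Average lookup length for successful lookups, with equal search probability.
asl_succ = sum(lp_search(table, k, 7)[1] for k in keys) / len(keys)
print(table, asl_succ)
```

Here 14 and 21 both hash to 0, and 8 and 15 both hash to 1, so the later keys slide forward into the next free slots.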

(2) Secondary (quadratic) detection method

The secondary detection method probes by jumping alternately backward and forward from the hash address. When a conflict occurs, it checks 1² positions backward (toward higher addresses), 1² positions forward, 2² positions backward, 2² positions forward, and so on. Jumping in this way avoids the accumulation of keys in adjacent addresses.
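The jump pattern of secondary detection can be sketched as a generator of addresses. The table length 11 and the keys are assumptions made for this example; the increments follow the usual sequence +1², -1², +2², -2², ...

```python
def quadratic_probe_sequence(home: int, m: int):
    """Yield home, home+1^2, home-1^2, home+2^2, home-2^2, ... (all modulo m)."""
    yield home % m
    for k in range(1, m // 2 + 1):
        yield (home + k * k) % m
        yield (home - k * k) % m   # Python's % keeps the result in 0..m-1


def qp_insert(table, key, p):
    """Insert key at the first empty address in its probe sequence."""
    for slot in quadratic_probe_sequence(key % p, len(table)):
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found on the probe sequence")


seq = list(quadratic_probe_sequence(3, 11))
print(seq)

table = [None] * 11
slots = [qp_insert(table, k, 11) for k in (3, 14, 25)]  # all three hash to 3
print(slots)
```

The three synonyms end up at addresses 3, 4, and 2 rather than in one contiguous run.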

(3) Random detection method

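The random detection method uses a pseudo-random sequence as the increments di. The same sequence must be used for insertion and for lookup, so that a lookup retraces the addresses tried during insertion.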
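A minimal sketch of random detection, assuming the increments are a fixed pseudo-random permutation of 1..m-1. The seed, table length, and keys are arbitrary choices for illustration.

```python
import random


def random_increments(m: int, seed: int = 2024):
    """Reproducible pseudo-random permutation of 1..m-1 (seed is an assumption)."""
    incs = list(range(1, m))
    random.Random(seed).shuffle(incs)
    return incs


def rp_insert(table, key, p, incs):
    """Insert key, probing home, home+incs[0], home+incs[1], ... (modulo m)."""
    m = len(table)
    home = key % p
    for d in [0] + incs:
        slot = (home + d) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def rp_search(table, key, p, incs):
    """Retrace the same probe sequence; return the slot, or -1 if absent."""
    m = len(table)
    home = key % p
    for d in [0] + incs:
        slot = (home + d) % m
        if table[slot] is None:
            return -1
        if table[slot] == key:
            return slot
    return -1


incs = random_increments(11)
table = [None] * 11
keys = (3, 14, 25)                       # all three hash to address 3 under key % 11
slots = [rp_insert(table, k, 11, incs) for k in keys]
print(slots)
```

Because the sequence is generated from a fixed seed, a lookup follows exactly the addresses that insertion tried.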

2. chain address method

The chain address method is also known as the zipper method. If the hash function maps different keys to the same address, these keys are synonyms, and all synonyms for an address are stored in one singly linked list. Search, insert, and delete operations are carried out mainly within that linked list, so the zipper method suits tables with frequent insertion and deletion.

For example, given the set of keys (14, 36, 42, 38, 40, 15, 19, 12, 51, 65, 34, 25), a table length of 15, and the hash function hash(key)=key%13, construct the hash table using the chain address method to handle conflicts.

Algorithm diagram

Taking the keys in order, compute each hash address with the hash function. If that address space is empty, the key is stored there directly; if it already holds data, the chain address method handles the conflict by adding the key to the list at that address.

hash(14)=14%13=1, put it into the singly linked list behind space 1.
hash(36)=36%13=10, put it into the singly linked list behind space 10.
hash(42)=42%13=3, put it into the singly linked list behind space 3.
hash(38)=38%13=12, put it into the singly linked list behind space 12.
hash(40)=40%13=1, put it into the singly linked list behind space 1.
hash(15)=15%13=2, put it into the singly linked list behind space 2.
hash(19)=19%13=6, put it into the singly linked list behind space 6.
hash(12)=12%13=12, put it into the singly linked list behind space 12.
hash(51)=51%13=12, put it into the singly linked list behind space 12.
hash(65)=65%13=0, put it into the singly linked list behind space 0.
hash(34)=34%13=8, put it into the singly linked list behind space 8.
hash(25)=25%13=12, put it into the singly linked list behind space 12.
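[Figure 8-91: hash table built with the chain address method; chains listed in the order the keys were inserted]
0: 65
1: 14 -> 40
2: 15
3: 42
6: 19
8: 34
10: 36
12: 38 -> 12 -> 51 -> 25
(spaces 4, 5, 7, 9, and 11 are empty)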
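The construction above can be reproduced with a short sketch. Two simplifications are assumptions: each chain is a Python list rather than a singly linked list, and new keys are appended at the tail (the text does not say where in the list a new key goes). Only addresses 0 to 12 can be produced by key%13, so the sketch allocates 13 spaces.

```python
def build_chained_table(keys, p):
    """Build a chained hash table with hash(key) = key % p."""
    table = [[] for _ in range(p)]        # one (initially empty) chain per address
    for key in keys:
        table[key % p].append(key)        # assumed: append at the tail of the chain
    return table


keys = [14, 36, 42, 38, 40, 15, 19, 12, 51, 65, 34, 25]
table = build_chained_table(keys, 13)

for addr, chain in enumerate(table):
    print(addr, chain)
```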

performance analysis

(1) Average search length for a successful search

Assuming equal search probability (12 keys, each searched with probability 1/12), the average search length of a successful search is the sum, over all keys, of the number of comparisons needed to find that key multiplied by its search probability. As can be seen from Figure 8-91, 8 keys are found on the 1st comparison, 2 keys on the 2nd, 1 key on the 3rd, and 1 key on the 4th. The average search length of a successful search is:
ASLsucc=(1×8+2×2+3+4)/12=19/12
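The figure 19/12 can be checked mechanically. The sketch below assumes each key is found after as many comparisons as its position in its chain, and uses `fractions.Fraction` to keep the result exact.

```python
from fractions import Fraction

keys = [14, 36, 42, 38, 40, 15, 19, 12, 51, 65, 34, 25]

# Group the keys by hash address (key % 13), keeping insertion order.
chains = {}
for key in keys:
    chains.setdefault(key % 13, []).append(key)

# The i-th key in a chain is found on the i-th comparison.
total_comparisons = sum(
    position
    for chain in chains.values()
    for position, _ in enumerate(chain, start=1)
)

asl_succ = Fraction(total_comparisons, len(keys))
print(asl_succ)
```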

(2) Average search length for search failures

The hash function in this question is hash(key)=key%13, so the possible hash addresses are 0, 1, …, 12, 13 cases in total. Assuming equal probability of lookup failure (13 failure cases, each with probability 1/13), the average lookup length for a lookup failure is the sum, over all 13 addresses, of the number of comparisons needed to determine failure at that address multiplied by its probability.

When hash(key)=0, if the space is empty, one comparison is enough to determine that the search fails; if the space is not empty, the singly linked list behind it is searched until its end (an empty pointer) is reached, at which point the search fails. If there are two nodes in the singly linked list, it takes 3 comparisons to determine that the lookup failed. The cases hash(key)=1, ..., 12 are calculated in the same way, as shown in Figure 8-92.
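[Figure 8-92: number of comparisons for a failed search at each hash address]
Addresses 4, 5, 7, 9, and 11 are empty (1 comparison each); addresses 0, 2, 3, 6, 8, and 10 hold one node (2 comparisons each); address 1 holds two nodes (3 comparisons); address 12 holds four nodes (5 comparisons). The average search length for a failed search is:
ASLfail=(1×5+2×6+3×1+5×1)/13=25/13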
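The failed-search count can be worked out the same way. This sketch assumes the counting rule stated in the text: an address whose chain has n nodes needs n + 1 comparisons, the last one being the check that finds the empty end of the chain.

```python
from fractions import Fraction

keys = [14, 36, 42, 38, 40, 15, 19, 12, 51, 65, 34, 25]
p = 13

# Number of nodes in the chain at each of the 13 addresses.
chain_lengths = [0] * p
for key in keys:
    chain_lengths[key % p] += 1

# n nodes plus the final comparison against the empty position.
fail_comparisons = [n + 1 for n in chain_lengths]

asl_fail = Fraction(sum(fail_comparisons), p)
print(fail_comparisons, asl_fail)
```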

3. Create a common overflow area

In addition to the above methods for dealing with conflicts, a common overflow area can also be established: when a conflict occurs, the key is put into the common overflow area. When searching, first look in the hash table at the hash address of the key being searched for. If that address is empty, the search fails; if it is not empty and the keys are not equal, search the common overflow area; if the key is still not found, the search fails.
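The lookup procedure for a common overflow area can be sketched as a small class. The class name, the use of a plain list for the overflow area, and the sample keys are assumptions made for illustration.

```python
class OverflowHashTable:
    """Hash table with hash(key) = key % p and one shared overflow list (a sketch)."""

    def __init__(self, p: int):
        self.p = p
        self.base = [None] * p      # the hash table proper
        self.overflow = []          # common overflow area for colliding keys

    def insert(self, key: int) -> None:
        addr = key % self.p
        if self.base[addr] is None:
            self.base[addr] = key
        elif self.base[addr] != key and key not in self.overflow:
            self.overflow.append(key)   # conflict: the key goes to the overflow area

    def search(self, key: int) -> bool:
        addr = key % self.p
        if self.base[addr] is None:     # empty address: the search fails
            return False
        if self.base[addr] == key:
            return True
        return key in self.overflow     # otherwise scan the overflow area


t = OverflowHashTable(13)
for k in (14, 40, 15):                  # 14 and 40 both hash to address 1
    t.insert(k)
print(t.base[1], t.overflow, t.search(40), t.search(27))
```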

Hash lookup and performance analysis

Although a hash table establishes a direct mapping between keys and storage locations, conflicts are inevitable. When searching the hash table, some keys are found by direct addressing with a single comparison, while others still need to be compared against several keys. Because the number of comparisons differs from key to key, the search efficiency of a hash table is measured by its average lookup length. That efficiency depends on 3 factors: the hash function, the load factor, and the method of handling collisions.

1. hash function

A good hash function is simple and uniform: it is cheap to compute, and it maps keys evenly across the hash table, so that keys do not gather in one place and conflicts are unlikely.

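2. load factor

The load factor α is the number of records stored in the table divided by the table length. The larger α is, the fuller the table and the more likely conflicts become; the smaller α is, the fewer the conflicts, at the cost of more unused space.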

3. ways to deal with conflicts

When computing the average search length for failed searches, the denominator is r, the number of distinct addresses the hash function can produce. For example, if hash(key)=key mod 13, the hash addresses are 0 to 12, 13 in total, so r=13. When counting the comparisons for a failed search, whether linear detection, secondary (quadratic) detection, or the chain address method is used, the search stops at the first empty position, and the check against that empty position also counts as one comparison.


Origin blog.csdn.net/qq_44631615/article/details/121061682