[Talking about data structures] The most detailed hash table explanation!!! (Part 2)

There are three articles in the hash table series:

1. Overview of the hash table
2. The role and construction of the hash function
3. Code implementation of the hash table search


In the previous article, we mentioned that keys can collide when they are stored through a hash function, and we also explained how such collisions occur.

But let's not rush to solve the conflict problem. Let's add a concept first. A hash table is a lookup table based on a hash function.

What exactly does the hash function do in the search?

Don't worry, this chapter will walk you through the role and construction of hash functions.

1. What does the hash function do in a search?

The hash function calculates the storage location from the key.
Given this function and the search key, the location of the value being searched for can be determined directly:
Address index = f(key).

Simply put, the same function answers both questions: where do we save it, and where do we find it?
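As a minimal sketch of this idea, assume a toy hash function f(key) = key mod 12 and ignore conflicts for now. The names `M`, `f`, `table`, `store` and `lookup` are made up for illustration only.

```python
M = 12  # assumed table length, chosen only for this illustration

def f(key: int) -> int:
    """Toy hash function: maps a key to an address in 0..M-1."""
    return key % M

table = [None] * M

def store(key: int) -> None:
    # Save the key at the address computed by the hash function.
    table[f(key)] = key

def lookup(key: int) -> bool:
    # Recompute the same address and look only there.
    return table[f(key)] == key

store(16)
store(29)
```

Both storing and looking up go through the same f, so no scan of the table is needed.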

2. Construction method of hash function

There are six methods for constructing the hash function, and each method adapts to different scenarios.

1. Direct addressing method

Take the value of a linear function of the key as the hash address
- For example: f(key) = a * key + b
Advantages: simple, uniform, and free of conflicts
Disadvantages: only works when the address space is the same size as the key set
- Simple, but not commonly used
- Requires knowing the distribution of the keys in advance
- Suitable for small, continuous lookup tables
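A small sketch of direct addressing, assuming the keys are birth years from 1980 to 1999 so that a = 1 and b = -1980 map them onto addresses 0 to 19. The data and the names `A`, `B`, `direct_address` and `counts` are hypothetical.

```python
A, B = 1, -1980  # assumed coefficients: f(key) = A * key + B

def direct_address(key: int) -> int:
    """Linear function of the key used directly as the address."""
    return A * key + B

# Hypothetical data: count how many people were born in each year 1980..1999.
counts = [0] * 20
for year in (1980, 1985, 1985, 1999):
    counts[direct_address(year)] += 1
```

Every year gets its own slot, which is why there are no conflicts, but the table must be as large as the whole key range.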

2. Digital analysis method

Extraction
- The extraction method uses part of the key to calculate the hash storage location.
- It is a technique often used in hash functions.
Assuming that each key in the key set consists of n digits (k1, k2, k3, ···, kn), analyze all the keys and extract the evenly distributed digits, or a combination of them, to form the address.
- Even distribution means the extracted digits are unlikely to repeat, so conflicts are unlikely.
Usually used to handle long keys

If you know the distribution of the keys in advance and several digits of the keys are evenly distributed, you can consider using this method.
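To make the analysis step concrete, here is a rough sketch that assumes the keys are made-up 11-digit phone numbers sharing a common prefix. It counts how many distinct digits appear at each position and builds the address from the most varied positions. The helper names `digit_spread` and `digit_hash` are illustrative, not from the original article.

```python
def digit_spread(keys: list[str]) -> list[int]:
    """For each digit position, count how many distinct digits appear."""
    return [len({k[i] for k in keys}) for i in range(len(keys[0]))]

def digit_hash(key: str, positions: list[int]) -> int:
    """Build the address from the chosen digit positions."""
    return int("".join(key[i] for i in positions))

# Hypothetical keys: the first seven digits are identical, the last four vary.
keys = ["13012345678", "13012349911", "13012340452", "13012347306"]
spread = digit_spread(keys)
# Keep only the positions with the most variety (here, the last four digits).
positions = [i for i, s in enumerate(spread) if s == max(spread)]
```

With these sample keys, `digit_hash("13012345678", positions)` uses the last four digits and yields 5678.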


3. Mid-square method

When certain digits repeat frequently across the digit positions of the keys, you can first square the key to amplify the differences, and then take the middle digits of the square as the final storage address.

It is suitable for situations where the distribution of the keys is unknown and the keys do not have many digits.
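A minimal sketch of the idea, assuming decimal keys and a three-digit address. The function name `mid_square` and the `digits` parameter are illustrative choices.

```python
def mid_square(key: int, digits: int = 3) -> int:
    """Square the key and take `digits` digits from the middle of the result."""
    squared = str(key * key)
    # Start index of the middle slice; never negative for short squares.
    start = max(0, (len(squared) - digits) // 2)
    return int(squared[start:start + digits])
```

For example, 1234 squared is 1522756, and the middle three digits give 227.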

4. Folding method

Split the key from left to right into parts with an equal number of digits
- If the last part does not have enough digits, it can be shorter
- Add these parts together
- According to the length of the hash table, take the last few digits of the sum as the hash address
Sometimes this may not ensure a uniform distribution. In that case, you can fold back and forth from one end to the other (reversing alternate parts), then align the parts and add them.

There is no need to know the distribution of the keys in advance, and it is suitable for keys with many digits.
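The steps above can be sketched as follows, assuming decimal keys and a table with 1000 addresses (three-digit parts). The back-and-forth variant shown here reverses every second part, which is only one possible way to fold; the name `fold_hash` and its parameters are hypothetical.

```python
def fold_hash(key: int, part_len: int = 3, back_and_forth: bool = False) -> int:
    """Split the key into parts, add them, keep the last `part_len` digits."""
    s = str(key)
    # Split left to right; the last part may be shorter.
    parts = [s[i:i + part_len] for i in range(0, len(s), part_len)]
    if back_and_forth:
        # Assumed variant: reverse every second part before adding.
        parts = [p[::-1] if i % 2 else p for i, p in enumerate(parts)]
    total = sum(int(p) for p in parts)
    return total % 10 ** part_len
```

For the key 9876543210, plain folding adds 987 + 654 + 321 + 0 = 1962 and keeps 962, while the back-and-forth variant adds 987 + 456 + 321 + 0 = 1764 and keeps 764.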

5. Division remainder method

A commonly used method of constructing hash functions
- H(key) = key MOD p
  - p <= m
  - m is the table length
Choosing p is the key
- The key to this method is selecting a suitable p. If p is poorly chosen, synonyms (different keys with the same hash address) are easily generated
- p should be less than or equal to m, preferably a prime number close to m
  - Prime number: a natural number greater than 1 that is divisible only by 1 and itself
- Or a composite number with no prime factors less than 20
  - Prime factor: a factor that is itself a prime number
  - Composite number: a natural number greater than 1 that is divisible by some number other than 1 and itself
  - Properties of composite numbers:
    - 1. All even numbers greater than 2 are composite numbers
    - 2. All odd numbers greater than 5 whose ones digit is 5 are composite numbers
    - 3. Except for 0, all natural numbers whose ones digit is 0 are composite numbers

Choosing p this way reduces address duplication (conflicts).
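Here is a small sketch of choosing p, under the assumption that "a prime close to m" is read as the largest prime not exceeding m. The helper names are illustrative.

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for small table lengths."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def largest_prime_at_most(m: int) -> int:
    """Assumed rule for picking p: the largest prime <= m."""
    for p in range(m, 1, -1):
        if is_prime(p):
            return p
    raise ValueError("there is no prime <= m")

def division_hash(key: int, p: int) -> int:
    return key % p

keys = [12, 24, 36, 48, 60, 72]
with_p_12 = [division_hash(k, 12) for k in keys]  # every key lands on 0
with_p_11 = [division_hash(k, largest_prime_at_most(12)) for k in keys]
```

With p = 12 all six keys are synonyms at address 0, while p = 11 spreads them over addresses 1 to 6.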

6. Random number method

Take the value of a random function of the key as its hash address
H(key) = Random(key)
- Random is a random function
- It must be a pseudo-random function that returns the same value for the same key, otherwise the record could not be found again

This method is more appropriate when the keys have different lengths.
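One way to sketch Random(key), assuming we seed Python's standard pseudo-random generator with the key itself so the result is repeatable. The name `random_hash` and the parameter `m` (table length) are illustrative.

```python
import random

def random_hash(key, m: int) -> int:
    """Seed a private generator with the key, then draw one address in 0..m-1."""
    rng = random.Random(key)  # same key -> same seed -> same address
    return rng.randrange(m)
```

Because the seed can be a string of any length, keys such as "apple" and "a much longer key" are handled the same way.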

3. Factors to consider when choosing a hash function

With so many methods, which one should I use?

Decision paralysis strikes······

Don't worry, I have some reference criteria here. You can decide which hash function is more suitable by weighing these factors:

The time required to calculate the hash address
The length of the keys
The size of the hash table
The distribution of the keys: whether they are evenly distributed and whether they follow any pattern
How frequently records are searched

A well-designed hash function minimizes conflicts while taking the above factors into account.

4. Resolving hash conflicts

After spending so long on the construction of hash functions, we come to the next highlight: no matter how well your hash function is designed, with enough data you will inevitably run into conflicts.

When a conflict occurs, the program is not as smart as you are and does not know to go find another address. It is a bit of a blockhead that simply stops in its tracks when it hits trouble. So we need to prepare a Plan B for it.

There are four ways to resolve hash table conflicts:

Open addressing
Rehashing
Chained addressing
Common overflow area

Next, we will explain the four methods in detail:

1. Open addressing method

Once there is a conflict, look for the next empty hash address

As long as the hash table is large enough, an empty hash address can always be found and the record stored there
f_i(key) = (f(key) + d_i) MOD m
- d_i = 1, 2, 3, ···, m - 1

Example:

Hash function f(key) = key mod 12
When key = 37, we find that f(37) = 1, which conflicts with the position already occupied by 25 (f(25) = 1)
Using the formula above with d_1 = 1
- f_1(37) = (f(37) + 1) mod 12 = 2
- So 37 is stored at subscript 2
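The worked example can be reproduced with a short sketch, assuming d_i simply counts up from 1. The names `M`, `f`, `table` and `insert` are illustrative.

```python
M = 12  # table length from the example
table = [None] * M

def f(key: int) -> int:
    return key % M

def insert(key: int) -> int:
    """Try f(key), then (f(key) + d) mod M for d = 1, 2, ..., M - 1."""
    for d in range(M):  # d = 0 is the home address itself
        addr = (f(key) + d) % M
        if table[addr] is None:
            table[addr] = key
            return addr
    raise OverflowError("hash table is full")

first = insert(25)   # home address 1 is free
second = insert(37)  # home address 1 is taken, so d = 1 gives address 2
```

Here `first` is 1 and `second` is 2, matching the calculation above.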

Open addressing has three probing strategies:

Linear probing
- Find a free hash address by continuously increasing the value of d_i: d_i = d_i + 1
- Clustering
  - If 48 and 37 are not synonyms but still have to compete for the same address, this is called clustering
  - Once clustering appears, we have to deal with conflicts constantly, and the efficiency of both storing and searching drops greatly
Quadratic probing
- Linear probing only ever explores backwards (toward higher addresses). If there is an empty position in front of the conflict, we still keep probing backwards; we get a result eventually, but the efficiency is very poor.
- So we can probe in both directions to find possible locations.
- Quadratic probing also squares the offset; the purpose of the square is to keep the keys from clustering in one area
  - f_i(key) = (f(key) + d_i) MOD m
  - d_i = 1², -1², 2², -2², ···
Random probing
- d_i is a sequence of pseudo-random numbers
- When a conflict occurs, the displacement d_i is calculated by a random function
  - The random function used here is a pseudo-random function
  - Given the same random seed, calling the random function repeatedly generates the same sequence of numbers every time
- Searching
  - Because the random seed is the same, the sequence we get when searching is the same as when storing, and the same d_i naturally gives the same hash address
  - f_i(key) = (f(key) + d_i) MOD m
    - d_i is a pseudo-random number sequence

In short, as long as the hash table is not full, open addressing can find an address that does not conflict (linear probing is guaranteed to, since it visits every address). It is a commonly used method for resolving conflicts.
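The three strategies differ only in the offset sequence d_i, which the following sketch makes explicit. It assumes f(key) = key mod m, stops the quadratic sequence at q = m // 2, and models the pseudo-random sequence as a seeded shuffle of 1..m-1; all function names are hypothetical.

```python
import random

def linear_offsets(m: int) -> list[int]:
    """d_i = 1, 2, ..., m - 1"""
    return list(range(1, m))

def quadratic_offsets(m: int) -> list[int]:
    """d_i = 1^2, -1^2, 2^2, -2^2, ... (assumed to stop at q = m // 2)"""
    offsets = []
    for q in range(1, m // 2 + 1):
        offsets.extend([q * q, -(q * q)])
    return offsets

def random_offsets(m: int, seed: int = 0) -> list[int]:
    """Seeded shuffle of 1..m-1: the same seed always gives the same sequence."""
    offsets = list(range(1, m))
    random.Random(seed).shuffle(offsets)
    return offsets

def find_free_address(table: list, key: int, offsets: list[int]):
    """Apply f_i(key) = (f(key) + d_i) mod m until an empty slot is found."""
    m = len(table)
    home = key % m
    for d in [0] + offsets:
        addr = (home + d) % m  # Python's % keeps negative offsets in range
        if table[addr] is None:
            return addr
    return None

table = [None] * 12
table[1], table[2] = 25, 37  # addresses 1 and 2 are already occupied
```

For key 49 (home address 1), linear probing moves on to address 3, while quadratic probing finds the empty address 0 in front of the conflict.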

2. Rehashing method

Prepare several hash functions; if one hash function produces a conflict, use the next one
- f_i(key) = RH_i(key)
  - i = 1, 2, ···, k
  - RH_i: different hash functions

This method keeps the keys from clustering, but of course it also increases the calculation time accordingly.
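A rough sketch of rehashing, assuming a table of length 12 and three arbitrarily chosen functions standing in for RH_1 to RH_3. Neither the functions nor the name `rehash_insert` come from the original article.

```python
def rehash_insert(table: list, key: int, hash_functions: list) -> int:
    """Try RH_1, RH_2, ... in order until one yields an empty address."""
    for rh in hash_functions:
        addr = rh(key) % len(table)
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("every prepared hash function produced a conflict")

# Assumed example functions; any set of different hash functions would do.
hash_functions = [
    lambda k: k % 12,
    lambda k: k % 11,
    lambda k: (k * k) // 10,
]

table = [None] * 12
addr_25 = rehash_insert(table, 25, hash_functions)  # RH_1 gives 1
addr_37 = rehash_insert(table, 37, hash_functions)  # RH_1 conflicts, RH_2 gives 4
```

A search has to try the same functions in the same order, which is where the extra calculation time comes from.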

3. Chained addressing method

Resolve conflicts in place, without looking for another address in the table
Store only the head pointers of the synonym sub-tables in the hash table
- Synonym sub-table
  - All records whose keys are synonyms are stored in one singly linked list
Advantages: handles conflicts effectively and guarantees that an address can always be found for a record
Disadvantages: the performance cost of traversing the singly linked list when searching
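A minimal sketch of this layout, assuming f(key) = key mod m and inserting new records at the head of each list. The class names `Node` and `ChainedHashTable` are illustrative.

```python
class Node:
    """One record in a synonym sub-table (singly linked list)."""
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

class ChainedHashTable:
    def __init__(self, m: int = 12):
        self.m = m
        self.heads = [None] * m  # the table itself holds only head pointers

    def insert(self, key: int) -> None:
        i = key % self.m
        self.heads[i] = Node(key, self.heads[i])  # push onto the front of the list

    def search(self, key: int) -> bool:
        node = self.heads[key % self.m]
        while node is not None:  # walk the synonym list
            if node.key == key:
                return True
            node = node.next
        return False

t = ChainedHashTable()
for k in (25, 37, 49, 16):
    t.insert(k)
```

Keys 25, 37 and 49 all hash to address 1 and end up in the same list, so inserting never fails, but finding 25 requires walking past 49 and 37.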


4. Common overflow area method

Create a separate storage area (the overflow table) to store conflicting data
When looking up
- Calculate the hash address of the given value with the hash function
- First compare with the corresponding position of the base table
  - Equal: the search succeeds
  - Not equal: search the overflow table sequentially

Suitable for situations where relatively little data conflicts
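The lookup steps above can be sketched like this, assuming f(key) = key mod m and a plain list as the overflow table. The class and attribute names are hypothetical.

```python
class OverflowHashTable:
    def __init__(self, m: int = 12):
        self.m = m
        self.base = [None] * m  # base table, addressed by the hash function
        self.overflow = []      # shared overflow table for all conflicting keys

    def insert(self, key: int) -> None:
        i = key % self.m
        if self.base[i] is None:
            self.base[i] = key
        else:
            self.overflow.append(key)  # any conflict goes to the overflow table

    def search(self, key: int) -> bool:
        if self.base[key % self.m] == key:
            return True              # found in the base table
        return key in self.overflow  # otherwise scan the overflow table in order

t = OverflowHashTable()
for k in (25, 37, 48):
    t.insert(k)
```

Here 37 conflicts with 25 at address 1 and is placed in the overflow table, so finding it costs one base-table comparison plus a sequential scan.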


Now that we have resolved the conflict problem, let's add three more knowledge points about searching.

5. Searching a hash table

The lookup process mirrors the table-building process

Assuming that open addressing is used to handle conflicts, the search process is:

For a given key, calculate the hash address index = f(key)
If arr[index] is empty, the search fails
If arr[index] == key, the search succeeds
Otherwise, use the conflict resolution method to find the next address, repeating until
arr[index] == key or arr[index] == null
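The search steps can be sketched as follows, assuming f(key) = key mod m, linear probing as the conflict resolution method, and `None` as the empty marker. The cap of m probes is an added safeguard for a completely full table.

```python
def search(arr: list, key: int):
    """Return the index of key in arr, or None if the search fails."""
    m = len(arr)
    index = key % m  # index = f(key)
    for _ in range(m):  # at most m probes, so a full table cannot loop forever
        if arr[index] is None:
            return None  # empty slot: the key was never stored
        if arr[index] == key:
            return index  # search succeeds
        index = (index + 1) % m  # next address by linear probing
    return None

arr = [None] * 12
arr[1], arr[2] = 25, 37  # 37 was displaced from address 1 to address 2
```

Searching for 37 starts at address 1, sees 25, and succeeds at address 2; searching for 49 stops at the empty address 3 and fails.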

6. Implementation of the hash table search algorithm

The first step is to define the structure of the hash table and some related constants
- HashTable: the hash table structure
- Elem in the structure: a dynamic array
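The original definitions are not shown here, so the following is only a guess at their shape, written in Python. The constants `HASHSIZE` and `NULLKEY` and the `count` field are assumptions; only the names HashTable and Elem come from the text.

```python
from dataclasses import dataclass, field

HASHSIZE = 12   # assumed constant: length of the hash table
NULLKEY = None  # assumed constant: marker for an empty slot

@dataclass
class HashTable:
    # The Elem dynamic array from the text, initially all empty.
    elem: list = field(default_factory=lambda: [NULLKEY] * HASHSIZE)
    # Assumed bookkeeping field: number of records currently stored.
    count: int = 0

table = HashTable()
```

Using `default_factory` gives every table its own array instead of sharing one list between instances.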

7. Search efficiency of the hash table

Factors that determine the average search length (ASL) of a hash table lookup

The hash function used
The method chosen for handling conflicts
The saturation of the hash table, i.e. the load factor α = n / m
- n is the number of records actually stored
- m is the table length

Under normal circumstances, the hash function is assumed to be uniform, so its influence need not be considered when discussing ASL.

The ASL of a hash table is therefore a function of the conflict handling method and the load factor
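To see the load factor and ASL on concrete numbers, here is a small sketch that assumes f(key) = key mod m with linear probing and measures the average number of probes for a successful search. The function name and the sample keys are made up.

```python
def load_factor_and_asl(keys: list[int], m: int) -> tuple[float, float]:
    """Build a table with linear probing; return (alpha, ASL for successful search)."""
    if len(keys) > m:
        raise ValueError("more keys than table slots")
    table = [None] * m
    total_probes = 0
    for key in keys:
        index = key % m
        probes = 1
        while table[index] is not None:  # conflict: probe the next address
            index = (index + 1) % m
            probes += 1
        table[index] = key
        # Without deletions, finding this key later costs the same number of probes.
        total_probes += probes
    return len(keys) / m, total_probes / len(keys)

alpha, asl = load_factor_and_asl([12, 25, 37, 16, 29, 48], 12)
```

For these six keys the load factor is 0.5 and the probe counts are 1, 1, 2, 1, 1 and 4, giving an ASL of 10/6, or about 1.67.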

This is the end of the theory of hash tables. The next article is the last one in our hash table series, and it will specifically cover how hash tables are implemented.

The above is all the content of this article. If you found it helpful,
please like, bookmark, and share!!!
Every like is my biggest motivation to keep updating~

See you next time!