How to resolve hash conflicts

A well-designed hash function reduces collisions, but it is generally impossible to avoid them completely, so resolving collisions is another key issue in hashing. Collisions arise both when creating a hash table and when looking up a key in it, and the same resolution method must be used in both cases. The following uses creating a hash table as the example to illustrate how collisions are resolved. There are four commonly used collision resolution methods:

Open addressing

This method is also called the rehash method. Its basic idea is: when the hash address p = H(key) of the keyword key collides, another hash address p1 is generated based on p; if p1 still collides, another hash address p2 is generated based on p, and so on, until a non-conflicting hash address pi is found and the corresponding element is stored there. The rehash function has the general form:

Hi = (H(key) + di) % m,  i = 1, 2, …, n

Here H(key) is the hash function, m is the table length, and di is called the increment sequence. Different choices of increment sequence give different rehashing methods. There are three main types:

Linear Probe Rehashing

di = 1, 2, 3, …, m-1

With this method, when a collision occurs, the following cells of the table are examined one after another until an empty cell is found or the entire table has been searched.

Quadratic Probe Rehashing

di = 1^2, -1^2, 2^2, -2^2, …, k^2, -k^2    (k <= m/2)

With this method, when a collision occurs, probing jumps alternately to the right and left of the original address, which makes it more flexible.

Pseudorandom Probe Rehashing

di = a sequence of pseudorandom numbers.

In a concrete implementation, a pseudorandom number generator must be set up (for example i = (i + p) % m) and given a random number as its starting point.

For example, suppose the hash table length is m = 11 and the hash function is H(key) = key % 11. Then H(47) = 3, H(26) = 4, and H(60) = 5. Suppose the next keyword is 69; then H(69) = 3, which collides with 47.

If linear probe rehashing is used to handle the collision, the next hash address is H1 = (3 + 1) % 11 = 4, which still collides; the next is H2 = (3 + 2) % 11 = 5, which also collides; the next is H3 = (3 + 3) % 11 = 6, which is free, so 69 is stored in cell 6.
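The linear probing walk-through above can be reproduced with a short Python sketch. The function name `linear_probe_insert` and the use of `None` to mark an empty cell are illustrative assumptions, not part of the original text.

```python
def linear_probe_insert(table, key):
    """Insert key with linear probing and return the cell index used."""
    m = len(table)
    home = key % m                  # H(key) = key % m
    for d in range(m):              # d = 0 is the home cell, then d = 1, 2, ..., m-1
        slot = (home + d) % m
        if table[slot] is None:     # None marks an empty cell (assumption)
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


table = [None] * 11
for k in (47, 26, 60):
    linear_probe_insert(table, k)   # these land in cells 3, 4, 5

slot_69 = linear_probe_insert(table, 69)
print(slot_69)                      # 6
```

Cells 3, 4, and 5 are tried and found occupied before 69 settles in cell 6, which matches H3 in the example.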

If quadratic probe rehashing is used to handle the collision, the next hash address is H1 = (3 + 1^2) % 11 = 4, which still collides; the next is H2 = (3 - 1^2) % 11 = 2, which is free, so 69 is stored in cell 2.
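A minimal Python sketch of the same quadratic probing example follows. The names `quadratic_increments` and `quadratic_probe_insert` are hypothetical, and `None` again stands for an empty cell.

```python
def quadratic_increments(m):
    """Yield 1^2, -1^2, 2^2, -2^2, ..., k^2, -k^2 for k <= m // 2."""
    for k in range(1, m // 2 + 1):
        yield k * k
        yield -(k * k)


def quadratic_probe_insert(table, key):
    """Insert key with quadratic probing and return the cell index used."""
    m = len(table)
    home = key % m
    if table[home] is None:         # home cell is free, no probing needed
        table[home] = key
        return home
    for d in quadratic_increments(m):
        slot = (home + d) % m       # Python's % stays non-negative for m > 0
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free cell found along the probe sequence")


table = [None] * 11
for k in (47, 26, 60):
    quadratic_probe_insert(table, k)

slot_69 = quadratic_probe_insert(table, 69)
print(slot_69)                      # 2
```

The negative increments are what let the probe step to the left of the home address, here from cell 3 to cell 2.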

If pseudorandom probe rehashing is used to handle the collision, and the pseudorandom number sequence is 2, 5, 9, …, then the next hash address is H1 = (3 + 2) % 11 = 5, which still collides; the next is H2 = (3 + 5) % 11 = 8, which is free, so 69 is stored in cell 8.
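The pseudorandom case can be sketched the same way. Here `probe_insert` takes the increment sequence as an argument so the sequence 2, 5, 9 from the example can be passed in directly. The helper `pseudo_random_increments` is only one assumed way of getting a reproducible sequence from a seed (the starting point); it is not the generator described in the text. Both names are hypothetical.

```python
import random


def pseudo_random_increments(m, seed):
    """Return a reproducible permutation of 1..m-1 to use as d1, d2, ...

    The same seed always yields the same sequence, so insertion and
    lookup can follow identical probe paths.
    """
    return random.Random(seed).sample(range(1, m), m - 1)


def probe_insert(table, key, increments):
    """Insert key, probing the home cell first and then home + d for each d."""
    m = len(table)
    home = key % m
    for d in [0, *increments]:
        slot = (home + d) % m
        if table[slot] is None:     # None marks an empty cell (assumption)
            table[slot] = key
            return slot
    raise RuntimeError("no free cell found along the probe sequence")


table = [None] * 11
for k in (47, 26, 60):
    probe_insert(table, k, [2, 5, 9])

slot_69 = probe_insert(table, 69, [2, 5, 9])
print(slot_69)                      # 8
```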

Rehashing

This approach constructs several different hash functions in advance:

Hi = RHi(key),  i = 1, 2, …, k

When the hash address H1 = RH1(key) collides, H2 = RH2(key) is calculated, and so on, until no collision occurs. This method is less prone to clustering, but it increases computation time.
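As a rough illustration, the sketch below tries a list of hash functions in order until one yields a free cell. The three functions in `HASH_FUNCTIONS` are made up for this example (the text does not specify RH1 to RHk), and `rehash_insert` is a hypothetical name.

```python
# Illustrative hash functions only; each must map a key into 0..10.
HASH_FUNCTIONS = [
    lambda key: key % 11,           # RH1
    lambda key: key % 7,            # RH2 (assumed)
    lambda key: (key // 11) % 11,   # RH3 (assumed)
]


def rehash_insert(table, key):
    """Try RH1, RH2, ... in turn and store key in the first free cell."""
    for rh in HASH_FUNCTIONS:
        slot = rh(key)
        if table[slot] is None:     # None marks an empty cell (assumption)
            table[slot] = key
            return slot
    raise RuntimeError("every hash function collided")


table = [None] * 11
for k in (47, 26, 60):
    rehash_insert(table, k)         # RH1 places them in cells 3, 4, 5

slot_69 = rehash_insert(table, 69)  # RH1(69) = 3 collides, RH2(69) = 6 is free
print(slot_69)                      # 6
```

Each retry evaluates a whole new hash function, which is where the extra computation time comes from.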

Chain address method

The basic idea of this method is to link all elements whose hash address is i into a singly linked list called a synonym chain, and to store the head pointer of that list in the i-th cell of the hash table, so that searching, insertion, and deletion take place mainly in the synonym chains. The chain address method is suitable when insertions and deletions are frequent.
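A minimal sketch of the chain address method, assuming H(key) = key % m and insertion at the head of each chain. The class names `Node` and `ChainedHashTable` are illustrative.

```python
class Node:
    """One element of a synonym chain."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedHashTable:
    def __init__(self, m=11):
        self.m = m
        self.heads = [None] * m     # heads[i] is the head pointer of chain i

    def insert(self, key):
        i = key % self.m
        self.heads[i] = Node(key, self.heads[i])   # push onto the front

    def contains(self, key):
        node = self.heads[key % self.m]
        while node is not None:
            if node.key == key:
                return True
            node = node.next
        return False

    def delete(self, key):
        i = key % self.m
        prev, node = None, self.heads[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[i] = node.next      # unlink the head node
                else:
                    prev.next = node.next          # unlink an inner node
                return True
            prev, node = node, node.next
        return False


t = ChainedHashTable()
for k in (47, 26, 60, 69):
    t.insert(k)                     # 47 and 69 share chain 3

print(t.contains(69), t.delete(47), t.contains(47))   # True True False
```

Deletion only relinks pointers inside one chain, so no other cell of the table is touched.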

 

Create a common overflow area

The basic idea of this method is to divide the hash table into two parts, a basic table and an overflow table; every element that collides with an entry in the basic table is placed in the overflow table.
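One way to picture the common overflow area is the sketch below, which assumes H(key) = key % m and an overflow table that is simply appended to and scanned sequentially. The class name `OverflowHashTable` and that scanning strategy are assumptions for illustration.

```python
class OverflowHashTable:
    def __init__(self, m=11):
        self.m = m
        self.base = [None] * m      # basic table, one key per cell
        self.overflow = []          # common overflow table shared by all cells

    def insert(self, key):
        i = key % self.m
        if self.base[i] is None:
            self.base[i] = key
        else:
            self.overflow.append(key)   # collision: goes to the overflow table

    def contains(self, key):
        if self.base[key % self.m] == key:
            return True
        return key in self.overflow     # sequential scan of the overflow table


t = OverflowHashTable()
for k in (47, 26, 60, 69):
    t.insert(k)

print(t.base[3], t.overflow)        # 47 [69]
```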

 


Advantages and disadvantages

Open hashing/chain address method (bucket chain structure)

1) Advantages:
① It copes well when the total number of records changes frequently, because the overhead of dynamically resizing the table is avoided.
② Records are stored in nodes, and nodes are allocated dynamically, so no memory is wasted. This makes it especially suitable when the records themselves are large, because the pointer overhead is then negligible.
③ Deleting a record is convenient, since it can be done directly through the pointers.

2) Disadvantages:
① The stored records are scattered across memory, so a lookup has to jump between nodes and incurs extra time overhead compared with compact structures such as arrays.
② If all key-value pairs are known in advance and never change (insertion and deletion are not allowed), a collision-free perfect hash function can be constructed by hand, and closed hashing will then perform much better than open hashing.
③ Because pointers are used, the records are not easy to serialize.

Closed hashing/open addressing

1) Advantages:
① Records are easier to serialize.
② If the total number of records can be predicted, a perfect hash function can be created, and data processing is then very efficient.

2) Disadvantages:
① The number of stored records cannot exceed the length of the bucket array. If it does, the array must be expanded, and the expansion makes the time cost of that one operation soar, which may be a serious defect in real-time or interactive applications.
② Computing the probe sequence may itself be too expensive, which lowers the processing performance of the hash table.
③ Records are stored in the bucket array, and the bucket array must contain empty slots, so when the records themselves are large and there are many of them, the space taken up by empty slots wastes a noticeable amount of memory.
④ Deleting records is troublesome. Suppose record a is to be deleted, and record b was inserted after a, collided with a, and was placed at an address found by following the probe sequence. If a is deleted directly, its position becomes an empty slot, and an empty slot is the termination condition for a failed lookup, so record b becomes invisible until data is inserted at a's position again. Therefore a cannot be deleted directly; a deletion flag must be set instead, which requires extra space and extra handling.
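The deletion problem for open addressing can be made concrete with a small sketch that uses linear probing and a deletion flag (often called a tombstone). The names `EMPTY`, `DELETED`, `insert`, `find`, and `delete` are illustrative, and `insert` assumes the key is not already in the table.

```python
EMPTY = None
DELETED = object()                  # deletion flag (tombstone)


def insert(table, key):
    """Store key in the first empty or flagged cell on its probe path."""
    m = len(table)
    for d in range(m):
        slot = (key % m + d) % m
        if table[slot] is EMPTY or table[slot] is DELETED:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def find(table, key):
    """Return the cell holding key, or -1 if it is absent."""
    m = len(table)
    for d in range(m):
        slot = (key % m + d) % m
        if table[slot] is EMPTY:    # only a truly empty cell ends the search
            return -1
        if table[slot] is not DELETED and table[slot] == key:
            return slot
    return -1


def delete(table, key):
    """Flag the cell instead of emptying it, so later keys stay reachable."""
    slot = find(table, key)
    if slot == -1:
        return False
    table[slot] = DELETED
    return True


table = [EMPTY] * 11
insert(table, 47)                   # record a, cell 3
insert(table, 69)                   # record b collides with a, cell 4
delete(table, 47)
print(find(table, 69))              # 4
```

Had cell 3 been reset to `EMPTY`, the search for 69 would have stopped there and reported the record as missing, which is exactly the failure described in point ④.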

