Hash tables and methods of handling collisions

Hashing is also known as the scatter method or the key-to-address calculation method, and the corresponding table is called a hash table. The basic idea is to establish a correspondence f between an element's key k and its storage location p, so that p = f(k); f is called the hash function. When the hash table is built, the element with key k is stored directly in the unit at address f(k). When that element is looked up later, the same hash function is used to compute the storage location p = f(k), so the element can be accessed directly by its key.

When the key set is very large, elements with different keys may be mapped to the same address in the hash table, that is, k1 ≠ k2 but H(k1) = H(k2). This phenomenon is called a collision (or conflict), and k1 and k2 are called synonyms. In practice collisions are unavoidable; a better hash function can only make them less frequent.

In summary, hashing involves two main issues:

 1) How to construct a hash function.

 2) How to handle collisions.

8.4.1 Constructing the hash function

The principles for constructing a hash function are: ① the function itself is easy to compute; ② the computed addresses are evenly distributed, that is, for any key k, f(k) is equally likely to be any address, so that collisions are reduced as far as possible.

The five methods commonly used to construct hash functions are described below.

1. Digit analysis method

If the set of keys is known in advance and each key has more digits than a hash-table address, several digits whose values are distributed relatively evenly can be selected from the key to form the hash address. For example, suppose there are 80 records and each key is an 8-digit decimal integer d1 d2 d3 … d7 d8. If the length of the hash table is 100, its address space is 00 to 99. Suppose analysis shows that the values of d4 and d7 are distributed relatively uniformly across the keys; the hash function is then h(key) = h(d1 d2 d3 … d7 d8) = d4 d7. For example, h(81346532) = 43 and h(81301367) = 06. Conversely, suppose analysis shows that the values of d1 and d8 are extremely uneven, with d1 always equal to 5 and d8 always equal to 2. If the hash function were h(key) = h(d1 d2 d3 … d7 d8) = d1 d8, every key would get the address 52, which is clearly undesirable.
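The digit selection can be sketched in a few lines of Python. The function name and its parameters are illustrative assumptions; only the choice of d4 and d7 and the two sample keys come from the text.

```python
def digit_analysis_hash(key: int, positions=(4, 7), width: int = 8) -> int:
    """Build a hash address from selected decimal digits of the key.

    positions are 1-based and counted from the left, so d1 is the leading digit.
    """
    digits = str(key).zfill(width)                      # "d1 d2 ... d8" as a string
    picked = "".join(digits[p - 1] for p in positions)  # e.g. d4 followed by d7
    return int(picked)


print(f"{digit_analysis_hash(81346532):02d}")  # 43
print(f"{digit_analysis_hash(81301367):02d}")  # 06
```

Passing positions=(1, 8) instead reproduces the bad choice described above: every key that starts with 5 and ends with 2 lands on address 52.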

2. Mid-square method

When it cannot be determined which digits of the key are evenly distributed, the key can be squared first and the middle digits of the square taken as the hash address, as many as needed. This works because the middle digits of the square depend on every digit of the key, so different keys are likely to produce different hash addresses.

Example: take the position of an English letter in the alphabet as its internal code. The internal code of K is 11, of E is 05, of Y is 25, of A is 01, and of B is 02, so the internal code of the key "KEYA" is 11052501. The internal codes of the keys "KYAB", "AKEY" and "BKEY" are obtained in the same way. After the internal code is squared, the 7th to 9th digits of the square are taken as the hash address of the key, as shown in Figure 8.23.
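A minimal Python sketch of this example follows. The helper names are hypothetical, and the text does not say from which end the digits are counted; the sketch counts from the low-order end, which gives the same three digits as counting from the left whenever the square has 15 digits, as it does for "KEYA".

```python
def internal_code(word: str) -> int:
    """Concatenate the two-digit alphabet positions of the letters (A=01 ... Z=26)."""
    return int("".join(f"{ord(c) - ord('A') + 1:02d}" for c in word.upper()))


def mid_square_hash(code: int) -> int:
    """Digits 7 to 9 of code squared, counted from the low-order end (assumption)."""
    return (code * code // 10**6) % 1000


code = internal_code("KEYA")
print(code)                   # 11052501
print(code * code)            # 122157778355001
print(mid_square_hash(code))  # 778
```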

 

 

3. Folding method

This method divides the key into several parts with the same number of digits as a hash-table address (the last part may be shorter), adds the parts together, and discards the highest carry; the result is the hash address of the key. There are two variants, shift folding and boundary folding. Shift folding aligns the low-order digits of the parts and adds them. Boundary folding folds the key back and forth along the dividing boundaries from one end to the other (odd-numbered segments are kept in their original order and even-numbered segments are reversed) and then adds the segments. For example, with key = 12360324711202065 and a hash table of length 1000, the key is divided into 3-digit segments and the lowest two digits, 65, are discarded here. Shift folding gives 123 + 603 + 247 + 112 + 020 = 1105, so the hash address is 105; boundary folding gives 123 + 306 + 247 + 211 + 020 = 907, so the hash address is 907, as shown in Figure 8.24.
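Both variants can be checked with a short Python sketch. The function names are illustrative assumptions, and the sketch follows the worked example in dropping a short trailing segment rather than adding it.

```python
def split_segments(key: int, width: int = 3) -> list[str]:
    """Cut the decimal key into width-digit segments from the left; drop a short tail."""
    digits = str(key)
    usable = len(digits) - len(digits) % width
    return [digits[i:i + width] for i in range(0, usable, width)]


def shift_fold(key: int, width: int = 3) -> int:
    """Shift folding: add the segments as they are, then discard the carry."""
    return sum(int(s) for s in split_segments(key, width)) % 10**width


def boundary_fold(key: int, width: int = 3) -> int:
    """Boundary folding: reverse every second segment before adding."""
    segments = split_segments(key, width)
    total = sum(int(s if i % 2 == 0 else s[::-1]) for i, s in enumerate(segments))
    return total % 10**width


key = 12360324711202065
print(split_segments(key))  # ['123', '603', '247', '112', '020']
print(shift_fold(key))      # 105
print(boundary_fold(key))   # 907
```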

 

4. Division-remainder method

Let m be the length of the hash table and p the largest prime number less than or equal to m. The hash function is

h(k) = k % p, where % denotes the remainder after division by p.

For example, suppose the elements to be hashed are (18, 75, 60, 43, 54, 90, 46), with table length m = 10 and p = 7. Then:

    h(18)=18 % 7=4    h(75)=75 % 7=5    h(60)=60 % 7=4   

    h(43)=43 % 7=1    h(54)=54 % 7=5    h(90)=90 % 7=6   

    h(46)=46 % 7=4

There are many collisions here: 18, 60 and 46 all map to address 4, and 75 and 54 both map to address 5. To reduce collisions, larger values of m and p can be chosen, for example m = p = 13, which gives:

    h(18)=18 % 13=5    h(75)=75 % 13=10    h(60)=60 % 13=8    

    h(43)=43 % 13=4    h(54)=54 % 13=2    h(90)=90 % 13=12   

    h(46)=46 % 13=7

There are no collisions in this case, as shown in Figure 8.25.
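The two tables above can be reproduced with the following Python sketch. Both helper functions are assumptions introduced for illustration; only the keys and the values of m and p come from the example.

```python
from collections import defaultdict


def largest_prime_at_most(m: int) -> int:
    """Return the largest prime p with p <= m (m must be at least 2)."""
    def is_prime(n: int) -> bool:
        return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))
    return next(p for p in range(m, 1, -1) if is_prime(p))


def collisions(keys, p):
    """Group keys by k % p and keep only the addresses shared by several keys."""
    buckets = defaultdict(list)
    for k in keys:
        buckets[k % p].append(k)
    return {addr: ks for addr, ks in buckets.items() if len(ks) > 1}


keys = [18, 75, 60, 43, 54, 90, 46]
print(largest_prime_at_most(10))  # 7
print(collisions(keys, 7))        # {4: [18, 60, 46], 5: [75, 54]}
print(collisions(keys, 13))       # {}
```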

 

5. Pseudo-random number method

    A pseudo-random function is used as the hash function, that is, h(key)=random(key).
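The text does not define random(key). One possible reading, shown here only as a sketch, is to seed a pseudo-random generator with the key, so that the same key always yields the same address. The function name and the parameter m (the table length) are assumptions.

```python
import random


def pseudo_random_hash(key: int, m: int) -> int:
    """Seed a generator with the key so that equal keys map to the same address."""
    return random.Random(key).randrange(m)   # address in the range 0 .. m-1


for key in (18, 75, 60):
    # Hashing the same key twice gives the same address.
    print(key, pseudo_random_hash(key, 13), pseudo_random_hash(key, 13))
```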

In practical applications, the method should be chosen flexibly according to the specific situation, and its performance should be tested with real data before a judgment is made. The following five factors should generally be considered:

• The time required to compute the hash function (its simplicity).

• The length of the key.

• The size of the hash table.

• The distribution of the keys.

• The frequency with which records are searched.

8.4.2 Methods of handling collisions

A well-designed hash function reduces collisions but generally cannot avoid them completely, so resolving collisions is the other key issue in hashing. Collisions arise both when a hash table is built and when it is searched, and the same resolution method must be used in both cases. The following uses table construction as the example. There are four commonly used collision resolution methods:

1. Open addressing method

This method is also called the re-hash method. Its basic idea is: when the hash address p = H(key) of a key collides, another hash address p1 is generated based on p; if p1 still collides, another hash address p2 is generated based on p, and so on, until a hash address pi that does not collide is found, and the element is stored there. The general form of the re-hash function is:

Hi = (H(key) + di) % m,    i = 1, 2, …, n

where H(key) is the hash function, m is the table length, and di is called the increment sequence. Different choices of increment sequence give different re-hashing methods. There are three main types:

• Linear probing

di = 1, 2, 3, …, m-1

With this method, when a collision occurs the following units of the table are examined one after another until an empty unit is found or the whole table has been searched.

• Quadratic probing

di = 1², -1², 2², -2², …, k², -k²    (k ≤ m/2)

With this method, when a collision occurs the probe jumps alternately to either side of the colliding address, which is more flexible.

• Pseudo-random probing

di = a sequence of pseudo-random numbers.

In an implementation, a pseudo-random number generator must be set up (for example, i = (i + p) % m) and given a random number as its starting point.

For example, let the hash table length be m = 11 and the hash function be H(key) = key % 11, so that H(47) = 3, H(26) = 4 and H(60) = 5. Suppose the next key is 69. Then H(69) = 3, which collides with 47.

If linear probing is used, the next hash address is H1 = (3 + 1) % 11 = 4, which still collides; the next is H2 = (3 + 2) % 11 = 5, which also collides; the next is H3 = (3 + 3) % 11 = 6, which is free, so 69 is stored in unit 6, see Figure 8.26 (a).

If quadratic probing is used, the next hash address is H1 = (3 + 1²) % 11 = 4, which still collides; the next is H2 = (3 - 1²) % 11 = 2, which is free, so 69 is stored in unit 2, see Figure 8.26 (b).

If pseudo-random probing is used and the pseudo-random number sequence is 2, 5, 9, …, the next hash address is H1 = (3 + 2) % 11 = 5, which still collides; the next is H2 = (3 + 5) % 11 = 8, which is free, so 69 is stored in unit 8, see Figure 8.26 (c).

The example shows that linear probing is prone to "secondary aggregation" (English-language texts usually call this effect primary clustering): resolving a collision between synonyms also causes collisions between keys that are not synonyms. For example, when units i, i+1 and i+2 of the table are occupied, any later element whose hash address is i, i+1, i+2 or i+3 will be placed in unit i+3, even though these four elements are not synonyms of one another. The advantage of linear probing is that a free address can always be found as long as the hash table is not full; quadratic probing and pseudo-random probing do not guarantee this.
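The three probe sequences in this example can be replayed with a short Python sketch. The function names are hypothetical; the table contents, the hash function and the increment sequences are those of the example, with quadratic probing limited to k ≤ 5 because m/2 = 5.5.

```python
def probe_insert(table, key, increments):
    """Insert key using Hi = (H(key) + di) % m and return the unit that was used."""
    m = len(table)
    home = key % m                       # H(key) = key % m, as in the example
    if table[home] is None:
        table[home] = key
        return home
    for d in increments:                 # d1, d2, ... in order
        slot = (home + d) % m            # Python's % is non-negative for negative d
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free unit found along this probe sequence")


def fresh_table():
    """Table of length 11 already holding 47, 26 and 60 in units 3, 4 and 5."""
    table = [None] * 11
    for k in (47, 26, 60):
        table[k % 11] = k
    return table


linear = range(1, 11)                                          # 1, 2, ..., m-1
quadratic = [s * k * k for k in range(1, 6) for s in (1, -1)]  # 1, -1, 4, -4, ...
pseudo = [2, 5, 9]                                             # sequence from the text

print(probe_insert(fresh_table(), 69, linear))     # 6
print(probe_insert(fresh_table(), 69, quadratic))  # 2
print(probe_insert(fresh_table(), 69, pseudo))     # 8
```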

2. Rehashing method

This approach constructs several different hash functions in advance:

Hi = RHi(key),    i = 1, 2, …, k

When the hash address H1 = RH1(key) collides, H2 = RH2(key) is computed, and so on, until no collision occurs. This method is less prone to aggregation, but it increases computation time.
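A minimal Python sketch of this idea follows. The text names no concrete functions RH1, RH2, …, so the three lambdas below are hypothetical choices made only for illustration, as is the function name rehash_insert.

```python
def rehash_insert(table, key, hash_functions):
    """Try RH1, RH2, ... in order and store key at the first free address."""
    for rh in hash_functions:
        slot = rh(key)
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("every hash function produced a collision")


# Hypothetical functions for a table of length 11; the source specifies none.
functions = [
    lambda k: k % 11,
    lambda k: k % 7,
    lambda k: (k // 11) % 11,
]

table = [None] * 11
for key in (47, 69, 25):
    print(key, "->", rehash_insert(table, key, functions))
# 47 -> 3
# 69 -> 6   (RH1 gives 3, already taken; RH2 gives 6)
# 25 -> 4   (RH1 gives 3, already taken; RH2 gives 4)
```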

3. Chain address method

The basic idea of this method is to link all elements whose hash address is i into a singly linked list, called a synonym chain, and to store the head pointer of that list in the i-th unit of the hash table. Searching, insertion and deletion are therefore carried out mainly within the synonym chains. The chain address method is suitable when insertions and deletions are frequent.

For example, given the keys (32, 40, 36, 53, 16, 46, 71, 27, 42, 24, 49, 64), a hash table of length 13 and the hash function H(key) = key % 13, the result of handling collisions with the chain address method is shown in Figure 8.27:

 

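Seven keys are found with one comparison, four with two and one with three, so the average search length for the 12 keys in this example is ASL = (1 × 7 + 2 × 4 + 3 × 1) / 12 = 1.5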
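The chains and the average search length can be recomputed with the Python sketch below. It uses Python lists in place of singly linked lists and appends each key at the tail of its chain; both are simplifying assumptions, and neither affects the ASL value.

```python
def build_chains(keys, m):
    """Append each key to the synonym chain of address key % m."""
    chains = [[] for _ in range(m)]
    for k in keys:
        chains[k % m].append(k)
    return chains


def average_search_length(chains):
    """A key at position j of its chain (1-based) needs j comparisons to be found."""
    total = sum(j for chain in chains for j in range(1, len(chain) + 1))
    count = sum(len(chain) for chain in chains)
    return total / count


keys = [32, 40, 36, 53, 16, 46, 71, 27, 42, 24, 49, 64]
chains = build_chains(keys, 13)
print(chains[1])                      # [40, 53, 27]
print(average_search_length(chains))  # 1.5
```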

 

4. Establish a public overflow area

The basic idea of this method is to divide the hash table into two parts: the basic table and the overflow table. Every element that collides with an element already in the basic table is placed in the overflow table.
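A simplified Python sketch of this layout is shown below. The class name, the hash function key % m and the use of an unordered list for the overflow table are all assumptions; the text only fixes the split into a basic table and an overflow table.

```python
class OverflowHashTable:
    """Basic table of m units plus one shared overflow table (simplified sketch)."""

    def __init__(self, m):
        self.m = m
        self.basic = [None] * m
        self.overflow = []

    def insert(self, key):
        slot = key % self.m
        if self.basic[slot] is None:
            self.basic[slot] = key
        else:
            self.overflow.append(key)    # collided with the basic table

    def contains(self, key):
        # Check the home unit first, then scan the overflow table sequentially.
        return self.basic[key % self.m] == key or key in self.overflow


t = OverflowHashTable(11)
for k in (47, 26, 60, 69):
    t.insert(k)
print(t.basic[3], t.overflow)  # 47 [69]
print(t.contains(69))          # True
```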

 

Reprinted from: http://www.360doc.com/content/14/0721/09/16319846_395862328.shtml
