Hash table

hash table

concept

Why do you need a hash table

In both static and dynamic lookup tables, finding a record whose key equals a given value requires comparing against a series of keys, either to locate the record or to conclude that the search has failed. The search time therefore always depends on the number of comparisons.

What is a hash table

A hash table (also called a scatter table) is a data structure that is accessed directly according to the key value.

Basic idea of ​​hash table

  • Establish a fixed relationship H between the storage location of a record and its key, so that each key corresponds to a unique storage location. This relationship H is the hash function of the hash table.
  • When searching, it is only necessary to compute H(k) for the given key k; the result is the storage location of the record. In this way the element being sought can be reached in a single access, without any comparisons.

Hash table related terms

  • Hash function: the correspondence established between the key of a record and the storage address of the record.
  • Collision (conflict): if two keys are different but their hash function values are the same, the two keys are called "synonyms" and the phenomenon is called a collision.
  • Hash lookup: the process of searching by means of a hash function.
  • Load factor (filling factor): if the number of records stored in the table is $m$ and the table length is $n$, the load factor is $\alpha = \frac{m}{n}$

Hash table properties

  • A hash table trades space for time: lookups are generally faster than with other methods, at the cost of extra storage.
  • Collisions are generally unavoidable, and their number grows with how full the table is.
  • With the same hash function, different collision-handling methods give hash tables with different average search lengths.
  • Resolving collisions by linear probing and rehashing tends to make records cluster ("secondary aggregation"), so that even keys that are not synonyms end up colliding with one another.
  • For a hash table that resolves collisions by open addressing, the table length must be greater than or equal to the number of records.
  • A hash table that resolves collisions by chaining does not require the table length to be ≥ the number of records, and its average search length depends mainly on the hash function itself.

construct hash function

There are many ways to construct a hash function. The most commonly used are the division-remainder method and the mid-square method; the choice should be made according to the characteristics of the data and other requirements.

Direct Addressing

The direct addressing method takes the key itself, or a linear function of the key, as the hash address.

Let the key be $k$; the linear form is then
$$H(k) = a \cdot k + b$$
The address set produced by this method has the same size as the key set, and no collisions occur.

Applicable situation:

  • The given keys cover every element of the possible key set; if they do not, some address units are necessarily left unused.
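As a minimal sketch of direct addressing in Python: the keys here are hypothetical ages from 1 to 100, mapped with $a = 1$ and $b = -1$ so that age $k$ lands in slot $k - 1$. The function name `direct_address` and the sample records are illustrative assumptions, not part of the original article.

```python
def direct_address(key: int, a: int = 1, b: int = 0) -> int:
    """Direct addressing: H(k) = a*k + b."""
    return a * key + b


# Hypothetical data: one slot per possible age 1..100.
table = [None] * 100
for age, label in [(1, "infants"), (25, "adults"), (100, "centenarians")]:
    table[direct_address(age, a=1, b=-1)] = (age, label)

# Every distinct key gets its own slot, so no collision can occur,
# but slots for ages that never appear stay empty.
unused_slots = table.count(None)
```

With only three of the hundred possible keys present, 97 slots stay empty, which is the space cost mentioned above.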

Digit analysis method

If all possible keys have the same number of digits and their values are known in advance, the keys can be analyzed and a few "uniformly distributed" digit positions, or a combination of them, can be taken as the hash address.

For example, suppose there are 80 records, each key is an 8-digit decimal number, and the hash table length is 100, so the address range is $[0, 99]$.

The known keys are shown in the figure below

Analysis shows that the first digit of every key is always 8, the second is always 1, the third is only 3 or 4, and the eighth is only 2, 7, or 5, whereas the digits in positions 4, 5, 6, and 7 are distributed almost randomly.

Therefore, take any two of the digits in positions 4, 5, 6, and 7 as the hash address, or the number obtained by superimposing two of those digits on the other two.

Applicable situation:

  • The key has more digits than the hash address, and the possible keys are known in advance.
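The digit selection described above can be sketched as follows. The key `"81346532"` is a hypothetical value that merely matches the pattern in the text (the original figure with the real keys is not available), and discarding the carry in the superimposed variant is an assumption.

```python
def digit_analysis_hash(key: str, positions=(4, 5)) -> int:
    """Build the address from the chosen 1-based digit positions of the key."""
    return int("".join(key[p - 1] for p in positions))


def superimposed_hash(key: str) -> int:
    """Add digits 4-5 to digits 6-7; dropping the carry is an assumption."""
    return (digit_analysis_hash(key, (4, 5)) + digit_analysis_hash(key, (6, 7))) % 100


sample_key = "81346532"  # hypothetical 8-digit key
address_a = digit_analysis_hash(sample_key, (4, 5))
address_b = digit_analysis_hash(sample_key, (6, 7))
address_c = superimposed_hash(sample_key)
```

Both variants yield a two-digit address in $[0, 99]$, matching the table length of 100 in the example.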

Mid-square method

If none of the digits of the key is evenly distributed, the middle digits of the square of the key can be taken as the hash address. Because the middle digits of a square are affected by every digit of the original number, the resulting hash addresses are more evenly distributed and collide less often.

For example, consider building a hash table for identifiers, where an identifier is a letter, or a letter and a digit. Inside the computer each letter or digit is represented by a two-digit octal number, and the addresses obtained with the mid-square method are as follows

Applicable situation:

  • Suitable when not all of the keys are known in advance
  • This is one of the more commonly used methods of constructing a hash function
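A minimal sketch of the mid-square idea, assuming decimal keys and a 3-digit address (the original example uses octal character codes, which are not reproduced here). The function name and the sample key 1234 are illustrative assumptions.

```python
def mid_square_hash(key: int, address_digits: int = 3) -> int:
    """Square the key and take `address_digits` digits from the middle."""
    squared = str(key * key)
    # Start so that the extracted window is centred in the square.
    start = max(0, (len(squared) - address_digits) // 2)
    return int(squared[start:start + address_digits])


# 1234 * 1234 = 1522756; the middle three digits are "227".
address = mid_square_hash(1234)
```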

folding method

If the key has many digits and the values in each digit position are roughly evenly distributed, shift folding or boundary folding can be used. The key is divided into several parts, and the sum of those parts (discarding any carry) is used as the hash address.

  • Shift folding aligns the lowest digits of all parts and then adds them
  • Boundary folding folds the key back and forth along the part boundaries, from one end to the other, and then aligns and adds the parts

For example, given the key 0442205864 and a hash address of 4 digits, the addresses obtained with the two folding methods are as shown in the figure below

Applicable situation:

  • Suitable when the key has many digits and the values in each digit position are roughly evenly distributed
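The two folding variants can be sketched for the key 0442205864 from the example. This sketch assumes the key is cut into 4-digit parts starting from the low-order (right) end, giving 5864, 4220 and 04, and that boundary folding reverses every second part; the function names are illustrative.

```python
def split_from_right(key: str, width: int) -> list[str]:
    """Cut the key into `width`-digit parts, starting at the low-order end."""
    parts = []
    end = len(key)
    while end > 0:
        start = max(0, end - width)
        parts.append(key[start:end])
        end = start
    return parts


def shift_fold(key: str, width: int) -> int:
    """Shift folding: align the parts on their lowest digit and add."""
    total = sum(int(part) for part in split_from_right(key, width))
    return total % 10 ** width  # discard the carry


def boundary_fold(key: str, width: int) -> int:
    """Boundary folding: every second part is reversed before adding."""
    parts = split_from_right(key, width)
    total = sum(int(p[::-1]) if i % 2 else int(p) for i, p in enumerate(parts))
    return total % 10 ** width  # discard the carry


shift_address = shift_fold("0442205864", 4)        # 5864 + 4220 + 04 = 10088 -> 0088
boundary_address = boundary_fold("0442205864", 4)  # 5864 + 0224 + 04 = 6092
```

Under these assumptions shift folding gives the address 0088 and boundary folding gives 6092.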

Division-remainder method

This method is very simple: the remainder obtained by dividing the key by some number p is used as the hash address.

As before, let k be the key. Choose p as a prime number not greater than the table length, or as a composite number that has no prime factor smaller than 20. The formula is as follows
$$H(k) = k \bmod p$$
The choice of p is very important: a poor choice easily produces synonyms.
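To illustrate why the choice of p matters, here is a small sketch with hypothetical keys that are all multiples of 10. The key set and the helper name `division_hash` are assumptions made for illustration only.

```python
def division_hash(key: int, p: int) -> int:
    """Division-remainder method: H(k) = k mod p."""
    return key % p


keys = list(range(10, 101, 10))  # hypothetical keys: 10, 20, ..., 100

# p = 10 shares a factor with every key, so all keys become synonyms.
addresses_bad = {division_hash(k, 10) for k in keys}

# p = 11 is prime and spreads the same keys over ten different addresses.
addresses_good = {division_hash(k, 11) for k in keys}
```

With p = 10 every key maps to address 0, while p = 11 gives each of the ten keys its own address.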

random number method

When the keys are not all of the same length, a pseudo-random function value of the key can be used as the hash address.

The formula is as follows:
$$H(k) = \mathrm{random}(k)$$
Applicable situation:

  • Suitable for cases where keywords vary in length
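One way to read $\mathrm{random}(k)$ is as a pseudo-random generator seeded with the key, so that the same key always produces the same address. The sketch below takes that reading using Python's standard `random.Random`; the function name `random_hash` and the table size of 13 are assumptions.

```python
import random


def random_hash(key: str, table_size: int) -> int:
    """Seed a private generator with the key and draw one address from it."""
    rng = random.Random(key)  # same key -> same seed -> same address
    return rng.randrange(table_size)


# Keys of different lengths are handled uniformly.
addresses = {key: random_hash(key, 13) for key in ("a", "hash", "hash table")}
```

Seeding a separate generator for each key keeps the function deterministic, which a hash function must be for later lookups to find the record again.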

resolve conflict

As mentioned above, a hash function constructed by any of these methods may produce the same value for different keys, and such collisions must be resolved.

Open Addressing

When a collision occurs, a probe sequence is generated and the addresses along it are probed one by one until an empty slot (an open address) is found; the colliding record is stored at that address.

The formula is as follows:
$$H_i = (H(k) + d_i) \bmod m, \quad i = 1, 2, 3, \dots, k \ (k \le m - 1)$$
where $k$ in $H(k)$ is the key (the $k$ that bounds $i$ is the number of probes), $m$ is the length of the hash table, and $d_i$ is the increment sequence.

Depending on the values of the increment sequence, there are three kinds of methods

  • Linear probing and rehashing: $d_i = 1, 2, \dots, m - 1$
  • Quadratic probing and rehashing: $d_i = 1^2, -1^2, 2^2, -2^2, \dots, \pm k^2$, with $k < \frac{m}{2}$
  • Pseudo-random probing and rehashing: $d_i$ is a sequence of pseudo-random numbers

For example, a hash table of length 11 already holds records with keys 17, 60, and 29, and its hash function is $H(k) = k \bmod 11$. A new record with key 38 is now to be inserted; the process and results for the three methods above are as follows

Linear probing and rehashing

Quadratic probing and rehashing

Pseudo-random probing and rehashing
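The example can be traced with a short sketch. Keys 17, 60 and 29 hash to addresses 6, 5 and 7, and 38 hashes to 5, which is occupied. The helper names are illustrative, and the pseudo-random increments are an arbitrary seeded permutation, since the sequence used in the original figure is not given.

```python
import random


def linear_increments(m):
    """d_i = 1, 2, ..., m-1"""
    return range(1, m)


def quadratic_increments(m):
    """d_i = 1^2, -1^2, 2^2, -2^2, ..., +-k^2 with k < m/2"""
    for k in range(1, (m + 1) // 2):
        yield k * k
        yield -(k * k)


def pseudo_random_increments(m, seed=0):
    """An arbitrary seeded permutation of 1..m-1 (assumed sequence)."""
    return random.Random(seed).sample(range(1, m), m - 1)


def insert(table, key, increments):
    """Store key at H(k) = k mod m, probing (H(k) + d_i) mod m on collision."""
    m = len(table)
    home = key % m
    if table[home] is None:
        table[home] = key
        return home
    for d in increments:
        addr = (home + d) % m
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("no free slot found along the probe sequence")


def example_table():
    """Table of length 11 already holding 17, 60 and 29."""
    table = [None] * 11
    for key in (17, 60, 29):
        insert(table, key, linear_increments(11))
    return table


linear_slot = insert(example_table(), 38, linear_increments(11))        # probes 6, 7, 8
quadratic_slot = insert(example_table(), 38, quadratic_increments(11))  # probes 6, 4
```

Linear probing finds addresses 6 and 7 occupied and stores 38 at address 8; quadratic probing tries 6, then $5 - 1 = 4$, and stores 38 at address 4.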

Chaining (chain address method)

All records whose keys are synonyms are linked together in one linear linked list. The hash table then takes the form of a "pointer array", in which each element stores the head pointer of the linked list for the corresponding hash address.
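A minimal sketch of chaining with an explicit linked list. It assumes the hash function $H(k) = k \bmod m$ and inserts new records at the head of their chain; the class and method names are illustrative.

```python
class Node:
    """One record in a chain of synonyms."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


class ChainedHashTable:
    def __init__(self, size):
        # The "pointer array": one head pointer per hash address.
        self.heads = [None] * size

    def _address(self, key):
        return key % len(self.heads)

    def insert(self, key):
        """Link the new record in at the head of its chain."""
        addr = self._address(key)
        self.heads[addr] = Node(key, self.heads[addr])

    def contains(self, key):
        """Walk the chain for the key's address."""
        node = self.heads[self._address(key)]
        while node is not None:
            if node.key == key:
                return True
            node = node.next
        return False


chained = ChainedHashTable(11)
for key in (17, 60, 29, 38):
    chained.insert(key)
```

Here 60 and 38 are synonyms (both hash to address 5), so they share one chain while every other address holds at most one record.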

rehashing

Construct several hash functions; when a collision occurs, use the next hash function to compute another hash address, and repeat until no collision occurs.

This method requires a sequence of hash functions to be set up in advance, and it increases the computation time accordingly.
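A sketch of rehashing with a fixed sequence of hash functions. The three functions below are hypothetical choices made only to show the mechanism; each result is reduced modulo the table length so that it stays a valid address.

```python
def rehash_insert(table, key, hash_functions):
    """Try each hash function in turn until a free address is found."""
    for h in hash_functions:
        addr = h(key) % len(table)
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("every hash function in the sequence collided")


# Hypothetical hash function sequence.
functions = [
    lambda k: k % 11,
    lambda k: k % 7,
    lambda k: k // 10,
]

rehash_table = [None] * 11
placements = [rehash_insert(rehash_table, k, functions) for k in (17, 60, 29, 38)]
```

Key 38 collides with 60 at address 5 under the first function, so the second function is applied and 38 is stored at address $38 \bmod 7 = 3$.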

Overflow area method

Create two tables: a base table and an overflow table. The overflow table stores every record whose key collides with a key already in the base table; as soon as a collision occurs, the new record is placed in the overflow table.
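The base-table and overflow-table arrangement can be sketched as follows, assuming $H(k) = k \bmod m$ and an overflow table that is simply searched sequentially; the class name is illustrative.

```python
class OverflowHashTable:
    def __init__(self, size):
        self.base = [None] * size  # base table, addressed by the hash function
        self.overflow = []         # overflow table for all colliding records

    def insert(self, key):
        addr = key % len(self.base)
        if self.base[addr] is None:
            self.base[addr] = key
        else:
            # Collision: the record goes to the overflow table.
            self.overflow.append(key)

    def contains(self, key):
        addr = key % len(self.base)
        return self.base[addr] == key or key in self.overflow


overflow_table = OverflowHashTable(11)
for key in (17, 60, 29, 38):
    overflow_table.insert(key)
```

Key 38 collides with 60 at address 5, so it ends up in the overflow table while the other three keys stay in the base table.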

Average lookup length ASL

lookup length

If the element is found at the position given by the hash function on the first probe, the successful search length is 1. If the record at that position is not the one being sought, continue comparing along the probe sequence as described above, adding 1 for each further comparison.

If the element is still not found, the accumulated count is the unsuccessful search length.

average lookup length

ASL (Average Search Length) is the average search length. Because the time of a search is spent on comparing keys, the average number of keys that must be compared with the value being searched for is called the average search length:
$$ASL = \sum_{i=1}^{n} p_i c_i$$
where $p_i$ is the probability of searching for the $i$-th element and $c_i$ is the number of comparisons needed to find the $i$-th element.

The average search length is given separately for successful searches and for unsuccessful searches.

When discussing ASL, each element is generally assumed to be searched for with equal probability, that is, $p_i = \frac{1}{n}$

The average search length is then
$$ASL = \frac{1}{n} \sum_{i=1}^{n} c_i$$
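As a worked instance of the equal-probability formula, the sketch below builds the length-11 table from the open addressing example (keys 17, 60, 29, 38 with $H(k) = k \bmod 11$) using linear probing and counts the comparisons needed to find each key. The helper names are illustrative.

```python
def build_linear_probing_table(keys, m):
    """Insert keys with H(k) = k mod m and linear probing."""
    table = [None] * m
    for key in keys:
        addr = key % m
        while table[addr] is not None:
            addr = (addr + 1) % m
        table[addr] = key
    return table


def comparisons_to_find(table, key):
    """Number of key comparisons made before `key` is found."""
    m = len(table)
    for i in range(m):
        if table[(key % m + i) % m] == key:
            return i + 1
    raise KeyError(key)


keys = [17, 60, 29, 38]
asl_table = build_linear_probing_table(keys, 11)
counts = [comparisons_to_find(asl_table, k) for k in keys]
asl_success = sum(counts) / len(keys)
```

Keys 17, 60 and 29 are each found on the first comparison, while 38 needs four (addresses 5, 6, 7, 8), so $ASL = \frac{1}{4}(1 + 1 + 1 + 4) = 1.75$.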

ASL of the different methods

Under the equal-probability assumption, the average successful and average unsuccessful search lengths of the methods above are as follows (given without proof)

Linear probing and rehashing

Average successful search length:
$$S_{nl} \approx \frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right)$$
Average unsuccessful search length:
$$U_{nl} \approx \frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)$$

Pseudo-random probing and rehashing, quadratic probing and rehashing, and rehashing

Average successful search length:
$$S_{nr} \approx -\frac{1}{\alpha} \ln(1 - \alpha)$$
Average unsuccessful search length:
$$U_{nr} \approx \frac{1}{1 - \alpha}$$

Chaining

Average successful search length:
$$S_{nc} \approx 1 + \frac{\alpha}{2}$$
Average unsuccessful search length:
$$U_{nc} \approx \alpha + e^{-\alpha}$$
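To compare the three families of formulas numerically, they can be evaluated for a given load factor $\alpha$. This is a direct transcription of the approximations above; the function names and the sample value $\alpha = 0.5$ are illustrative choices.

```python
import math


def asl_linear_probing(alpha):
    """(successful, unsuccessful) ASL for linear probing."""
    return 0.5 * (1 + 1 / (1 - alpha)), 0.5 * (1 + 1 / (1 - alpha) ** 2)


def asl_random_probing(alpha):
    """(successful, unsuccessful) ASL for pseudo-random or quadratic probing and rehashing."""
    return -math.log(1 - alpha) / alpha, 1 / (1 - alpha)


def asl_chaining(alpha):
    """(successful, unsuccessful) ASL for chaining."""
    return 1 + alpha / 2, alpha + math.exp(-alpha)


alpha = 0.5
results = {
    "linear probing": asl_linear_probing(alpha),
    "random probing": asl_random_probing(alpha),
    "chaining": asl_chaining(alpha),
}
```

All three depend only on $\alpha$, not on the number of records, and the open addressing formulas are defined only for $\alpha < 1$.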

the code

For a simple use of the hash table, please refer to the previous article

Hash sorting algorithm-CairBin's Blog

Origin blog.csdn.net/qq_42759112/article/details/127987905