hash table
concept
Why do you need a hash table
In both static and dynamic lookup tables, finding the record whose key equals a given value requires a series of key comparisons before the storage location of the record is determined or the search is declared a failure. The search time therefore always depends on the number of comparisons.
What is a hash table
A hash table, also called a scatter table, is a data structure that is accessed directly according to the key value.
Basic idea of hash table
- Establish a definite relationship H between the storage location of a record and its key, so that each key corresponds to a unique storage location. This relationship H is the hash function of the hash table.
- When searching, compute H(k) for the given key k according to this relationship to obtain the storage location of the record. Ideally, the record being looked up is reached in a single access, without any comparisons.
Hash table related terms
- Hash function: the correspondence established between the key of a record and the storage address of the record.
- Conflict (collision): two different keys have the same hash function value. The two keys are called "synonyms".
- Hash lookup: the process of looking up a record using the hash function.
- Load factor (filling factor): if the number of records stored in the table is $m$ and the table length is $n$, the load factor is $\alpha = \frac{m}{n}$
Hash table properties
- A hash table trades space for time: its search is generally faster than other methods, but it consumes more storage
- Conflicts are generally unavoidable, and the number of conflicts grows with the load factor of the table
- With the same hash function, different conflict-handling methods give hash tables with different average search lengths
- Handling conflicts by linear probing tends to cause "secondary clustering" of records, in which keys that are not synonyms also end up competing for the same address
- For a hash table that handles conflicts by open addressing, the table length must be greater than or equal to the number of records
- A hash table that handles conflicts by the chain address method does not require the table length to be ≥ the number of records, and its average search length mainly depends on the hash function itself
Constructing a hash function
There are many ways to construct a hash function. The more commonly used ones include the division-remainder method and the mid-square method; the choice depends on the characteristics of the data and on other requirements.
Direct Addressing
The direct addressing method takes the key itself, or a linear function of the key, as the hash address.
Let the key be $k$; the linear form is
$H(k) = ak + b$
The address set obtained by this method has the same size as the key set, so there are no conflicts.
Applicable situation:
- The given keys are all the elements of the key set. If they are not all of them, some address units are necessarily left unused.
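As a minimal Python sketch of direct addressing: the function name `direct_address`, the coefficients, and the sample keys (the years 1949 to 1958) are illustrative assumptions, not part of the original text.

```python
def direct_address(key: int, a: int = 1, b: int = 0) -> int:
    """Direct addressing: H(k) = a*k + b."""
    return a * key + b


# Hypothetical keys: the years 1949..1958, mapped to slots 0..9
# by choosing a = 1 and b = -1949.
years = list(range(1949, 1959))
table = [None] * len(years)
for year in years:
    table[direct_address(year, 1, -1949)] = year

# Every key lands in its own slot, so no conflict occurs.
print(table)
```

If only some of these years were actually present, the corresponding slots would simply stay empty, which is the wasted space mentioned above.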
Digit analysis method
If all possible keys have the same number of digits and their values are known in advance, the keys can be analyzed and several "uniformly distributed" digits, or a combination of them, taken as the hash address.
For example, suppose there are 80 records whose keys are 8-digit decimal numbers, and the hash table length is 100, so the address range is $[0, 99]$
The known keys are shown in the figure below
Analysis shows that the first digit of every key is always 8, the second is always 1, the third is only 3 or 4, and the eighth is only 2, 7, or 5. The digits in positions 4, 5, 6, and 7, however, are distributed almost randomly.
Therefore, any two of the fourth, fifth, sixth, and seventh digits can be taken as the hash address, or the number obtained by adding two of these digits to the other two.
Applicable situation:
- The key has more digits than the hash address, and the possible keys are known in advance.
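The digit selection can be sketched in Python as follows. The three sample keys are made up for illustration (they only respect the digit pattern described above and are not taken from the original figure), and the choice of positions 5 and 6 is likewise an assumption.

```python
def digit_analysis_hash(key: int, positions=(5, 6), width: int = 8) -> int:
    """Build the address from the digits at the given 1-based positions."""
    digits = str(key).zfill(width)
    return int("".join(digits[p - 1] for p in positions))


# Hypothetical 8-digit keys: first digit 8, second digit 1, third digit 3.
sample_keys = [81346532, 81372242, 81387422]
addresses = [digit_analysis_hash(k) for k in sample_keys]
print(addresses)  # [65, 22, 74]
```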
Mid-square method
If the digits of the key are unevenly distributed in every position, the middle digits of the square of the key can be taken as the hash address. Since the middle digits of a square are affected by every digit of the number, the resulting hash addresses are distributed more uniformly and conflict less often.
For example, consider building a hash table for identifiers, where an identifier is a single letter or a letter followed by a digit. Inside the computer, each letter or digit is represented by a two-digit octal number, and the addresses obtained with the mid-square method are as follows
Applicable situation:
- Suitable when not all the keys are known in advance
- This is one of the more commonly used methods for constructing a hash function
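A minimal Python sketch of the mid-square idea follows. It works on decimal digits rather than the octal codes of the identifier example, and the function name and the `address_digits` parameter are assumptions made for illustration.

```python
def mid_square_hash(key: int, address_digits: int = 3) -> int:
    """Take the middle `address_digits` decimal digits of key squared."""
    squared = str(key * key)
    # Pad short squares so there is always a middle section to take.
    squared = squared.zfill(address_digits)
    start = (len(squared) - address_digits) // 2
    return int(squared[start:start + address_digits])


# 1234 squared is 1522756; the middle three digits are 227.
print(mid_square_hash(1234))  # 227
```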
Folding method
If the key has many digits and the digits in each position are roughly evenly distributed, shift folding or boundary folding can be used: the key is divided into several parts, and the sum of these parts (discarding the carry) is used as the hash address.
- Shift folding aligns the lowest digits of the parts and then adds them
- Boundary folding folds the key back and forth along the part boundaries, from one end to the other, and then aligns and adds the parts
For example, given the key 0442205864 and a hash address of 4 digits, the addresses obtained with the two folding methods are as shown in the figure below
Applicable situation:
- Suitable when the key has many digits and the digits in each position are roughly evenly distributed
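The two folding variants can be sketched in Python for the key 0442205864 with a 4-digit address. The helper names are illustrative, and two details are assumptions: the key is split into 4-digit parts starting from the right, and only the low-order 4 digits of the sum are kept.

```python
def split_from_right(digits: str, width: int) -> list:
    """Split a digit string into parts of `width` digits, lowest part first."""
    parts = []
    end = len(digits)
    while end > 0:
        parts.append(digits[max(0, end - width):end])
        end -= width
    return parts


def shift_fold(digits: str, width: int) -> int:
    """Shift folding: align the parts on their lowest digit and add."""
    total = sum(int(p) for p in split_from_right(digits, width))
    return total % 10 ** width  # keep only `width` digits


def boundary_fold(digits: str, width: int) -> int:
    """Boundary folding: every second part is reversed before adding."""
    parts = split_from_right(digits, width)
    total = sum(int(p[::-1]) if i % 2 else int(p) for i, p in enumerate(parts))
    return total % 10 ** width


print(shift_fold("0442205864", 4))     # 5864 + 4220 + 04 = 10088 -> 88 (0088)
print(boundary_fold("0442205864", 4))  # 5864 + 0224 + 04 = 6092
```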
Division-remainder method
This method is very simple: the remainder obtained by dividing the key by some number p is used as the hash address.
Let $k$ be the key and let $p$ be a prime number not greater than the table length, or a composite number with no prime factor less than 20. The formula is
$H(k) = k \bmod p$
The choice of p is very important: a poor choice easily produces synonyms.
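To illustrate why the choice of p matters, here is a small Python sketch. The keys (multiples of 10) and the two values of p are chosen only for illustration.

```python
def division_hash(key: int, p: int) -> int:
    """Division-remainder method: H(k) = k mod p."""
    return key % p


keys = [10, 20, 30, 40, 50]

# p = 10 shares a factor with every key: all keys become synonyms.
bad = [division_hash(k, 10) for k in keys]
# p = 11 is prime: the same keys spread over different addresses.
good = [division_hash(k, 11) for k in keys]

print(bad)   # [0, 0, 0, 0, 0]
print(good)  # [10, 9, 8, 7, 6]
```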
Random number method
When the keys differ in length, a pseudo-random function value of the key can be used as the hash address.
The formula is as follows:
$H(k) = \mathrm{random}(k)$
Applicable situation:
- Suitable when the keys vary in length
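One possible reading of random(k) in Python is to seed a private pseudo-random generator with the key, so that the same key always yields the same address. This is only a sketch under that assumption; the function name and the table length of 100 are illustrative.

```python
import random


def random_hash(key: str, table_length: int = 100) -> int:
    """Use the key as the seed of a pseudo-random generator."""
    # A generator seeded with the same key always produces the same
    # first value, which makes the address repeatable for lookups.
    return random.Random(key).randrange(table_length)


for key in ("a", "hash", "a much longer key"):
    print(key, random_hash(key))
```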
Resolving conflicts
As mentioned above, a hash function constructed by any of these methods may give the same value for different keys, so conflicts must be resolved.
Open Addressing
When a conflict occurs, a probe sequence is formed and the addresses along it are probed one by one until an empty location (an open address) is found; the conflicting record is then placed at that address.
The formula is as follows:
$H_i = (H(k) + d_i) \bmod m, \quad i = 1, 2, 3, \ldots, k \ (k \le m - 1)$
where $H(k)$ is the hash function applied to the key, $m$ is the length of the hash table, and $d_i$ is the increment sequence.
Depending on the increment sequence, there are three methods:
- Linear probing rehashing: $d_i = 1, 2, \ldots, m - 1$
- Quadratic probing rehashing: $d_i = 1^2, -1^2, 2^2, -2^2, \ldots, \pm k^2$, with $k < \frac{m}{2}$
- Pseudo-random probing rehashing: $d_i$ is a sequence of pseudo-random numbers
For example, a hash table of length 11 already holds records with keys 17, 60, and 29, and its hash function is $H(k) = k \bmod 11$. A new record with key 38 is now to be inserted; the process and results for the three methods above are as follows
Linear probing rehashing
Quadratic probing rehashing
Pseudo-random probing rehashing
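The insertion of key 38 into the length-11 table can be traced with a short Python sketch. The function names are illustrative. The pseudo-random increment sequence used here (starting with 9) is an assumption, since the sequence used in the original example is not given in the text.

```python
def probe_insert(table, key, increments):
    """Insert key with open addressing: H_i = (H(k) + d_i) mod m."""
    m = len(table)
    home = key % m
    if table[home] is None:
        table[home] = key
        return home
    for d in increments:
        addr = (home + d) % m  # Python's % keeps the result in 0..m-1
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("no free slot found along the probe sequence")


def linear_increments(m):
    return range(1, m)  # 1, 2, ..., m-1


def quadratic_increments(m):
    for k in range(1, (m + 1) // 2):  # k < m / 2
        yield k * k
        yield -(k * k)


def initial_table():
    table = [None] * 11
    for key in (17, 60, 29):  # stored at addresses 6, 5, 7
        table[key % 11] = key
    return table


print(probe_insert(initial_table(), 38, linear_increments(11)))     # 8
print(probe_insert(initial_table(), 38, quadratic_increments(11)))  # 4
print(probe_insert(initial_table(), 38, [9, 3, 7]))                 # 3 (assumed sequence)
```

Key 38 hashes to address 5, which is taken by 60. Linear probing then tries 6 and 7 (both taken) before reaching 8, while quadratic probing tries 6 and then finds 4 free.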
Chain address method
All records whose keys are synonyms are linked into one linear linked list. The hash table then takes the form of a "pointer array", in which each component stores the head pointer of the linked list for the corresponding hash address.
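A minimal Python sketch of the chain address method is given below. Python lists stand in for the linked lists, and the class name, the table length of 11, and the `key % length` hash function are assumptions made for illustration.

```python
class ChainedHashTable:
    """Each slot holds a chain (here a Python list) of (key, value) pairs."""

    def __init__(self, length: int = 11):
        self.buckets = [[] for _ in range(length)]

    def _address(self, key: int) -> int:
        return key % len(self.buckets)

    def insert(self, key: int, value) -> None:
        bucket = self.buckets[self._address(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:      # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # synonyms share the same chain

    def search(self, key: int):
        for existing_key, value in self.buckets[self._address(key)]:
            if existing_key == key:
                return value
        return None


table = ChainedHashTable()
for key in (17, 60, 29, 38):
    table.insert(key, f"record-{key}")
print(table.buckets[5])  # 60 and 38 are synonyms and share chain 5
```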
Rehashing
Several hash functions are constructed. When a conflict occurs, another hash function is used to compute another hash address, until no conflict occurs.
This method requires a sequence of hash functions to be set up in advance, and it increases the computation time.
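As a sketch of rehashing in Python, the hash functions are tried in a fixed order until a free address turns up. The three functions in the sequence are arbitrary choices for illustration, not functions prescribed by the text.

```python
def rehash_insert(table, key, hash_functions):
    """Try each hash function in order until a free address is found."""
    for h in hash_functions:
        addr = h(key) % len(table)
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("every hash function in the sequence produced a conflict")


# Hypothetical sequence of hash functions, fixed in advance.
hash_functions = [
    lambda k: k % 11,
    lambda k: k % 7,
    lambda k: (k // 11) % 11,
]

table = [None] * 11
addresses = [rehash_insert(table, k, hash_functions) for k in (17, 60, 29, 38)]
print(addresses)  # [6, 5, 7, 3]: 38 conflicts at 5, then 38 % 7 = 3 is free
```

A lookup has to apply the same functions in the same order, which is where the extra computation time comes from.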
Overflow area method
Two tables are created: a basic table and an overflow table. The overflow table stores every record whose key conflicts with a key already in the basic table; as soon as a conflict occurs, the record is placed in the overflow table.
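The basic table and overflow table can be sketched in Python as follows. The class name, the table length of 11, and the `key % length` hash function are illustrative assumptions.

```python
class OverflowHashTable:
    """Basic table plus one shared overflow table for all conflicts."""

    def __init__(self, length: int = 11):
        self.basic = [None] * length
        self.overflow = []

    def insert(self, key: int) -> None:
        addr = key % len(self.basic)
        if self.basic[addr] is None:
            self.basic[addr] = key
        elif self.basic[addr] != key and key not in self.overflow:
            self.overflow.append(key)  # any conflict goes to the overflow table

    def contains(self, key: int) -> bool:
        addr = key % len(self.basic)
        # Check the basic table first, then scan the overflow table.
        return self.basic[addr] == key or key in self.overflow


table = OverflowHashTable()
for key in (17, 60, 29, 38):
    table.insert(key)
print(table.overflow)  # [38]
```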
Average lookup length ASL
lookup length
Computing the position of an element from the hash relationship and checking that position counts as one comparison; if the element is found there, the successful search length is 1. If the first position does not hold the desired key, the search continues along the conflict-handling sequence, and the length increases by 1 with each further comparison.
If the element is still not found after this process, the accumulated length is the failed search length.
average lookup length
ASL (Average Search Length) is the average search length. Because the time of a search is spent on key comparisons, the average number of keys that must be compared with the value being searched for is called the average search length:
$ASL = \sum_{i=1}^{n} p_i c_i$
where $p_i$ is the probability of searching for the i-th element and $c_i$ is the number of comparisons needed to find the i-th element.
The average search length is divided into the average successful search length and the average failed search length.
When discussing ASL, each element is generally assumed to be searched for with equal probability, that is, $p_i = \frac{1}{n}$
The average search length is then
$ASL = \frac{1}{n}\sum_{i=1}^{n} c_i$
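As a worked example of the equal-probability formula, the following Python sketch builds the length-11 table from the open addressing example (keys 17, 60, 29, 38, $H(k) = k \bmod 11$) with linear probing and averages the comparison counts. It assumes no deletions, so the number of comparisons needed to find a key equals the number of probes made when it was inserted; the function name is illustrative.

```python
def linear_probe_insert(table, key):
    """Insert with linear probing and return the number of probes used."""
    m = len(table)
    for i in range(m):
        addr = (key + i) % m
        if table[addr] is None:
            table[addr] = key
            return i + 1
    raise RuntimeError("table is full")


table = [None] * 11
comparisons = [linear_probe_insert(table, k) for k in (17, 60, 29, 38)]

# Equal probability: ASL = (1/n) * sum of c_i
asl_success = sum(comparisons) / len(comparisons)
print(comparisons)  # [1, 1, 1, 4]
print(asl_success)  # 1.75
```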
ASL in several ways
Under the equal-probability assumption, the average successful search length and the average failed search length of the methods mentioned above are as follows (given without proof)
Linear probing rehashing
Average successful search length:
$S_{nl} \approx \frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right)$
Average failed search length:
$U_{nl} \approx \frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right)$
Pseudo-random probing rehashing, quadratic probing rehashing, and rehashing
Average successful search length:
$S_{nr} \approx -\frac{1}{\alpha}\ln(1 - \alpha)$
Average failed search length:
$U_{nr} \approx \frac{1}{1 - \alpha}$
Chain address method
Average successful search length:
$S_{nc} \approx 1 + \frac{\alpha}{2}$
Average failed search length:
$U_{nc} \approx \alpha + e^{-\alpha}$
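To compare the three families numerically, the approximations above can be evaluated for a given load factor. The Python sketch below does only that; the function name is illustrative, and the restriction to $0 < \alpha < 1$ is a simplification made here so that all six formulas are defined.

```python
import math


def asl_estimates(alpha: float) -> dict:
    """Return (successful, failed) ASL approximations for a load factor."""
    if not 0 < alpha < 1:
        raise ValueError("this sketch expects 0 < alpha < 1")
    return {
        "linear probing": (
            0.5 * (1 + 1 / (1 - alpha)),
            0.5 * (1 + 1 / (1 - alpha) ** 2),
        ),
        "pseudo-random / quadratic / rehashing": (
            -math.log(1 - alpha) / alpha,
            1 / (1 - alpha),
        ),
        "chain address": (
            1 + alpha / 2,
            alpha + math.exp(-alpha),
        ),
    }


for method, (success, failure) in asl_estimates(0.5).items():
    print(f"{method}: success {success:.3f}, failure {failure:.3f}")
```

For $\alpha = 0.5$, linear probing gives 1.5 and 2.5, while the chain address method gives 1.25 for a successful search. All of these values depend only on the load factor, not on the number of records.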
the code
For a simple use of the hash table, please refer to the previous article