What a hash function is, the concept of a hash table, and how to resolve conflicts (collisions)

Hash table

Hash table = hash function + conflict resolution method

Construction methods

Direct addressing method, division-remainder method

Ways to resolve conflicts

Open addressing method: linear probing (linear detection), quadratic probing (square detection)

Chain address method (zipper method)

1. Some terms of the hash table

Hash method

Choose a function, use it to compute each element's storage location from its key, and store the element at that location.

When searching, apply the same function to the given value k to compute an address, then compare k with the key of the element stored in that address unit to determine whether the search succeeds.

Hash function

The conversion function used in the hash method

Hash table

The table built with the hash method. The hash function H(key) = k gives the address k at which the element with that key is stored

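Conflict (collision)

Different keys are mapped to the same hash address. In hash-based search, conflicts cannot be avoided completely; they can only be reduced as far as possible.

The keys are different, but their hash function values are the same: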

key1 != key2, but H(key1) = H(key2)
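A minimal sketch of such a conflict. The hash function H(key) = key mod 13 and the sample keys are assumptions chosen for illustration; the text above does not fix a particular function.

```python
def h(key, p=13):
    """Toy hash function H(key) = key mod p (p = 13 is an assumed value)."""
    return key % p


key1, key2 = 14, 27
# The keys differ, yet both are mapped to the same hash address.
print(key1 != key2, h(key1), h(key2))
```

Any function that maps a large key range onto a small address range will produce pairs like this, which is why a conflict resolution method is always needed.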

Accumulation

Accumulation is also called non-synonym conflict: two elements with different hash function values compete for the same subsequent hash address, which causes accumulation (also called aggregation or clustering).

Non-synonym conflict

The first conflict is caused by synonyms; conflicts from the second one onward are caused by non-synonyms.

Synonym

Keys that have the same hash function value, and therefore conflict with each other, are called synonyms.

Aggregation (clustering)

Two keys have different hash addresses but end up competing for the same successor address; this phenomenon is called aggregation.

It puts elements that are not synonyms into the same probe sequence, which increases the search time.

When clustering is severe, the quadratic probing method (square detection method) can be used instead.

2. Construction method of hash function

There are two problems to be solved when using a hash table:

  1. Construct a good hash function

    • The function should be as simple as possible, so that the conversion is fast
    • The addresses it computes for the keys should be evenly distributed over the hash address set, to reduce wasted space
    • Every position in the table should be as equally likely as possible to be a hash address, to reduce conflicts
  2. Develop a good conflict resolution plan

    When searching, if the key cannot be found at the address computed by the hash function, the other relevant units should be probed in a regular order according to the conflict resolution rules

  3. Factors to consider when constructing a hash function

    1. Execution speed (that is, the time required to calculate the hash function)
    2. Key length
    3. The size of the hash table (the larger the table, the less likely conflicts are, but the more space is wasted)
    4. Key distribution
    5. Search frequency
  4. Construct the function according to the characteristics of the element set

    • Requirement 1: n data items originally occupy only n addresses. Hashing trades space for time, but the hash address space should still be as small as possible
    • Requirement 2: whatever storage method is used, the goal is to distribute the elements as evenly as possible to avoid conflicts
  5. Commonly used construction methods

    1. Direct addressing method
    2. Digit analysis method
    3. Mid-square method
    4. Folding method
    5. Division-remainder method (divide and leave remainder method)
    6. Random number method
  6. Direct addressing

    Precondition: the keys are essentially contiguous

    Take the key itself, or a linear function of the key, as the hash address

    Hash(key) = a * key + b (a and b are constants)

    Keys and addresses correspond one to one, so no conflict occurs

    Advantage: because a linear function of the key is used as the hash address, no conflict can occur

    Disadvantage: a contiguous address space must be reserved, so space efficiency is low when the keys are sparse

  7. Division-remainder method (divide and leave remainder method)

    Hash(key) = key mod p (p is a nonzero integer)

    Key point: how to choose a suitable p? When p is a prime number, conflicts are relatively unlikely

    Tip: if the table length is m, choose p to be a prime number with p <= m
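The two construction methods described in detail above can be sketched as follows. This is only an illustration under stated assumptions: the helper names, the sample keys, and the choice of the largest prime not exceeding m are not from the original text, which only asks for some prime p <= m.

```python
def direct_address(key, a=1, b=0):
    """Direct addressing: Hash(key) = a * key + b."""
    return a * key + b


def build_direct_table(keys, a=1, b=0):
    """Store each key at its own address; assumes every address is >= 0."""
    addresses = [direct_address(k, a, b) for k in keys]
    table = [None] * (max(addresses) + 1)   # one slot per possible address
    for key, addr in zip(keys, addresses):
        table[addr] = key
    return table


def is_prime(n):
    """Trial division, good enough for table-sized numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def largest_prime_at_most(m):
    """One way to pick p for Hash(key) = key mod p: the largest prime <= m."""
    for p in range(m, 1, -1):
        if is_prime(p):
            return p
    raise ValueError("there is no prime number <= m")


def division_hash(key, p):
    """Division-remainder method: Hash(key) = key mod p."""
    return key % p


# Hypothetical, nearly contiguous keys: direct addressing wastes one slot.
print(build_direct_table([2000, 2001, 2003], a=1, b=-2000))

# For an assumed table length m = 16, the largest prime p <= m is 13.
p = largest_prime_at_most(16)
print(p, [division_hash(k, p) for k in (19, 14, 23, 1)])
```

With sparse keys such as 1 and 1000, `build_direct_table` would allocate 1001 slots for two elements, which is the space problem noted for direct addressing.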

3. Methods of handling conflicts

Open address method (open addressing)

Chain address method (zipper method)

Rehashing method (double hash function method)

Create a common overflow area

The first conflict is caused by synonyms; conflicts from the second one onward are caused by non-synonyms.

How often conflicts occur depends on three factors

1. Load factor a:

  • a = the number of records stored / the size of the hash table
  • When a is between 0.6 and 0.9, conflicts are relatively unlikely (this range both limits conflicts and keeps storage utilization high)
  • For example, with 600 elements the table length should be between 667 and 1000.

2. Hash function:

  • The hash value calculated by a good hash function will be evenly distributed in the entire address range of the hash table, thereby reducing conflicts

3. Methods of handling conflicts

A good way of handling conflicts reduces secondary conflicts

  • Open addressing method
    • Linear probing (linear detection method). Advantage: conflict resolution is simple. Disadvantage: prone to accumulation
    • Quadratic probing (square detection method). Advantage: avoids the accumulation problem. Disadvantage: may not be able to probe every unit of the hash table
  • Chain address method (zipper method)
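The table-length range quoted for the load factor (600 elements, a between 0.6 and 0.9) can be checked with a little arithmetic. The function names below are illustrative only.

```python
import math


def load_factor(records, table_length):
    """a = number of records stored / size of the hash table."""
    return records / table_length


def table_length_range(records, a_low=0.6, a_high=0.9):
    """Table lengths that keep the load factor within [a_low, a_high]."""
    shortest = math.ceil(records / a_high)   # fullest table allowed
    longest = math.floor(records / a_low)    # emptiest table allowed
    return shortest, longest


print(table_length_range(600))
```

For 600 records this gives a shortest length of 667 and a longest length of 1000, matching the figures in the text.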

1. Open address method (open addressing)

Basic idea: when a conflict occurs, look for the next empty hash address. As long as the hash table is large enough, an empty hash address can always be found, and the data element is stored there

For example, with the division-remainder method: Hi = (Hash(key) + di) mod m, where m is the table length and di is an increment sequence

Common methods:

  • Linear probing (linear detection method): di is the linear sequence 1, 2, ..., m-1
  • Quadratic probing (secondary detection method): di is the quadratic sequence 1^2, -(1^2), 2^2, -(2^2), ..., q^2
  • Pseudo-random probing: di is a sequence of pseudo-random numbers
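A minimal sketch of open addressing with the linear and quadratic (secondary) increment sequences, using Hi = (Hash(key) + d) mod m. The function names, the bound q = m // 2 for the quadratic sequence, and the sample table are assumptions made for illustration.

```python
def linear_increments(m):
    """d = 1, 2, ..., m-1."""
    return list(range(1, m))


def quadratic_increments(m):
    """d = 1^2, -(1^2), 2^2, -(2^2), ... up to an assumed bound q = m // 2."""
    increments = []
    for q in range(1, m // 2 + 1):
        increments.extend([q * q, -(q * q)])
    return increments


def insert(table, key, hash_fn, increments):
    """Store key at the first empty address along its probe sequence."""
    m = len(table)
    home = hash_fn(key)
    for d in [0] + increments:          # d = 0 is the home address itself
        addr = (home + d) % m           # Python's % keeps the result in 0..m-1
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("no empty address found along the probe sequence")


table = [None] * 11
h = lambda key: key % 11                # assumed hash function
print(insert(table, 22, h, linear_increments(11)))     # home address is free
print(insert(table, 33, h, linear_increments(11)))     # conflict, moves one step
print(insert(table, 44, h, quadratic_increments(11)))  # tries +1, then -1
```

Note that the quadratic sequence may skip some addresses entirely, so the insert can fail even when the table still has free units.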

2. Chain address method (zipper method)

Basic idea: link the records that share a hash address into one singly linked list

With m hash addresses, set up m singly linked lists, and use an array to store their m head pointers, forming a dynamic structure

Elements with the same hash address are placed in the same singly linked list, and the head pointer of that list is stored at the corresponding hash address

Steps to build a hash table with the chain address method
  1. Take the key of the data element and compute its hash function value (address). If the linked list at that address is empty, insert the element into this list; otherwise go to the next step to resolve the conflict
  2. The linked list at that address is not empty, so the conflict is resolved inside the list: insert the element into this list using head insertion or tail insertion
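The building steps above can be sketched with an array of head pointers and singly linked nodes. The class names, the hash function key mod m, and the use of head insertion are assumptions for illustration, not the article's own implementation.

```python
class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next


class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.heads = [None] * m              # array of m head pointers

    def _address(self, key):
        return key % self.m                  # assumed hash function

    def insert(self, key):
        addr = self._address(key)
        # Head insertion works the same whether the list is empty or not.
        self.heads[addr] = Node(key, self.heads[addr])

    def search(self, key):
        node = self.heads[self._address(key)]
        while node is not None:
            if node.key == key:
                return True
            node = node.next
        return False

    def delete(self, key):
        """Unlink the node directly; no deletion mark is needed."""
        addr = self._address(key)
        prev, node = None, self.heads[addr]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[addr] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False


table = ChainedHashTable(13)
for key in (14, 1, 27):                      # all three share address 1
    table.insert(key)
print(table.search(27), table.delete(1), table.search(1))
```

Because each list only ever holds keys with the same hash address, a search never compares against non-synonyms.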
Advantages of the chain address method:
  • Non-synonyms never conflict, so there is no accumulation ("clustering") and the average search length is shorter

  • Node space is allocated dynamically, which suits cases where the table length cannot be determined in advance

  • The open addressing method needs a fairly small load factor a to keep conflicts down, so a lot of space is wasted when the data set is large.

    With the zipper method the load factor a can be >= 1, and when the elements are large the extra pointer field is negligible, so space is saved

  • Deleting a node is easier to implement in a hash table built with the zipper method

Disadvantages of the zipper method

The pointers need additional space, so when the elements are small the open addressing method is more space-efficient. If the space saved on pointers is used to enlarge the hash table, the load factor drops, which reduces conflicts in the open addressing method and therefore speeds up the average search

4. Hash table search


For the keyword set (19,14,23,1,68,20,84,27,55,11,10,79), n = 12

ASL of sequential search on the unordered table? (n + 1) / 2 = 6.5

ASL of binary search on the ordered table? A little over 3 (37/12 ≈ 3.08)

So what is the ASL of a search on the hash table?
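One way to answer the question is to build the table and count comparisons. The original figures are missing, so the parameters here are assumptions: H(key) = key mod 13, a table of length 16 for linear probing, and tail insertion for the chain address method. The ASL values are for successful searches with every key equally likely.

```python
KEYS = [19, 14, 23, 1, 68, 20, 84, 27, 55, 11, 10, 79]


def asl_binary_search(n):
    """Successful-search ASL of binary search, level by level of the decision tree."""
    total, level, remaining = 0, 1, n
    while remaining > 0:
        nodes = min(2 ** (level - 1), remaining)
        total += nodes * level
        remaining -= nodes
        level += 1
    return total / n


def asl_linear_probing(keys, p=13, m=16):
    """Insert with linear probing; assumes len(keys) <= m so a free unit exists."""
    table = [None] * m
    total = 0
    for key in keys:
        addr, probes = key % p, 1
        while table[addr] is not None:
            addr = (addr + 1) % m
            probes += 1
        table[addr] = key
        total += probes                # finding the key later costs the same
    return total / len(keys)


def asl_chaining(keys, p=13):
    """Each key costs its position in its chain (tail insertion)."""
    chain_length = {}
    total = 0
    for key in keys:
        addr = key % p
        chain_length[addr] = chain_length.get(addr, 0) + 1
        total += chain_length[addr]
    return total / len(keys)


print((len(KEYS) + 1) / 2)             # sequential search
print(asl_binary_search(len(KEYS)))    # binary search
print(asl_linear_probing(KEYS))        # hash table, linear probing
print(asl_chaining(KEYS))              # hash table, chain address method
```

Under these assumptions linear probing gives an ASL of 30/12 = 2.5 and the chain address method gives 21/12 = 1.75, both below the 6.5 and roughly 3.08 of the two table searches.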

5. Analysis of the search efficiency of the hash table

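The average search length (ASL) is used to measure a search algorithm. For a hash table, ASL depends on

  • The hash function

  • The method of handling conflicts

  • The load factor (filling factor) of the hash table: a = the number of records filled in the table / the length of the hash table

    The larger a is, the more records the table holds and the fuller it is, so conflicts become more likely and a search needs more comparisons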
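To make the dependence on a concrete, the sketch below evaluates the standard textbook approximations for the ASL of a successful search under a uniform hash function. These formulas are not given in the text above; they are the usual results for each conflict handling method and are included here only as an illustration.

```python
import math


def asl_linear_probing(a):
    """Linear probing, successful search: (1 + 1 / (1 - a)) / 2, for a < 1."""
    return 0.5 * (1 + 1 / (1 - a))


def asl_random_probing(a):
    """Pseudo-random or quadratic probing and rehashing: -ln(1 - a) / a, for 0 < a < 1."""
    return -math.log(1 - a) / a


def asl_chaining(a):
    """Chain address method: 1 + a / 2 (a may exceed 1)."""
    return 1 + a / 2


for a in (0.5, 0.75, 0.9):
    print(a,
          round(asl_linear_probing(a), 2),
          round(asl_random_probing(a), 2),
          round(asl_chaining(a), 2))
```

None of the three expressions contains the number of elements n; only the load factor a appears, and the linear probing estimate grows fastest as the table fills up.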

6. Deletion of the hash table

When deleting from a hash table that uses the open address method to handle conflicts, you cannot simply empty the slot of the deleted element; otherwise the search path of the synonym elements that were inserted after it would be cut off. This is because, in all open address methods, an empty address unit is the condition for an unsuccessful search. Therefore the deleted element can only be marked as deleted; its slot cannot actually be emptied.

A hash table built with the zipper method is different: the node can be deleted directly
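A minimal sketch of deletion by marking in an open address table. The class name, the `DELETED` marker object, linear probing, and the hash function key mod m are assumptions made for illustration.

```python
EMPTY = None
DELETED = object()                     # deletion mark, distinct from any key


class LinearProbingTable:
    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        home = key % self.m            # assumed hash function
        for d in range(self.m):
            yield (home + d) % self.m

    def insert(self, key):
        # Simplified: assumes the key is not already stored.
        for addr in self._probe(key):
            if self.slots[addr] is EMPTY or self.slots[addr] is DELETED:
                self.slots[addr] = key  # marked slots can be reused
                return addr
        raise RuntimeError("table is full")

    def search(self, key):
        for addr in self._probe(key):
            slot = self.slots[addr]
            if slot is EMPTY:
                return None            # an empty unit means the search failed
            if slot is not DELETED and slot == key:
                return addr
        return None

    def delete(self, key):
        addr = self.search(key)
        if addr is None:
            return False
        self.slots[addr] = DELETED     # mark only, do not empty the slot
        return True


table = LinearProbingTable(7)
for key in (7, 14, 21):                # synonyms stored at addresses 0, 1, 2
    table.insert(key)
table.delete(14)
print(table.search(21))                # still found: the mark keeps the path intact
```

If `delete` reset the slot to `EMPTY` instead, the search for 21 would stop at address 1 and report a failure even though the key is still stored at address 2.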

7. Conclusion

  • Hash tables have good average performance, better than some traditional search techniques
  • The chain address method generally performs better than the open address method
  • As a hash function, the division-remainder method generally works better than other kinds of functions
  • In practice, the range of key values is much larger than the range of hash addresses
  • When there is a definite mapping between the keys of a group of data and their storage addresses, that group of data is suitable for hash table storage
  • In general, assuming that the hash function is uniform, it can be proved that hash tables built with different conflict resolution methods have different average search lengths.
  • The average search length of a hash table is not a function of the number of elements n but a function of the load factor a. Therefore, when designing a hash table, you can choose a suitable a to control its average search length.

Origin blog.csdn.net/weixin_46195957/article/details/111569845