Hash tables, hash functions, hash collisions, and application scenarios (all in one article)

1. What is a hash table?

        Hash table (HashTable): a data structure that builds on the array and accesses a memory storage location directly according to a key (Key).

        The principle is as follows: a hash function maps the key of an element to an array subscript (the converted value is called the hash value), and the record is stored at that subscript. To look up an element by key, we apply the same hash function to convert the key into an array subscript and read the data stored at that position.
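        The store-and-look-up flow described above can be sketched as follows. This is a minimal illustration, not a production design: the class name `TinyHashTable` and the default size of 16 are assumptions made for the example, Python's built-in `hash()` stands in for the hash function, and collisions are deliberately ignored (a second key landing on the same subscript simply overwrites the first).

```python
class TinyHashTable:
    """Illustrative only: fixed size, no collision handling."""

    def __init__(self, size=16):
        self.slots = [None] * size  # the underlying array

    def _index(self, key):
        # Map the key to an array subscript in the range [0, size).
        return hash(key) % len(self.slots)

    def put(self, key, value):
        # Store the record at the subscript computed from the key.
        self.slots[self._index(key)] = (key, value)

    def get(self, key):
        # Apply the same hash function, then read that position.
        entry = self.slots[self._index(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return None


table = TinyHashTable()
table.put(42, "answer")
print(table.get(42))  # answer
print(table.get(7))   # None
```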

2. What is a hash function?

2.1 Concept explanation

         A hash function is a mathematical algorithm that maps data from an unbounded range into a bounded range, ideally spreading it as evenly as possible. A hash table involves two key concepts: the hash function, and the hash collision discussed below.

         The hash function converts a key into a hash value. It has the following characteristics:

  • The hash value calculated by the hash function is a non-negative integer
  • If key1 == key2, then hash(key1) == hash(key2)
  • If key1 != key2, then ideally hash(key1) != hash(key2) (in practice this cannot always be guaranteed; see section 3)

2.2 Common hash functions

2.2.1 Direct addressing method

2.2.2 Remainder method

2.2.3 Digit analysis method

2.2.4 Mid-square method

2.2.5 Folding method (superposition method)

2.2.6 Random number method

        There are many ways to construct a hash function. In practice, the method should be chosen to suit the situation; the general principle is to produce as few collisions as possible. Factors usually considered include the length and distribution of the keys and the range of the hash values.

        When the key is an integer, the remainder method can be used; when the key is a decimal (fractional) number, the random number method is usually the better choice.
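        To make three of the methods listed above more concrete, here is a rough sketch of the remainder, mid-square, and folding methods for non-negative integer keys. The function names, the number of middle digits, and the part length are assumptions chosen for illustration; real implementations vary in these details.

```python
def division_hash(key: int, table_size: int) -> int:
    """Remainder method: the index is the key modulo the table size."""
    return key % table_size


def mid_square_hash(key: int, digits: int = 2) -> int:
    """Mid-square method: square the key and take the middle digits."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    return int(squared[start:start + digits])


def folding_hash(key: int, part_len: int, table_size: int) -> int:
    """Folding method: split the digits into parts, add them, then reduce."""
    text = str(key)
    parts = [int(text[i:i + part_len]) for i in range(0, len(text), part_len)]
    return sum(parts) % table_size


print(division_hash(1234, 13))            # 12
print(mid_square_hash(1234))              # 22  (1234 * 1234 = 1522756)
print(folding_hash(123456789, 3, 1000))   # 368 (123 + 456 + 789 = 1368)
```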

3. What is a hash collision, and how can collisions be avoided?

3.1 Conceptual understanding      

        Ideally, a hash function would always map different keys to different positions in the array, but in practice it is almost impossible to write such a hash function.

        For example: suppose an array has 26 positions, one for each letter of the English alphabet, and a simple hash function assigns positions by the first letter of the key. To store the price of apples, we naturally use the first position (Apple); to store the price of bananas, we use the second position (Banana). So far everything goes smoothly. But when we store the price of avocados (Avocado), the assigned position is the first one again, which is already occupied by apples. This is a collision.

        In short: a hash collision (also called a hash conflict) is the case where key1 != key2 but, after processing by the hash function, hash(key1) == hash(key2). No matter how well a hash function is designed, collisions cannot be avoided entirely. The reason is that hash values are non-negative integers drawn from a finite set, while the keys that may need to be processed in the real world are unlimited; when an unlimited set of keys is mapped onto a finite set of values, some keys must share a value.
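        The 26-position example can be reproduced in a few lines. This sketch assumes keys begin with an English letter; the function name `first_letter_hash` is made up for the example. Note that the code counts from 0, so the "first position" in the text is index 0 here.

```python
def first_letter_hash(name: str) -> int:
    """Map a word to 0..25 by its first letter (assumes it starts with A-Z or a-z)."""
    return ord(name[0].lower()) - ord("a")


print(first_letter_hash("apple"))    # 0
print(first_letter_hash("banana"))   # 1
print(first_letter_hash("avocado"))  # 0, the same slot as "apple": a collision
```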

3.2 How to avoid hash collisions?

(1) Lower fill factor (load factor)

        Fill factor formula: fill factor = number of elements in the hash table / total number of positions

        For example: if the array has five positions and two of them are occupied, the fill factor = 2/5 = 0.4

        If the fill factor equals 1, the table is exactly full; if it is greater than 1, there are more elements than positions in the array, and positions must be added (the length must be adjusted). The lower the fill factor, the less likely collisions are and the better the hash table performs. A good rule of thumb: once the fill factor exceeds 0.7, the hash table should be resized.
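        The formula and the 0.7 rule of thumb translate directly into code. This is only a sketch: the function names are hypothetical, and the threshold is the rule of thumb from the text rather than a value that every implementation uses.

```python
def load_factor(num_items: int, num_slots: int) -> float:
    """Fill factor = number of elements / total number of positions."""
    return num_items / num_slots


def needs_resize(num_items: int, num_slots: int, threshold: float = 0.7) -> bool:
    """Apply the rule of thumb: resize once the fill factor exceeds the threshold."""
    return load_factor(num_items, num_slots) > threshold


print(load_factor(2, 5))    # 0.4
print(needs_resize(2, 5))   # False
print(needs_resize(4, 5))   # True (4/5 = 0.8 is above 0.7)
```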

(2) Good hash function

        A good hash function distributes values evenly across the array, spreading them over as wide a range of positions as possible. A bad hash function lets values clump together, which causes a lot of collisions.

Collisions still occur even with a good hash function. Common ways to handle them:

* Open address method:

If, after the key is hashed, the computed position turns out to be occupied, 1 is added to the address repeatedly until an empty position is found.

* Re-hashing:

After a "collision" occurs, part of the key can be hashed again to obtain another position.

* Chain address method (separate chaining):

The chain address method links all entries whose keys map to the same address into a linked list. It is a commonly used method.
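The open address method (in its add-1 form, usually called linear probing) and the chain address method can be sketched side by side as below. Several things here are assumptions made for illustration: the class names, the table size of 8, the use of Python's built-in `hash()`, and the use of a Python list per slot in place of a true linked list. Deletion and resizing are left out to keep the sketch short, and re-hashing is not shown.

```python
class LinearProbingTable:
    """Open addressing: on a collision, try the next slot (wrapping around)."""

    def __init__(self, size=8):
        self.slots = [None] * size

    def put(self, key, value):
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            i = (start + step) % len(self.slots)
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            i = (start + step) % len(self.slots)
            if self.slots[i] is None:
                return None  # reached an empty slot: the key is absent
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None


class ChainedTable:
    """Chaining: each slot holds every (key, value) pair mapped to it."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for n, (k, _) in enumerate(bucket):
            if k == key:
                bucket[n] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for k, v in bucket:
            if k == key:
                return v
        return None


# Keys 1 and 9 both map to index 1 when the table has 8 slots.
for table in (LinearProbingTable(), ChainedTable()):
    table.put(1, "one")
    table.put(9, "nine")
    print(table.get(1), table.get(9))  # one nine
```

With linear probing, the colliding key 9 ends up in the neighbouring slot; with chaining, both entries share the same bucket.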

 

4. What are the application scenarios?

4.1. Using hash tables for lookups

        Hash tables are well suited to modeling mapping relationships and looking values up through them. For example, a phone's contact list maps a name to a phone number, so the number can be found by name; similarly, a domain name is mapped to an IP address so that a website can be found (as in DNS resolution).
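        In Python, the built-in `dict` is a hash table, so the contact-list case can be sketched directly with it. The names and numbers below are made-up sample data.

```python
# Map each name to a phone number (sample data, not real numbers).
phone_book = {"Alice": "555-0100", "Bob": "555-0199"}

print(phone_book["Alice"])                   # 555-0100
print(phone_book.get("Carol", "not found"))  # not found
```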

4.2 Prevent duplication

        For example, in a voting system, a hash table can record who has already voted in order to prevent repeated votes. First create a hash table to hold the people who have voted. When someone tries to vote, check whether they are in the hash table: if they are, return true (they have already voted); otherwise, record them and return false.
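        A minimal sketch of the voting check, again using a Python `dict` as the hash table. The function name `check_voter` is hypothetical, and the return values follow the convention in the text (true means the person has already voted).

```python
voted = {}  # name -> True for everyone who has already voted


def check_voter(name: str) -> bool:
    """Return True if name already voted; otherwise record the vote and return False."""
    if name in voted:
        return True
    voted[name] = True
    return False


print(check_voter("tom"))   # False (first vote is recorded)
print(check_voter("mike"))  # False
print(check_voter("tom"))   # True  (repeat vote detected)
```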

4.3 Using Hash Tables for Caching

        Caching is a commonly used way to speed things up. Large websites rely heavily on caching, and cached data is typically stored in a hash table. When you visit a website, the server first checks whether the page is stored in the hash table; if it is, the cached data is sent back, and if it is not, the server does the processing to generate it.

Reference URL: Hash Tables, Hash Functions and Hash Conflicts - Long Fu - Blog Park

Origin blog.csdn.net/qq_35207086/article/details/123348852