Hash table and hash algorithm

Hash table

A hash table is a data structure that accesses a storage location in memory directly from a key.

It works as follows: the element's key is mapped to an array index through a hash function (the converted value is called a hash value), and the record's value is stored at that index. When we look up an element by key, we use the same hash function to convert the key into the array index and read the data from that position.
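The mapping from key to array index can be sketched in a few lines of Python. This is only an illustration: the names simple_hash, put, and get, the table size of 16, and the byte-sum hash are all assumptions made for the example, and collisions are deliberately ignored here.

```python
def simple_hash(key: str) -> int:
    """Toy hash function (assumed for illustration): sum of the key's UTF-8 bytes."""
    return sum(key.encode("utf-8"))


SIZE = 16               # number of slots in the backing array (arbitrary choice)
slots = [None] * SIZE   # the plain array that the table is built on


def put(key: str, value) -> None:
    index = simple_hash(key) % SIZE   # hash value -> array index
    slots[index] = (key, value)       # collisions are ignored in this sketch


def get(key: str):
    index = simple_hash(key) % SIZE   # same function, so same index
    entry = slots[index]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None


put("test", 123)
print(get("test"))  # 123
```

Both put and get compute one hash and touch one array slot, which is where the O(1) behaviour comes from.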

A hash table relies on the array's support for random access by index, so it is really an extension of the array. It can be said that without arrays there would be no hash tables. PHP's associative arrays are built on hash tables.

Hashing is both a storage method and a search method. Unlike other search methods, it maintains no logical relationship between records, so it is mainly a search-oriented data structure. The problem it suits best is finding the record whose key equals a given value.

PHP developers should be familiar with hash tables, because the arrays we use every day are based on them. For example, for the code $arr['test'] = 123, PHP converts the key test into a hash code through a hash function under the hood, and then maps the value 123 to that hash code. Leaving hash collisions aside, the time complexity of search, delete, and insert in a hash table is O(1), which is very efficient.

There are two key concepts in a hash table: the hash function and the hash collision.

Hash function

A hash function processes a key and converts it into a hash value. It has the following characteristics:

  • the hash value it computes is a non-negative integer;
  • if key1 == key2, then hash(key1) == hash(key2);
  • if key1 != key2, then ideally hash(key1) != hash(key2); in practice this last property cannot be fully guaranteed (see Hash collision below).
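These properties can be checked on a toy function. The byte_sum_hash below is an assumption chosen for illustration, not a function anyone should use in practice; it satisfies the first two properties but breaks the third for keys that are permutations of each other.

```python
def byte_sum_hash(key: str) -> int:
    """Toy hash function (assumed for illustration): sum of the key's UTF-8 bytes."""
    return sum(key.encode("utf-8"))


# Property 1: the hash value is a non-negative integer.
print(byte_sum_hash("ab") >= 0)                        # True

# Property 2: equal keys always give equal hash values.
print(byte_sum_hash("ab") == byte_sum_hash("ab"))      # True

# Property 3 is only an ideal: "ab" and "ba" are different keys,
# yet both sum to 97 + 98 = 195.
print(byte_sum_hash("ab"), byte_sum_hash("ba"))        # 195 195
```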

Hash collision

A hash collision, put simply, is when key1 != key2 but the hash function yields hash(key1) == hash(key2). Even a well-designed hash function cannot avoid collisions: hash values are non-negative integers drawn from a finite range, while the keys to be processed in the real world are unlimited, and mapping unlimited data onto a finite set inevitably produces collisions.

If hash collisions are ignored, lookup in a hash table is very efficient: the time complexity is O(1), better than the O(log n) of binary search. Because collisions cannot be avoided, however, the actual lookup cost depends on how many collisions occur. In the worst case it is O(n), which degenerates into sequential search, and a poorly designed hash function makes this case more likely.

Hash function design

To reduce hash collisions and improve the efficiency of hash table operations, it is important to design a good hash function. The md5 function that we commonly use is a hash function, but there are many other designs. Different scenarios call for different hash functions to reduce collisions. The hash function itself should also be simple, otherwise executing it becomes the bottleneck of the hash table. We rarely design hash functions ourselves, but a basic understanding is still useful.

There are usually the following methods for constructing hash functions:

  • Direct addressing method: f(key) = a*key + b, where f is the hash function, a and b are constants, and key is the key value
  • Digit analysis method: apply operations such as left shift, right shift, or reversal to the digits of the key to obtain the hash value
  • Division remainder method: f(key) = key % p, where p is the number of containers. This method is usually used to decide which container a piece of data goes into, for example which table a row is inserted into after a table has been split (p is the number of tables after the split), or which server a distributed Redis deployment stores a key on (p is the number of Redis servers)
  • Random number method: f(key) = random(key), used for randomized mechanisms such as load balancing
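As a rough sketch, the direct addressing and division remainder methods translate almost literally into code. The function names and the sample user IDs below are made up for the example.

```python
def direct_addressing(key: int, a: int = 1, b: int = 0) -> int:
    """f(key) = a*key + b, with constants a and b."""
    return a * key + b


def division_remainder(key: int, p: int) -> int:
    """f(key) = key % p, where p is the number of containers."""
    return key % p


# Hypothetical example: route user IDs to one of 4 split tables.
for user_id in (1001, 1002, 1003, 1004):
    print(user_id, "-> table", division_remainder(user_id, 4))
# 1001 -> table 1
# 1002 -> table 2
# 1003 -> table 3
# 1004 -> table 0
```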

These are only a few hash function designs for common scenarios. There are many other design methods, which are not listed here.

Hash collision handling

Even a well-designed hash function cannot completely avoid hash collisions. We can only optimize our implementation to keep them to a minimum. What should we do when a collision does occur? Here are some approaches:

  • Open addressing method: this method comes in three variants: linear probing, quadratic probing, and random probing. Linear probing means that after a collision we look for the next empty slot, moving forward one slot at a time; quadratic probing follows the same logic, except that the offsets from the original position are squares (1, 4, 9, ...); random probing uses a random step each time. Whichever variant is used, the probability of collision rises when few free slots remain. To keep operations efficient, we try to keep a certain proportion of the slots free. The load factor expresses how full the table is: load factor = number of stored elements / length of the hash table. The larger the load factor, the fewer free slots, the more collisions, and the worse the performance of the hash table.
  • Rehashing method: after a collision occurs, switch to another hash function to compute the hash value
  • Chain address method: after a collision occurs, the new data is linked after the data already stored at that hash value, so that all elements with the same hash value sit in a linked list attached to the same slot. The chain address method guarantees that all data can be stored even when there are many collisions, but traversing the singly linked list introduces a performance cost.
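A minimal sketch of open addressing with linear probing, under some simplifying assumptions: the class name LinearProbingTable is invented for the example, Python's built-in hash() stands in for the hash function, the capacity is fixed, and deletion (which needs tombstone markers) is left out.

```python
class LinearProbingTable:
    """Open addressing with a probe step of 1 (illustrative sketch only)."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.slots = [None] * capacity   # each slot is None or a (key, value) pair
        self.count = 0

    def _index(self, key) -> int:
        return hash(key) % self.capacity

    def load_factor(self) -> float:
        # stored elements / table length
        return self.count / self.capacity

    def put(self, key, value) -> None:
        i = self._index(key)
        for _ in range(self.capacity):
            if self.slots[i] is None:            # free slot: insert here
                self.slots[i] = (key, value)
                self.count += 1
                return
            if self.slots[i][0] == key:          # same key: overwrite
                self.slots[i] = (key, value)
                return
            i = (i + 1) % self.capacity          # collision: try the next slot
        raise RuntimeError("table is full; a real table would resize first")

    def get(self, key):
        i = self._index(key)
        for _ in range(self.capacity):
            if self.slots[i] is None:            # hit an empty slot: key is absent
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % self.capacity
        return None


table = LinearProbingTable(capacity=8)
for key, value in ((1, "a"), (9, "b"), (17, "c")):   # 1, 9, 17 all map to slot 1
    table.put(key, value)
print(table.get(17), table.load_factor())
```

The three keys end up in slots 1, 2, and 3, which shows how collisions lengthen the probe sequence as the load factor grows.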

Summary

With the above in mind, you should have a good idea of how to build an industrial-grade hash table. The main considerations are:

  • Choose a reasonable hash function that is not so complicated that it becomes a performance bottleneck;
  • Set a load factor threshold and support dynamic expansion. The threshold must balance time and space costs;
  • If expanding in one go takes too long, adopt an incremental expansion strategy. When the threshold is reached, only allocate the new space and do not move any data. On each later insert, move one piece of old data, so that the migration completes gradually. To keep lookups working across the old and new tables, check the new table first and then the old table;
  • Choice of collision resolution method: open addressing suits small data sets with a small load factor (less than 1); the linked list method can tolerate a load factor greater than 1, suits large objects and large hash tables, and is more flexible, supporting more optimization strategies.
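The load factor threshold and the incremental expansion strategy can be sketched as below. This is one possible arrangement, not a reference implementation: the class name IncrementalTable, the threshold of 0.75, the doubling policy, and the use of Python lists as chains are all assumptions, and the sketch migrates one old bucket per insert rather than exactly one element.

```python
class IncrementalTable:
    """Chained hash table that migrates data gradually after expansion (sketch)."""

    def __init__(self, capacity: int = 4, max_load: float = 0.75):
        self.new = [[] for _ in range(capacity)]   # table that receives new inserts
        self.old = None                            # table being drained, or None
        self.migrate_pos = 0                       # next old bucket to move
        self.count = 0
        self.max_load = max_load

    @staticmethod
    def _bucket(buckets, key):
        return buckets[hash(key) % len(buckets)]

    def _migrate_one_bucket(self) -> None:
        if self.old is None:
            return
        for k, v in self.old[self.migrate_pos]:
            self._bucket(self.new, k).append((k, v))
        self.old[self.migrate_pos] = []
        self.migrate_pos += 1
        if self.migrate_pos == len(self.old):      # migration finished
            self.old = None

    def put(self, key, value) -> None:
        # Threshold reached: allocate the bigger table, but move nothing yet.
        if self.old is None and self.count >= self.max_load * len(self.new):
            self.old = self.new
            self.new = [[] for _ in range(2 * len(self.old))]
            self.migrate_pos = 0
        self._migrate_one_bucket()                 # a little migration per insert
        for buckets in (self.new, self.old):       # update in place if key exists
            if buckets is None:
                continue
            bucket = self._bucket(buckets, key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)
                    return
        self._bucket(self.new, key).append((key, value))
        self.count += 1

    def get(self, key):
        for buckets in (self.new, self.old):       # new table first, then old
            if buckets is None:
                continue
            for k, v in self._bucket(buckets, key):
                if k == key:
                    return v
        return None


table = IncrementalTable()
for n in range(20):
    table.put(n, n * n)
print(table.get(7), len(table.new))
```

No single insert pays for moving the whole table, at the cost of lookups that may have to check two tables while a migration is in progress.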

Figure: the chain address method for handling hash collisions.
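In code, the chain address method might look roughly like the sketch below. The class name ChainedTable is invented for the example, Python's built-in hash() stands in for the hash function, and plain Python lists play the role of the singly linked lists.

```python
class ChainedTable:
    """Chain address method: every slot holds a chain of (key, value) pairs."""

    def __init__(self, capacity: int = 4):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def load_factor(self) -> float:
        return self.count / len(self.buckets)

    def put(self, key, value) -> None:
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:                  # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))        # collision or empty slot: extend the chain
        self.count += 1

    def get(self, key):
        for k, v in self._chain(key):     # walk the chain for this slot
            if k == key:
                return v
        return None


table = ChainedTable(capacity=4)
for key in (1, 5, 9, 2, 6, 10):           # 1, 5, 9 share slot 1; 2, 6, 10 share slot 2
    table.put(key, str(key))
print(table.get(9), table.load_factor())
```

Six elements fit in four slots here, so the load factor is 1.5; lookups still work, but each one has to walk a chain of three entries.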

The concept and characteristics of the hash algorithm

We covered hash tables, hash functions, and hash collisions earlier. Put simply, a hash algorithm is an algorithm that implements the hash function described above: it maps a binary string of any length to a binary string of fixed length. The binary value obtained from the mapping is the hash value.

The most common use of a hash algorithm in our daily development is computing a digest of data with the md5 function. MD5 is a hash function. Using it as an example, we can summarize the general characteristics of a hash algorithm:

  • The original data cannot be derived from the hash value in reverse (so the hash algorithm is also called a one-way algorithm, which is irreversible);
  • Very sensitive to input data: even if only one bit of the original data is modified, the resulting hash value is very different;
  • The probability of a hash collision is very small: different original data rarely produce the same hash value;
  • The hash algorithm should execute as efficiently as possible, so that the hash value of even a long text is computed quickly
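Two of these characteristics, fixed-length output and sensitivity to input, are easy to observe with Python's standard hashlib module. The helper name md5_hex and the sample strings are chosen only for this illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of text as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Fixed length: a 1-character input and a 100000-character input
# both produce a 32-character hexadecimal digest (128 bits).
print(len(md5_hex("a")), len(md5_hex("a" * 100000)))   # 32 32

# Sensitivity: "d" and "e" differ in a single bit, yet the digests look unrelated.
first = md5_hex("hello world")
second = md5_hex("hello worle")
print(first)
print(second)

# Count how many of the 128 output bits changed.
changed_bits = bin(int(first, 16) ^ int(second, 16)).count("1")
print(changed_bits)
```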

Application of hash algorithm

1. Scenario 1: Secure encryption

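User passwords are usually stored as hashes produced by functions such as md5 or sha, because the result is irreversible and even similar passwords yield very different hashes. Note that plain MD5 and SHA-1 are now considered too weak for password storage; salted, deliberately slow algorithms such as bcrypt (available in PHP through password_hash) are the safer choice.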

2. Scenario 2: Unique identification

For example, suppose a URL field or picture field must not contain duplicates. We can apply md5 to the field value to normalize the data to a 32-character string, which works better for building database indexes and for queries. Binary data such as pictures can likewise be processed with md5 to obtain a unique identifier, which makes detecting duplicate files faster.
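A rough sketch of duplicate detection by digest, using Python's hashlib. The names fingerprint, index, and add_file, as well as the sample file names and bytes, are hypothetical. Since collisions are possible in principle, a careful implementation would compare the actual contents whenever two digests match.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """32-character MD5 hex digest used as the identifier of the content."""
    return hashlib.md5(data).hexdigest()


index = {}   # fingerprint -> name of the first file seen with that content


def add_file(name: str, data: bytes):
    """Return the name of an earlier file with the same digest, or None if new."""
    fp = fingerprint(data)
    if fp in index:
        return index[fp]
    index[fp] = name
    return None


print(add_file("cat.png", b"fake image bytes"))    # None
print(add_file("copy.png", b"fake image bytes"))   # cat.png
print(add_file("dog.png", b"other image bytes"))   # None
```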

3. Scenario 3: Data verification

For example, many files that we download from the Internet (especially from P2P sites) come with an MD5 value, which is used to verify the integrity of the downloaded data and to detect hijacking or tampering in transit.
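The verification step could look like the following sketch, which uses Python's hashlib and a temporary file in place of a real download. The function names, the file name download.bin, and the file contents are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large downloads need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_md5: str) -> bool:
    return md5_of_file(path) == expected_md5.lower()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "download.bin")
    with open(path, "wb") as f:
        f.write(b"original content")
    published = hashlib.md5(b"original content").hexdigest()  # value shown on the site

    ok_before = verify(path, published)
    with open(path, "ab") as f:          # simulate tampering: append one byte
        f.write(b"!")
    ok_after = verify(path, published)

print(ok_before, ok_after)  # True False
```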

4. Scenario 4: Hash function

As mentioned earlier, PHP functions such as md5, sha1, and hash compute hash values using hash algorithms.

5. Scenario 5: Load balancing

Requests from the same client, especially from logged-in users, need to be routed to the same machine within a session to keep data consistent. This can be achieved with a hash algorithm: take the trailing digits of the user ID (how many digits depends on the number of machines) modulo the total number of machines, and use the result as the machine number.
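A small sketch of this routing rule. The function name route, the choice of two trailing digits, and the sample user ID are illustrative assumptions.

```python
def route(user_id: int, machine_count: int, tail_digits: int = 2) -> int:
    """Pick a machine from the trailing digits of the user ID."""
    tail = user_id % (10 ** tail_digits)   # e.g. 10234567 -> 67
    return tail % machine_count            # machine number in 0 .. machine_count - 1


# The same user always lands on the same machine.
print(route(10234567, 4))  # 3
print(route(10234567, 4))  # 3
```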

6. Scenario 6: Distributed cache

A distributed cache differs from other distributed machines or databases in that each machine stores different cached data. When cache machines are added, the cached data must be redistributed across the machines (fully or partially). This, too, relies on the idea of a hash algorithm.
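To see why expansion forces redistribution, the sketch below counts how many keys change machine when a modulo-based cache grows from 4 to 5 machines. The integer keys 0 to 999 and the machine counts are assumptions picked for the illustration.

```python
def machine_for(key: int, machine_count: int) -> int:
    """Plain modulo placement of a cache key."""
    return key % machine_count


keys = range(1000)
moved = sum(1 for k in keys if machine_for(k, 4) != machine_for(k, 5))
print(moved, "of", len(keys), "keys move to a different machine")
# 800 of 1000 keys move to a different machine
```

With plain modulo placement most cached entries end up on a different machine after expansion, which is why the cache has to be rebuilt or migrated.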

Origin www.cnblogs.com/stringarray/p/12717110.html