Hash algorithm summary

I recently started learning Redis, and while reading about its hash type I looked into the underlying implementation and found that Redis uses MurmurHash as its hash function. It was the first time I had heard of this algorithm, and I realized how little I knew about the hash functions commonly used to compute hash values, so I collected the more principled material available online:

Introduction

By definition, a hash function can serve as a pseudo-random number generator (PRNG). From this perspective a well-accepted conclusion follows: hash functions can be compared against each other by measuring the quality of their output as pseudo-random generators.

Standard analysis techniques, such as the Poisson distribution, can be used to analyze the collision rates of different hash functions on different data. Generally speaking, for any given data set there exists a theoretically perfect hash function: one that produces no collisions at all, i.e. no duplicate hash values. In reality such a perfect function is hard to find, and a function that is perfect for one data set is of limited use elsewhere. So in practice, a "perfect" hash function is usually taken to mean one that produces the fewest collisions on a specific data set.

The difficulty is that data comes in many shapes: some of it is highly random, while some has high-dimensional, graph-like structure. This makes it very hard to find a universal hash function; even for one specific type of data, finding a good hash function is not easy. All we can do is find, by trial and error, a hash function that meets our requirements. A hash function can be evaluated from the following two angles:

1. Data distribution

One measure is whether a hash function distributes the hash values of a data set evenly. To perform this analysis you need to know the number of colliding hash values. If collisions are handled with linked lists (separate chaining), you can analyze the average chain length; alternatively, you can analyze the number of distinct hash values produced.
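As a concrete sketch of this kind of analysis (the bucket count, sample keys, and the use of Java's built-in `String.hashCode` are illustrative assumptions, not part of any particular library):

```java
import java.util.HashMap;
import java.util.Map;

public class ChainLengthDemo {
    // Bucket a set of keys by hash value and report the average chain length
    // over the non-empty buckets, as a rough measure of distribution quality.
    static double averageChainLength(String[] keys, int buckets) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (String k : keys) {
            int slot = Math.floorMod(k.hashCode(), buckets); // any hash function could be plugged in here
            counts.merge(slot, 1, Integer::sum);
        }
        return (double) keys.length / counts.size();
    }

    public static void main(String[] args) {
        String[] keys = {"redis", "hash", "table", "murmur", "prime", "shift"};
        System.out.println(averageChainLength(keys, 8));
    }
}
```

A value close to 1.0 means few collisions; larger values mean longer chains and a worse distribution on this data set.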

2. The efficiency of the hash function

Another measure is how efficiently the hash function computes a hash value. The cost of computing the hash is usually assumed to be O(1), which is why lookup in a hash table is said to have "average O(1) complexity", whereas in other commonly used associative structures, such as maps (usually implemented as red-black trees), lookup is considered O(log n).

A good hash function should be very fast, stable, and deterministic. A hash function over a string cannot truly reach O(1), since it must scan the string linearly, but in practice its input is usually a short primary-key identifier, so the whole process is very fast and, to some extent, stable.

The hash functions introduced in this article are so-called simple hash functions. They are usually used to hash string data and to generate keys for associative containers such as hash tables. They are not cryptographically secure: it is easy to produce exactly the same hash value from different data by reversing and recombining inputs.


Hash Methodology

Hash functions are usually defined by their method of generating hash values. There are two main methods:

1. Hash based on addition and multiplication

This method traverses the elements of the data and repeatedly updates an accumulator, each time adding a value derived from the current element. Usually the accumulator is multiplied by a prime number at each step.
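A minimal sketch of this pattern (the multiplier 31 is just one common prime choice, and the method name is made up for illustration):

```java
public class AdditiveHashDemo {
    // Additive-multiplicative scheme: multiply the running state by a prime,
    // then add the next element of the data.
    static long additiveHash(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = hash * 31 + str.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(additiveHash("abc")); // 96354, the same value Java's String.hashCode gives
    }
}
```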

2. Shift-based hashing

Like the additive hash, a shift-based hash also uses every element of the string data, but instead of addition it relies mainly on bit shifting, usually a combination of left and right shifts, where the shift amounts are often prime numbers. The result of each shift step is folded into an accumulator with some additional arithmetic, and the accumulated value after the last element is the final hash.
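A minimal sketch of the shift-based pattern (the shift amounts 5 and 2 mirror the JS hash listed later in this article; the function name is illustrative):

```java
public class ShiftHashDemo {
    // Shift-based scheme: fold each element in via XOR of shifted copies of the state.
    static long shiftHash(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash ^= (hash << 5) + (hash >> 2) + str.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(shiftHash("abc"));
    }
}
```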


Hash function and prime numbers

No one has proved the relationship between prime numbers and pseudo-random number generators, but the best results known today use primes. Pseudo-random generation is currently treated statistically rather than as something with a definite theory, so analysis can only characterize the overall results, not explain how those results arise. If we could study this more rigorously, we might better understand which values are more effective, why primes work better than other numbers, and why some primes do not work at all. If we could answer these questions with reproducible proofs, we could design better pseudo-random number generators, and perhaps better hash functions too.

The basic idea behind using primes in hash functions is to use a prime, rather than some other kind of number, to alter the state value being processed. "Process" here means applying some simple operation to the hash value, such as multiplication or addition. The new hash value obtained this way should be statistically higher in entropy, meaning less biased. Put simply, when you multiply a collection of random numbers by a prime, the probability that any given bit of the result is 1 should be close to 0.5. There is no concrete proof that this unbiasedness occurs only with primes; it seems to be an intuition taken as self-evident and followed by some practitioners in the industry.
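This claim is easy to check empirically. The sketch below (the prime, sample count, and seed are arbitrary choices of mine) multiplies random 32-bit values by a prime and measures the overall fraction of result bits that are set; unbiased output would sit near 0.5:

```java
import java.util.Random;

public class BitBiasDemo {
    // Multiply random 32-bit values by a prime and measure what fraction
    // of all result bits are set. Unbiased output would give ~0.5.
    static double setBitFraction(int prime, int samples, long seed) {
        Random rnd = new Random(seed);
        long setBits = 0;
        for (int i = 0; i < samples; i++) {
            setBits += Integer.bitCount(rnd.nextInt() * prime);
        }
        return setBits / (32.0 * samples);
    }

    public static void main(String[] args) {
        System.out.println(setBitFraction(31, 100_000, 42L)); // prints a value close to 0.5
    }
}
```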

Deciding which primes work well, and how best to combine them in a hash function, remains something of a dark art. No single method can claim to be the ultimate general-purpose hash function. The best one can do is evolve a suitable hash algorithm through trial and error, guided by statistical analysis.


Bit bias

Whether a bit-sequence generator is purely random or deterministic to some degree, it produces each bit in one state or the opposite state with a certain probability; this probability is the bit bias. For a purely random source, the bias toward high or low should be 50% for every bit.

In a pseudo-random generator, the algorithm determines the bit bias within the generator's smallest block of output.

Suppose a PRNG produces 8-bit blocks as its output and, for some reason, the MSB is always set high: the bit bias of the MSB is then 100% toward high. The consequence is that although this PRNG has 256 possible output values, values below 128 are never generated. Assuming for simplicity that the other bits are purely random, every value between 128 and 255 is generated with equal probability, while any value below 128 has a 0% chance of appearing.
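A toy generator makes this concrete (the construction is purely illustrative):

```java
import java.util.Random;

public class MsbBiasDemo {
    // An 8-bit generator whose MSB is always forced high: the bias of the
    // MSB is 100%, so values below 128 can never be produced.
    static int biasedByte(Random rnd) {
        return rnd.nextInt(256) | 0x80;
    }

    public static void main(String[] args) {
        Random rnd = new Random(1L);
        for (int i = 0; i < 5; i++) {
            System.out.println(biasedByte(rnd)); // always in [128, 255]
        }
    }
}
```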

All PRNGs, whether hash functions, ciphers, m-sequences, or any other generator that produces a bitstream, exhibit bit bias of this kind. Most PRNGs try to converge the bit bias to 50%; stream ciphers are one example, while some other generators perform acceptably with a less tightly controlled bias.

Mixing, or bit-sequence scrambling, is a way of driving the bit bias of a stream toward a uniform 50%, although care must be taken that the mixing does not diverge. A form of mixing used in cryptography is called an avalanche: one block of bits is substituted or mixed with another to produce an output block, which is in turn rapidly mixed with further blocks.

The avalanche process begins with one or more blocks of binary data. Bit operations on that data (typically input-sensitive, bit-reducing logic) produce a layer-i slice of data; the process is then repeated on layer i to produce layer i+1, where the number of bits in each layer is less than or equal to the number in the previous layer.

This iterative process ultimately yields bits that depend on all the bits of the original data.
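An avalanche property can be tested by flipping a single input bit and counting how many output bits change; a strong mixer flips about half of them on average. The sketch below uses a murmur-style multiply-and-shift mixing step (the multiplier comes from the MurmurHash2 code quoted at the end of this article; the shift amounts here are illustrative choices):

```java
public class AvalancheDemo {
    // A murmur-style mixing step: multiply by a prime, XOR in shifted copies.
    static int mix(int x) {
        final int m = 0x5bd1e995; // multiplier from MurmurHash2
        x *= m;
        x ^= x >>> 24;
        x *= m;
        x ^= x >>> 13;
        return x;
    }

    // How many output bits change when a single input bit is flipped?
    static int flippedOutputBits(int input, int bit) {
        return Integer.bitCount(mix(input) ^ mix(input ^ (1 << bit)));
    }

    public static void main(String[] args) {
        for (int bit = 0; bit < 8; bit++) {
            System.out.println(flippedOutputBits(12345, bit)); // each flip changes many output bits
        }
    }
}
```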

 

Various forms of hash

Hashing is a tool for mapping real-world data to an identifier. The following are some common application areas of hash functions:

1. String hash

In the field of data storage, string hashing is mainly used for data indexing and as the structural support of containers such as hash tables.

2. Cryptographic hash

Used for data and user verification and authentication. A strong cryptographic hash function makes it difficult to recover the original data from the hash value. Cryptographic hash functions are used to hash users' passwords, so that the hash can be stored on the server in place of the password itself. A cryptographic hash function can also be regarded as an irreversible compression function: it can represent a large amount of data with a short digest, it is very useful for detecting whether data has been tampered with (MD5 is a common example), and it can serve as a signature proving a file's authenticity in combination with other means.

3. Geometric hash

This kind of hashing is used in computer vision for the detection and classification of objects in arbitrary scenes.

The initial selection process involves choosing a region or object of interest. From there, a set of affine-invariant features is extracted using robust feature-detection algorithms such as the Harris corner detector (HCD), the Scale-Invariant Feature Transform (SIFT), or Speeded-Up Robust Features (SURF), and this set is taken as representative of the object or region. The set is sometimes called a macro-feature, or a constellation of features. Given the nature of the features found, it may still be possible to match two constellations even when the two sets differ slightly (for example, with missing or outlier features); the constellations are then said to fall into the same classification.
The hash value is computed from the features of the constellation. This is usually done by first defining the space the hash value is to live in; the hash value here is a multi-dimensional value, normalized for the defined space. Alongside computing the hash value, another process is needed to determine the distance between two hash values: a distance metric is required, rather than a deterministic equality operator, because of the possible gaps between constellation hash values. Also, because a simple Euclidean distance metric is essentially ineffective here, automatically determining a distance metric for a particular space has become an active field of academic research, dealing with the nonlinear nature of such spaces.
A typical example of geometric hashing is re-detecting various classes of cars in arbitrary scenes. The level of detection can vary: from merely detecting that something is a vehicle, to identifying a specific type of vehicle, to identifying a specific individual vehicle.

4. Bloom filter

 

Bloom filters allow a very large range of values to be represented by a much smaller block of memory. In computer science this corresponds to the well-known membership query, and it is the core idea behind associative containers.

A Bloom filter is implemented with several different hash functions, and it allows membership queries to carry a controlled error probability in exchange for space. What the Bloom filter guarantees is that a membership query will never produce a false negative, but false positives are possible. The false-positive probability can be controlled by varying the size of the table used by the Bloom filter and the number of distinct hash functions.
Subsequent research on hash functions, hash tables, and Bloom filters by Mitzenmacher and others suggests that for this structure, the entropy of the data being hashed contributes to the entropy of the hash functions. This leads to a theoretical result defining an optimal Bloom filter (one that, for a given table size, provides the lowest false-positive probability, or vice versa): users can construct the required hash functions from just two pairwise-independent base hash functions, greatly improving query efficiency.
Bloom filters usually exist in applications such as spell checkers, string matching algorithms, network packet analysis tools, and network/Internet caching.
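A minimal sketch of such a filter, using a two-base-hash construction (the way the second hash is derived below is an arbitrary illustration of mine, not the exact scheme from the literature):

```java
import java.util.BitSet;

public class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int k; // number of hash functions

    BloomSketch(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Derive the i-th hash from two base hashes: h_i = h1 + i * h2.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // a second, derived hash, kept odd
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < k; i++) bits.set(index(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(key, i))) return false; // definitely absent: no false negatives
        }
        return true; // possibly present: false positives are allowed
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch(1024, 3);
        filter.add("redis");
        System.out.println(filter.mightContain("redis")); // true
        // Keys never added may occasionally report true as well (false positives).
    }
}
```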

 

Commonly used hash functions

The general-purpose hash function library below contains several string hash algorithms that mix addition and bit operations. These algorithms differ in usage and purpose, but all of them can serve as examples for learning how hash algorithms are implemented. (See the download for other versions of the code implementation.)

1.RS 

Taken from Robert Sedgewick's book Algorithms in C. I (the original author) have added some simple optimizations to speed up the hashing process.


public long RSHash(String str)
{
    int b = 378551;
    int a = 63689;
    long hash = 0;

    for (int i = 0; i < str.length(); i++)
    {
        hash = hash * a + str.charAt(i);
        a = a * b;
    }
    return hash;
}

 

2.JS

A bit manipulation hash function written by Justin Sobel.


public long JSHash(String str)
{
    long hash = 1315423911;

    for (int i = 0; i < str.length(); i++)
    {
        hash ^= ((hash << 5) + str.charAt(i) + (hash >> 2));
    }
    return hash;
}

 

3.PJW 

This hash algorithm is based on work by Peter J. Weinberger of Bell Labs. The book Compilers: Principles, Techniques, and Tools recommends the hashing method of this algorithm for hash functions.


public long PJWHash(String str)
{
    long BitsInUnsignedInt = (long)(4 * 8);
    long ThreeQuarters     = (long)((BitsInUnsignedInt * 3) / 4);
    long OneEighth         = (long)(BitsInUnsignedInt / 8);
    long HighBits          = (long)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
    long hash              = 0;
    long test              = 0;

    for (int i = 0; i < str.length(); i++)
    {
        hash = (hash << OneEighth) + str.charAt(i);
        if ((test = hash & HighBits) != 0)
        {
            hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
        }
    }
    return hash;
}

 

4.ELF 

It is very similar to PJW and is widely used on Unix systems.


public long ELFHash(String str)
{
    long hash = 0;
    long x    = 0;

    for (int i = 0; i < str.length(); i++)
    {
        hash = (hash << 4) + str.charAt(i);
        if ((x = hash & 0xF0000000L) != 0)
        {
            hash ^= (x >> 24);
        }
        hash &= ~x;
    }
    return hash;
}

 

5.BKDR

This algorithm comes from The C Programming Language by Brian Kernighan and Dennis Ritchie. It is a very simple hash algorithm that uses a series of odd-looking seeds of the form 31, 131, 1313, 13131, ..., and it looks very similar to the DJB algorithm. (See my earlier blog post: this is essentially Java's string hash function.)


public long BKDRHash(String str)
{
    long seed = 131; // 31 131 1313 13131 131313 etc..
    long hash = 0;

    for (int i = 0; i < str.length(); i++)
    {
        hash = (hash * seed) + str.charAt(i);
    }
    return hash;
}

 

6.SDBM

This algorithm is used in the open-source SDBM project, and it seems to achieve a good distribution for many different kinds of data.


public long SDBMHash(String str)
{
    long hash = 0;

    for (int i = 0; i < str.length(); i++)
    {
        hash = str.charAt(i) + (hash << 6) + (hash << 16) - hash;
    }
    return hash;
}

 

7.DJB

This algorithm was invented by Professor Daniel J. Bernstein and is one of the most effective published string hash functions.


public long DJBHash(String str)
{
    long hash = 5381;

    for (int i = 0; i < str.length(); i++)
    {
        hash = ((hash << 5) + hash) + str.charAt(i);
    }
    return hash;
}

 

8.DEK

It is given by the great Knuth in Chapter 6, Sorting and Searching, of The Art of Computer Programming, Volume 3.


public long DEKHash(String str)
{
    long hash = str.length();

    for (int i = 0; i < str.length(); i++)
    {
        hash = ((hash << 5) ^ (hash >> 27)) ^ str.charAt(i);
    }
    return hash;
}

 

9.AP

A hybrid additive and bit-operation hash designed by Arash Partow, which alternates two different mixing steps for elements at even and odd positions.

public long APHash(String str)
{
    long hash = 0xAAAAAAAA;

    for (int i = 0; i < str.length(); i++)
    {
        if ((i & 1) == 0)
        {
            hash ^= ((hash << 7) ^ str.charAt(i) * (hash >> 3));
        }
        else
        {
            hash ^= (~((hash << 11) + str.charAt(i) ^ (hash >> 5)));
        }
    }
    return hash;
}

10. MurmurHash

MurmurHash was created by Austin Appleby; as mentioned at the start of this article, it is the hash function Redis uses internally. Below is MurmurHash64B, a 64-bit variant (in C) that builds its result from two 32-bit states.

unsigned long long MurmurHash64B ( const void * key, int len, unsigned int seed )
{
    const unsigned int m = 0x5bd1e995;
    const int r = 24;

    unsigned int h1 = seed ^ len;
    unsigned int h2 = 0;

    const unsigned int * data = (const unsigned int *)key;

    while(len >= 8)
    {
        unsigned int k1 = *data++;
        k1 *= m; k1 ^= k1 >> r; k1 *= m;
        h1 *= m; h1 ^= k1;
        len -= 4;

        unsigned int k2 = *data++;
        k2 *= m; k2 ^= k2 >> r; k2 *= m;
        h2 *= m; h2 ^= k2;
        len -= 4;
    }

    if(len >= 4)
    {
        unsigned int k1 = *data++;
        k1 *= m; k1 ^= k1 >> r; k1 *= m;
        h1 *= m; h1 ^= k1;
        len -= 4;
    }

    switch(len) /* intentional fall-through for the trailing bytes */
    {
    case 3: h2 ^= ((unsigned char*)data)[2] << 16;
    case 2: h2 ^= ((unsigned char*)data)[1] << 8;
    case 1: h2 ^= ((unsigned char*)data)[0];
            h2 *= m;
    };

    h1 ^= h2 >> 18; h1 *= m;
    h2 ^= h1 >> 22; h2 *= m;
    h1 ^= h2 >> 17; h1 *= m;
    h2 ^= h1 >> 19; h2 *= m;

    unsigned long long h = h1;

    h = (h << 32) | h2;

    return h;
}


Origin blog.csdn.net/qq_20853741/article/details/111995366