Detailed explanation of the principle of hash algorithm

A hash table is designed for fast access and is a typical "space for time" trade-off. It can be understood as a linear table whose elements are not packed tightly together; there may be gaps between them.

A hash table is a data structure that is accessed directly by key. It maps each key to a location in the table so that records can be looked up quickly. The mapping function is called a hash function, and the array that holds the records is called the hash table.

For example, to store 70 elements we might allocate space for 100. The ratio 70/100 = 0.7 is called the load factor; the spare space is there for the sake of fast access. The storage location of each element is determined by a fixed function H whose results are as random and evenly distributed as possible, so that a linear traversal is avoided and access is fast. This randomness, however, inevitably leads to collisions. A collision occurs when the hash function H yields the same address for two elements; the two elements are then called "synonyms". It is like 70 people walking into a restaurant with 100 chairs: two of them may head for the same seat. The result of the hash function is the address of a storage unit, and each storage unit is called a "bucket". If a hash table has m buckets, the range of the hash function should be [0, m-1].
Resolving collisions is a complex problem. How often collisions occur depends mainly on:
(1) The hash function. The values of a good hash function should be as evenly distributed as possible.
(2) The method used to handle collisions.
(3) The load factor. A higher load factor makes collisions more likely, while a very low one wastes a great deal of space. The appropriate load factor also depends on the hash function.
Methods for resolving collisions:
(1) Linear probing: after a collision, probe forward one position at a time until the nearest empty slot is found. The drawback is clustering: keys that are not synonyms also end up in the probe sequence, which slows down access.
(2) Double hashing: after a collision at position d, use a second hash function to generate a number c that is relatively prime to the number of buckets m, then try (d + n*c) % m for n = 1, 2, ... in turn, so that the probe sequence jumps across the table.
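The two probe sequences can be compared with a minimal sketch. The class and method names are illustrative, and the values d = 5, c = 3, m = 7 are chosen only for the example:

```java
import java.util.Arrays;

// Sketch of the two probing strategies; all names here are illustrative.
class ProbeSequences {
    // Linear probing: d, d+1, d+2, ... (mod m).
    static int[] linearProbe(int d, int m, int count) {
        int[] seq = new int[count];
        for (int n = 0; n < count; n++) {
            seq[n] = (d + n) % m;
        }
        return seq;
    }

    // Double hashing: d, d+c, d+2c, ... (mod m), where c is relatively prime to m.
    static int[] doubleHashProbe(int d, int c, int m, int count) {
        int[] seq = new int[count];
        for (int n = 0; n < count; n++) {
            seq[n] = (d + n * c) % m;
        }
        return seq;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(linearProbe(5, 7, 7)));        // [5, 6, 0, 1, 2, 3, 4]
        System.out.println(Arrays.toString(doubleHashProbe(5, 3, 7, 7))); // [5, 1, 4, 0, 3, 6, 2]
    }
}
```

Because c and m share no common factor, the double-hashing sequence still visits every bucket, only in a scattered order rather than in one contiguous run.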
Commonly used methods of constructing hash functions

　　Hash functions make access to a collection of data more efficient, because data elements can be located faster. Commonly used construction methods are:

　　1. Direct addressing method: take the key itself, or a linear function of the key, as the hash address. That is, H(key) = key or H(key) = a·key + b, where a and b are constants (such a hash function is called a self function).

　　2. Digit analysis method: analyze the data set in advance. Take the birth dates of a group of employees: the leading digits (the year) are roughly the same, so using them would make collisions likely, whereas the trailing digits (the month and day) differ widely. Building the hash address from the trailing digits significantly reduces the probability of collision. In general, the digit analysis method looks for regularities in the keys and uses the most variable digits to construct hash addresses with a low collision probability.

　　3. Mid-square method: square the key and take the middle digits of the result as the hash address.

　　4. Folding method: split the key into several parts with the same number of digits (the last part may have a different number), then add the parts together and discard the carry to obtain the hash address.

　　5. Random number method: choose a random function and take the random value of the key as the hash address. It is usually used when keys have different lengths.

　　6. Division (remainder) method: take the remainder of the key divided by some number p, where p is not greater than the table length m, as the hash address. That is, H(key) = key MOD p, p <= m. The modulus can be applied to the key directly or after folding, squaring, or other operations. The choice of p is very important: p is generally a prime number or m itself. A poorly chosen p easily produces synonyms.
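Three of the construction methods above (mid-square, folding, and division) can be sketched as follows. The method names, the use of decimal digits, and the digit widths are assumptions made for illustration, not part of the original description:

```java
// Illustrative versions of three construction methods; names and digit widths are assumptions.
class HashConstruction {
    // Division (remainder) method: H(key) = key MOD p, with p <= m.
    static int divisionHash(int key, int p) {
        return Math.floorMod(key, p); // floorMod keeps the result in [0, p-1] even for negative keys
    }

    // Mid-square method: square the key and keep the middle `width` decimal digits.
    static int midSquareHash(int key, int width) {
        String digits = Long.toString((long) key * key);
        if (digits.length() <= width) {
            return Integer.parseInt(digits);
        }
        int start = (digits.length() - width) / 2;
        return Integer.parseInt(digits.substring(start, start + width));
    }

    // Folding method: split the decimal digits into groups of `width`,
    // add the groups, and drop any carry beyond `width` digits.
    static int foldingHash(long key, int width) {
        String digits = Long.toString(Math.abs(key));
        int modulus = 1;
        for (int i = 0; i < width; i++) {
            modulus *= 10;
        }
        int sum = 0;
        for (int i = 0; i < digits.length(); i += width) {
            int end = Math.min(i + width, digits.length());
            sum += Integer.parseInt(digits.substring(i, end));
        }
        return sum % modulus;
    }

    public static void main(String[] args) {
        System.out.println(divisionHash(100, 13));        // 9
        System.out.println(midSquareHash(1234, 3));       // 1234^2 = 1522756, middle digits 227
        System.out.println(foldingHash(123456789L, 3));   // 123 + 456 + 789 = 1368, carry dropped: 368
    }
}
```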
Lookup performance analysis

　　Looking up a key in a hash table follows essentially the same steps as building the table. Some keys are found directly at the address computed by the hash function; others collide at that address and must be located by following the collision-handling method. With any of the collision-handling methods introduced above, the search after a collision is still a matter of comparing the given value with stored keys. The search efficiency of a hash table is therefore still measured by the average search length.

　　During a search, the number of key comparisons depends on the number of collisions: the fewer the collisions, the higher the search efficiency, and the more collisions, the lower the efficiency. The factors that affect the number of collisions are therefore also the factors that affect search efficiency. There are three of them:

　　1. Whether the hash function is uniform;

　　2. Methods of dealing with conflicts;

　　3. The filling factor of the hash table.

　　The filling factor (also called the load factor) of the hash table is defined as: α = number of elements stored in the table / length of the hash table

　　α indicates how full the hash table is. Since the table length is fixed, α is proportional to the number of elements stored in the table: the larger α is, the more elements the table holds and the more likely collisions become; the smaller α is, the fewer elements it holds and the less likely collisions are.

　　In fact, the average search length of a hash table is a function of the filling factor α, and each collision-handling method gives a different function.
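To make the dependence on α concrete, the sketch below evaluates the standard textbook approximations for a successful search: roughly (1 + 1/(1 − α))/2 for linear probing and (1/α)·ln(1/(1 − α)) for double hashing. These formulas are not given in the text above; they are the usual idealized estimates and assume a uniform hash function. The class and method names are illustrative.

```java
// Approximate average search length (successful search) as a function of the load factor.
// The formulas are the standard idealized estimates, assumed here for illustration.
class AverageSearchLength {
    static double loadFactor(int elements, int tableLength) {
        return (double) elements / tableLength;
    }

    // Linear probing: about (1 + 1/(1 - alpha)) / 2 comparisons.
    static double linearProbingHit(double alpha) {
        return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
    }

    // Double hashing: about (1/alpha) * ln(1/(1 - alpha)) comparisons.
    static double doubleHashingHit(double alpha) {
        return -Math.log(1.0 - alpha) / alpha;
    }

    public static void main(String[] args) {
        double alpha = loadFactor(70, 100); // the 70-in-100 example, alpha = 0.7
        System.out.printf("linear probing: %.2f%n", linearProbingHit(alpha)); // about 2.17
        System.out.printf("double hashing: %.2f%n", doubleHashingHit(alpha)); // about 1.72
    }
}
```

Both estimates grow without bound as α approaches 1, which is why open-addressing tables are usually kept well below full.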

　　With the basic definitions covered, some well-known hash algorithms deserve a mention. MD5 and SHA-1 have long been among the most widely used hash algorithms, and both are designed on the basis of MD4. So what are they?

　　Here is a brief description:

　　(1) MD4

　　MD4 (RFC 1320) was designed by Ronald L. Rivest of MIT in 1990; MD stands for Message Digest. It is suited to fast software implementation on 32-bit processors, since it is built on bit operations over 32-bit operands.

　　(2) MD5

　　MD5 (RFC 1321) is Rivest's improved version of MD4, published in 1991. It still processes the input in 512-bit blocks, and its output is the concatenation of four 32-bit words, the same as MD4. MD5 is more complex than MD4 and slightly slower, but it is more secure and holds up better against analysis and differential attacks.

　　(3) SHA-1 and others

　　SHA-1 was designed by NIST and the NSA for use with DSA. It produces a 160-bit hash value for inputs shorter than 2^64 bits, so it is more resistant to brute-force attacks. SHA-1 is based on the same design principles as MD4 and is modeled on that algorithm.

　　Collisions are inevitable in hash tables: different keys may yield the same hash address, that is, key1 ≠ key2 but hash(key1) = hash(key2). Building a hash table therefore requires not only a good hash function but also a method for handling collisions. A hash table can be described as follows: given a hash function H(key) and a chosen collision-handling method, a set of keys is mapped onto a finite, contiguous set of addresses (an interval), and the "image" of each key in that address set is used as the storage location of the corresponding record. The resulting table is called a hash table.
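A concrete instance of key1 ≠ key2 with hash(key1) = hash(key2) is easy to find with Java's built-in String.hashCode(). The sketch below uses the strings "Aa" and "BB" purely as an illustration; the helper name collides is hypothetical:

```java
// Two different keys with the same hash value under String.hashCode().
class CollisionDemo {
    static boolean collides(String key1, String key2) {
        return !key1.equals(key2) && key1.hashCode() == key2.hashCode();
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());      // 65 * 31 + 97 = 2112
        System.out.println("BB".hashCode());      // 66 * 31 + 66 = 2112
        System.out.println(collides("Aa", "BB")); // true
    }
}
```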

　　For a dynamic lookup table, 1) the length of the table is not known in advance, and 2) when the table is designed, only the range the keys belong to is known, not the exact keys. In general, therefore, a functional relationship has to be established so that f(key) gives the position in the table of the record with that key. This function f(key) is usually called a hash function. (Note: it is not necessarily a mathematical function.)

　　A hash function is a mapping from a set of keys to a set of addresses. It can be defined very flexibly, as long as the size of the address set stays within the allowed range.

　　In practice, a hash function has to be constructed for the data at hand, and a hash table only performs well if the function is well constructed.

　　So what are these Hash algorithms used for?

　　The application of Hash algorithm in information security is mainly reflected in the following three aspects:

　　(1) File verification

　　The check algorithms we are most familiar with are the parity check and the CRC check. Both can detect, and to some extent correct, channel errors in data transmission, but neither can resist tampering: they offer no protection against malicious modification of the data.

　　The "digital fingerprint" property of the MD5 hash algorithm has made it a very widely used file integrity checksum algorithm. Many Unix systems provide a command for computing MD5 checksums.

　　(2) Digital signature

　　Hash algorithms are also an important part of modern cryptography. Because asymmetric algorithms are slow, one-way hash functions play an important role in digital signature protocols: signing the hash value (also called the "digital digest") can be regarded as statistically equivalent to signing the file itself. Such a protocol has other advantages as well.

　　(3) Authentication Protocol

　　Hash-based authentication is also known as the challenge-response mode: it is a simple and secure method when the transmission channel can be eavesdropped on but not tampered with.
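The text does not spell out the protocol steps, so the following is only one plausible sketch of a hash-based challenge-response exchange: the server sends a random challenge, the client returns the hash of the challenge concatenated with a shared secret, and the server recomputes and compares. The class, the method names, the choice of SHA-256, and the concatenation order are all assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical challenge-response sketch; not a vetted authentication protocol.
class ChallengeResponse {
    // Server side: generate a fresh random challenge.
    static byte[] newChallenge() {
        byte[] challenge = new byte[16];
        new SecureRandom().nextBytes(challenge);
        return challenge;
    }

    // Client side: answer with hash(challenge || secret); the secret itself is never sent.
    static byte[] respond(byte[] challenge, String secret) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(challenge);
            md.update(secret.getBytes(StandardCharsets.UTF_8));
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Server side: recompute the expected answer and compare.
    static boolean verify(byte[] challenge, String secret, byte[] response) {
        return MessageDigest.isEqual(respond(challenge, secret), response);
    }

    public static void main(String[] args) {
        byte[] challenge = newChallenge();
        byte[] answer = respond(challenge, "shared-secret");
        System.out.println(verify(challenge, "shared-secret", answer));
    }
}
```

An eavesdropper sees only the challenge and the digest, and because each challenge is fresh, replaying an old answer does not help.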

File hash

　　MD5 hash: the digital digest of a file is computed by a hash function. Regardless of the file's length, the hash function produces a fixed-length value. Unlike an encryption algorithm, a hash algorithm is an irreversible one-way function. With a high-security hash algorithm such as MD5 or SHA, it is almost impossible for two different files to produce the same hash result by accident, so any modification to a file can be detected.
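As a minimal sketch of computing such a digest, the example below uses the standard java.security.MessageDigest class. The helper name hexDigest is hypothetical, and a short byte array stands in for the file contents:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Computes a fixed-length hexadecimal digest of arbitrary input bytes.
class DigestDemo {
    static String hexDigest(String algorithm, byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] modified = "abd".getBytes(StandardCharsets.UTF_8);
        System.out.println(hexDigest("MD5", original)); // 900150983cd24fb0d6963f7d28e17f72
        System.out.println(hexDigest("MD5", modified)); // a different 32-character digest
    }
}
```

For a real file, the bytes could be read with java.nio.file.Files.readAllBytes before being passed to the same helper.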

The term "hash function" also has a broader meaning: a function that maps a large range onto a small one, usually to save space and make data easier to store. Hash functions are also widely used for lookup. Before using a hash function, you need to understand several of its limitations:

The basic principle of hashing is to map a large range onto a small one; the number of values you actually store must therefore be no larger than the small range, or there will be many collisions.
Because a hash approximates a one-way function, it can be used to protect data in cryptographic applications.
Different applications place different requirements on the hash function: a hash function used for cryptography is judged mainly by how closely it approximates a one-way function, whereas a hash function used for lookup is judged mainly by its collision rate when mapping onto the small range.
Hash functions for cryptography have been discussed extensively elsewhere, and the author's blog has a more detailed introduction. This article therefore discusses only hash functions used for lookup.
The input to a hash function is mainly an array (for example, a string), and its output is generally an int. The explanations below follow this convention.
Generally speaking, Hash functions can be simply divided into the following categories:
Additive Hash;
Bit operation Hash;
Multiplication Hash;
Division Hash;
Look up table Hash;
Mixed Hash.
The following sections describe how each of these methods is used in practice.
1. Additive Hash
An additive hash adds the input elements together one by one to form the final result. The standard additive hash is constructed as follows:

static int additiveHash(String key, int prime)
{
    int hash, i;
    // start from the key length, then add each character code
    for (hash = key.length(), i = 0; i < key.length(); i++)
        hash += key.charAt(i);
    return (hash % prime);
}
Here prime is any prime number; the result falls in the range [0, prime-1].
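One weakness of a purely additive hash is that it ignores character order, so any two anagrams collide. The sketch below repeats the additive hash in a self-contained class (the class name is illustrative) to show this:

```java
// Additive hash: keys with the same characters in a different order always collide.
class AdditiveHashDemo {
    static int additiveHash(String key, int prime) {
        int hash = key.length();
        for (int i = 0; i < key.length(); i++) {
            hash += key.charAt(i);
        }
        return hash % prime;
    }

    public static void main(String[] args) {
        System.out.println(additiveHash("abc", 31)); // (3 + 97 + 98 + 99) % 31 = 18
        System.out.println(additiveHash("cba", 31)); // 18 as well
    }
}
```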

2. Bit Operation Hash
This type of hash function mixes the input elements sufficiently by using various bit operations (common ones are shift and XOR). For example, the standard rotating hash is constructed as follows:

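static int rotatingHash(String key, int prime)
{
    int hash, i;
    for (hash = key.length(), i = 0; i < key.length(); i++)
        hash = (hash << 4) ^ (hash >>> 28) ^ key.charAt(i);  // rotate left by 4 bits, then mix in the character
    return ((hash & 0x7FFFFFFF) % prime);                    // clear the sign bit so the result stays in [0, prime-1]
}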
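In Java the rotation can also be written with the standard Integer.rotateLeft method. The sketch below (illustrative class name, same rotate-by-4 scheme) shows that, unlike a purely additive hash, a rotating hash is sensitive to character order:

```java
// Rotating hash written with Integer.rotateLeft; reordering the characters changes the result.
class RotatingHashDemo {
    static int rotatingHash(String key, int prime) {
        int hash = key.length();
        for (int i = 0; i < key.length(); i++) {
            hash = Integer.rotateLeft(hash, 4) ^ key.charAt(i);
        }
        return (hash & 0x7FFFFFFF) % prime; // clear the sign bit before reducing
    }

    public static void main(String[] args) {
        System.out.println(rotatingHash("abc", 31)); // 19
        System.out.println(rotatingHash("cba", 31)); // 1
    }
}
```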
The main feature of this type of hash function is that it shifts first and then applies various bit operations. The line that computes the hash in rotatingHash can, for example, be replaced with any of the following variants:

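// Variant 1
hash = (hash << 5) ^ (hash >>> 27) ^ key.charAt(i);
// Variant 2
hash += key.charAt(i);
hash += (hash << 10);
hash ^= (hash >> 6);
// Variant 3
if ((i & 1) == 0)
{
    hash ^= (hash << 7) ^ key.charAt(i) ^ (hash >> 3);
}
else
{
    hash ^= ~((hash << 11) ^ key.charAt(i) ^ (hash >> 5));
}
// Variant 4
hash += (hash << 5) + key.charAt(i);
// Variant 5
hash = key.charAt(i) + (hash << 6) + (hash >> 16) - hash;
// Variant 6
hash ^= ((hash << 5) + key.charAt(i) + (hash >> 2));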
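The add, shift-add, shift-xor update among the variants above is the inner loop of Bob Jenkins's one-at-a-time hash. A complete version, as a sketch, adds three finalization steps after the loop. It operates on bytes and uses the unsigned shift `>>>` so that the arithmetic matches the usual unsigned C formulation; the class name is illustrative:

```java
import java.nio.charset.StandardCharsets;

// Sketch of the one-at-a-time hash: per-byte mixing followed by a final avalanche.
class OneAtATime {
    static int hash(byte[] key) {
        int h = 0;
        for (byte b : key) {
            h += (b & 0xff);   // treat each byte as unsigned
            h += (h << 10);
            h ^= (h >>> 6);
        }
        h += (h << 3);
        h ^= (h >>> 11);
        h += (h << 15);
        return h;
    }

    public static void main(String[] args) {
        byte[] key = "a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(hash(key))); // ca2e9442
    }
}
```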
3. Multiplication Hash
This type of hash function exploits the decorrelating effect of multiplication (the best-known use of this property is the random number generation algorithm that squares a number and keeps part of the digits, although that algorithm does not work well). For example,

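static int bernstein(String key)
{
    int hash = 0;
    int i;
    for (i = 0; i < key.length(); i++)
        hash = 33 * hash + key.charAt(i);  // multiply the running hash by 33, then add the character
    return hash;
}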
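The multiplier can be made a parameter. In the sketch below (illustrative names), a multiplier of 33 gives the Bernstein form, while a multiplier of 31 reproduces Java's documented String.hashCode() polynomial:

```java
// Multiplication hash with a configurable multiplier.
class MultiplierHash {
    static int hash(String key, int multiplier) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = multiplier * h + key.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("ab", 33));                         // 33 * 97 + 98 = 3299
        System.out.println(hash("hello", 31) == "hello".hashCode()); // true
    }
}
```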
The hashCode() method of the String class in JDK 5.0 also uses a multiplication hash, with 31 as the multiplier. Recommended multipliers include 131, 1313, 13131, 131313, and so on. Well-known hash functions that use this method include:

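// 32-bit FNV algorithm
int M_SHIFT = 0;
int M_MASK = 0xFFFFFFFF;  // not defined in the original listing; assumed here to keep all 32 bits
public int FNVHash(byte[] data)
{
    int hash = (int)2166136261L;
    for (byte b : data)
        hash = (hash * 16777619) ^ (b & 0xff);  // multiply by the FNV prime, then XOR in the unsigned byte
    if (M_SHIFT == 0)
        return hash;
    return (hash ^ (hash >> M_SHIFT)) & M_MASK;
}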
and the improved FNV algorithm:

public static int FNVHash1(String data)
{
    final int p = 16777619;
    int hash = (int)2166136261L;
    for (int i = 0; i < data.length(); i++)
        hash = (hash ^ data.charAt(i)) * p;
    // extra mixing steps applied after the main loop
    hash += hash << 13;
    hash ^= hash >> 7;
    hash += hash << 3;
    hash ^= hash >> 17;
    hash += hash << 5;
    return hash;
}
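For comparison, the XOR-then-multiply loop without the extra mixing steps is the plain 32-bit FNV-1a hash. The sketch below works on bytes and can be checked against the published FNV-1a values for the empty input and for the single byte "a"; the class name is illustrative:

```java
import java.nio.charset.StandardCharsets;

// Plain 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
class Fnv1a32 {
    static int hash(byte[] data) {
        int h = 0x811c9dc5;      // 32-bit FNV offset basis (2166136261)
        for (byte b : data) {
            h ^= (b & 0xff);     // treat each byte as unsigned
            h *= 16777619;       // 32-bit FNV prime
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] key = "a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(hash(key))); // e40c292c
    }
}
```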
Besides multiplying by a fixed number, it is also common to multiply by a number that keeps changing, for example:

static int RSHash(String str)
{
    int b = 378551;
    int a = 63689;
    int hash = 0;

    for (int i = 0; i < str.length(); i++)
    {
        hash = hash * a + str.charAt(i);
        a    = a * b;
    }
    return (hash & 0x7FFFFFFF);
}
Although Adler32 is not as widely used as CRC32, it may be the best-known multiplication hash. For an introduction, see the RFC 1950 specification.

4. Division Hash
Division, like multiplication, also appears to decorrelate its inputs. However, because division is too slow, this approach has almost no real application. Note that dividing the hash result by a prime, as seen earlier, only serves to constrain the range of the result. If you do not need to limit the range, you can replace "hash%prime" with the following code: hash = hash ^ (hash>>10) ^ (hash>>20).
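The shift-and-XOR replacement can be sketched as a small helper. The fold step follows the expression above, written with the unsigned shift `>>>`. The index helper goes one step beyond the text: it assumes a table whose size is a power of two and masks the folded value instead of dividing. Both names are illustrative.

```java
// Folding high bits into low bits instead of reducing with a prime modulus.
class FoldInsteadOfModulo {
    // Mix the upper bits of the hash into the lower bits.
    static int fold(int hash) {
        return hash ^ (hash >>> 10) ^ (hash >>> 20);
    }

    // Assumption: tableSize is a power of two, so a bit mask selects the bucket.
    static int index(int hash, int tableSize) {
        return fold(hash) & (tableSize - 1);
    }

    public static void main(String[] args) {
        System.out.println(fold(1 << 20));      // 1048576 ^ 1024 ^ 1 = 1049601
        System.out.println(index(1 << 20, 16)); // 1: the high bit now influences the bucket
    }
}
```

Without the fold, a key that differs only in bit 20 would land in the same bucket as zero when masked to 16 buckets; with it, the high bit reaches the low bits.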
5. Lookup Table Hash
The most famous example of a lookup-table hash is the CRC family of algorithms. The CRC algorithms are not inherently table lookups, but a lookup table is one of the fastest ways to implement them. Below is a table-driven implementation of CRC32:

static int[] crctab = {
0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f, 0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988, 0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2, 0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7, 0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9, 0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172, 0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c, 0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59, 0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423, 0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924, 0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106, 0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5,0xe8b8d433, 0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d, 0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e, 0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950, 0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65, 0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7, 0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0, 0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa, 0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f, 0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81, 0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a, 0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84, 0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d,0x0a00ae27, 0x7d079eb1, 0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb,
0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 0x10da7a5a, 0x67dd4acc, 0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e, 0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b, 0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55, 0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236, 0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28, 0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d, 0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f, 0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38, 0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242, 0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777, 0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff,0xf862ae69, 0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2, 0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc, 0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9, 0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693, 0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94, 0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d
};
int crc32(String key, int hash)
{
    int i;
    for (hash = key.length(), i = 0; i < key.length(); i++)
        hash = (hash >>> 8) ^ crctab[(hash ^ key.charAt(i)) & 0xff];  // mask the index so it stays within the 256-entry table
    return hash;
}
Other well-known examples of lookup-table hashing are Universal Hashing and Zobrist Hashing; their tables are randomly generated.
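As a rough sketch of the randomly generated table idea, Zobrist hashing assigns one random number to every (position, piece) pair and XORs together the numbers for the pairs that are present. The class below is illustrative; the board dimensions, the seed, and the method names are assumptions.

```java
import java.util.Random;

// Zobrist-style hashing sketch: a random table, combined with XOR.
class ZobristSketch {
    private final long[][] table;

    ZobristSketch(int squares, int pieces, long seed) {
        Random rng = new Random(seed);
        table = new long[squares][pieces];
        for (long[] row : table) {
            for (int p = 0; p < pieces; p++) {
                row[p] = rng.nextLong();
            }
        }
    }

    // XOR-ing the same entry twice cancels out, so this both adds and removes a piece.
    long toggle(long hash, int square, int piece) {
        return hash ^ table[square][piece];
    }

    public static void main(String[] args) {
        ZobristSketch z = new ZobristSketch(64, 12, 42L);
        long h = z.toggle(0L, 3, 5);            // place piece 5 on square 3
        System.out.println(z.toggle(h, 3, 5));  // remove it again: 0
    }
}
```

Because XOR is its own inverse, the hash can be updated incrementally as pieces move instead of being recomputed from scratch.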

6. Mixed Hash
A mixed hash algorithm combines the methods above. Many common hash algorithms, such as MD5 and Tiger, fall into this category. They are rarely used as lookup-oriented hash functions.
