Understanding Hash Tables from First Principles

Hash

What is it?

A hash table (also called a hash map) is a data structure that stores data as key-value pairs. Storing data in "key-value" form means that each key corresponds to a unique location in memory: given the key you are looking for, you can quickly find its associated value. A hash table can be understood as a generalized array whose subscript may be a large integer, a floating-point number, a string, or even a struct.

Why does it exist?

Sometimes the key space is orders of magnitude larger than the actual problem space, so addressing by key directly would waste enormous space. Instead, we use buckets, each of which stores an entry directly or points to one indirectly.


Pros and cons?

Advantages:
>>>Space utilization:

With a problem space of size N, a key space of size R, and a bucket array (hash table) of capacity M:

N < M << R

Space = O(N + M) = O(N)

Keeping M on the same order as N makes space utilization far better than allocating the whole key space.

>>>Constant search time:

Because the hash table is accessed by value (the key maps straight to an address), a lookup takes only O(1) expected time.

Disadvantages:
>>>Collisions:

hash(key) = key % M

Collisions are inevitable, but we can minimize them by working in two directions:

  1. Carefully design the hash table and hash function to reduce the probability of collision as much as possible;
  2. Prepare feasible schemes so that collisions, when they occur, are resolved as quickly as possible.

How to use it?

>>>Access by value
>>>Design of hash function:
  • Division method (modulo): hash(key) = key % M

M should be a prime number. If keys arrive with stride step and gcd(step, M) = G, the probe footprint covers the entire hash table if and only if G == 1. Since the stride cannot be known in advance, choosing M prime guarantees this for every nonzero stride.
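A quick numeric check of this claim (a sketch; `coverage` is an illustrative helper, not from the text):

```python
from math import gcd

def coverage(step, M):
    """Buckets reached starting from 0 with stride `step`, modulo M."""
    return {(i * step) % M for i in range(M)}

# gcd(step, M) == 1: the footprint spans the entire table
assert gcd(3, 11) == 1 and len(coverage(3, 11)) == 11
# gcd(step, M) == G > 1: only M // G buckets are ever reached
assert gcd(4, 12) == 4 and len(coverage(4, 12)) == 12 // 4
```

With a prime M, every stride not a multiple of M is automatically coprime to it, which is exactly why the table length is chosen prime.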

  • MAD method (multiply-add-divide): hash(key) = ( a * key + b ) % M

The division method has two flaws.

1. It has a fixed point: regardless of the table length M, hash(0) ≡ 0.

2. It is only zeroth-order uniform: the keys in [0, R) are distributed evenly across the M buckets, but adjacent keys are hashed to adjacent addresses.

The MAD method fixes both. Take M prime, a > 0, b > 0, a % M != 0, and let hash(key) = (a * key + b) % M: the offset b removes the fixed point, and the factor a spreads adjacent keys apart.
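A minimal sketch of the MAD method (the constants M, a, b below are illustrative choices, not prescribed by the text):

```python
def mad_hash(key, M=97, a=31, b=17):
    """MAD: hash(key) = (a*key + b) % M, with M prime, a % M != 0, b > 0."""
    return (a * key + b) % M

# No fixed point: hash(0) is b % M rather than 0
assert mad_hash(0) == 17
# Adjacent keys land a buckets apart instead of next to each other
assert (mad_hash(8) - mad_hash(7)) % 97 == 31
```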

  • Mid-square: take the middle digits of key² as the address

Principle: squaring can be decomposed into a series of left shifts and additions, much like fast exponentiation; for example, 13² = 13 + (13 << 2) + (13 << 3) = 169. Ignoring carries, each digit of the square is the sum of several copies of the original digits: the digits near either end are sums of few original digits, while the digits near the middle are sums of many. Taking some of the middle digits therefore lets every digit of the original key influence the final address roughly equally.
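The shift decomposition above, plus a bit-level mid-square sketch (the `bits` parameter and the exact middle-extraction rule are illustrative assumptions):

```python
def mid_square_hash(key, bits=8):
    """Square the key and keep the middle `bits` bits as the address."""
    sq = key * key
    drop = max((sq.bit_length() - bits) // 2, 0)  # shed ~equal bits on each side
    return (sq >> drop) & ((1 << bits) - 1)

# The text's decomposition of 13^2 into shifts and adds:
assert 13 * 13 == 13 + (13 << 2) + (13 << 3) == 169

# Addresses always fall in [0, 2^bits)
assert all(0 <= mid_square_hash(k) < 256 for k in range(1000))
```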

  • Polynomial method:

    hash( s = x_0, x_1, x_2, ···, x_{n-1} ) = x_0·a^{n-1} + x_1·a^{n-2} + ··· + x_{n-2}·a + x_{n-1}

    The Karp-Rabin algorithm ("a string is a number") coincides with this idea.
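A sketch of the polynomial hash in Horner form, plus the Karp-Rabin rolling update (the base A = 31 and the Mersenne-prime modulus are illustrative choices):

```python
A, MOD = 31, (1 << 61) - 1  # illustrative base and large prime modulus

def poly_hash(s):
    """hash(x0..x_{n-1}) = x0*A^{n-1} + x1*A^{n-2} + ... + x_{n-1} (mod MOD)."""
    h = 0
    for ch in s:                 # Horner's rule: one multiply-add per character
        h = (h * A + ord(ch)) % MOD
    return h

def roll(h, out_ch, in_ch, n):
    """Karp-Rabin: slide a window of length n by one character in O(1)."""
    h = (h - ord(out_ch) * pow(A, n - 1, MOD)) % MOD  # drop the leading term
    return (h * A + ord(in_ch)) % MOD                 # append the new character

# Rolling from "abc" to "bcd" matches recomputing from scratch
assert roll(poly_hash("abc"), "a", "d", 3) == poly_hash("bcd")
```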

  • More hash functions: digit analysis (selecting digits), folding, bitwise XOR, the pseudo-random number method (use this one with caution), etc.

>>>Resolving collisions:
  • Separate chaining (linked-list chaining)

Each bucket stores a pointer; colliding entries form a linked list behind it. This is a closed-addressing strategy: each bucket only ever stores entries that hash to that bucket.

Advantages:

1. No need to reserve multiple slots per bucket

2. Any number of collisions can be resolved

3. Deletion is simple and uniform

But:

1. The pointers need extra space

2. Nodes must be allocated dynamically (a time cost roughly two orders of magnitude higher than an ordinary operation)

3. Most importantly, the system cache is almost useless! A search within a bucket walks the corresponding list in order, but the nodes were inserted and deleted in essentially random order, so the nodes of any given list are rarely contiguous in physical memory, and the cache cannot be used to speed up the search. When the hash table is very large and I/O is involved, this problem becomes even more pronounced.
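A minimal separate-chaining table illustrating the points above (a sketch, not the course's implementation; Python lists stand in for the linked lists):

```python
class ChainedHashTable:
    """Each bucket holds a chain of (key, value) pairs that hash to it."""

    def __init__(self, M=13):
        self.buckets = [[] for _ in range(M)]
        self.M = M

    def _chain(self, key):
        return self.buckets[hash(key) % self.M]

    def put(self, key, value):
        chain = self._chain(key)
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)   # overwrite an existing key
                return
        chain.append((key, value))        # any number of collisions fits

    def get(self, key):
        for k, v in self._chain(key):
            if k == key:
                return v
        return None

    def delete(self, key):                # simple, uniform removal
        chain = self._chain(key)
        chain[:] = [(k, v) for k, v in chain if k != key]

t = ChainedHashTable(M=4)
for k in (1, 5, 9):        # 1, 5, 9 all collide in bucket 1 when M = 4
    t.put(k, k * 10)
assert t.get(5) == 50
t.delete(5)
assert t.get(5) is None and t.get(9) == 90
```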

  • Open addressing ~ closed hashing:

The hash table occupies one physically contiguous block of memory, and no additional space is ever requested. Each entry may be stored in any bucket: every bucket is assigned, in advance, a number of backup buckets, which together form its probing sequence (probing chain).

A search walks along the probing chain, turning to the next bucket one by one, until it either hits the key (success) or reaches an empty bucket (failure).

Linear probing: upon a collision, simply try the next bucket:

[ hash(key) + 1 ] % M
[ hash(key) + 2 ] % M
[ hash(key) + 3 ] % M
...

Advantages:

1. No additional space is required

2. The probing chain is local, which makes full use of the system cache and effectively reduces I/O.

However:

1. The operation time can exceed O(1)

2. Collisions compound: earlier collisions cause later ones, and they pile up into clusters.

Lazy deletion: deletion requires special care. If an entry is removed outright, entries further down the chain are cut off — they clearly exist but can no longer be reached. Instead, mark the entry as deleted: when a search meets the mark it moves on to the next bucket and keeps searching, and when an insertion meets the mark it may place the new entry right there.
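Linear probing with lazy deletion can be sketched as follows (`DELETED` is the tombstone mark described above; integer keys only, for brevity):

```python
EMPTY, DELETED = None, object()   # free slot vs. lazy-deletion tombstone

class LinearProbingTable:
    def __init__(self, M=11):
        self.slots = [EMPTY] * M
        self.M = M

    def _chain(self, key):
        """Probe positions hash(key), +1, +2, ... modulo M."""
        h = key % self.M
        return ((h + i) % self.M for i in range(self.M))

    def insert(self, key):
        for i in self._chain(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key       # a tombstone is reusable for insertion
                return
        raise RuntimeError("table full")

    def contains(self, key):
        for i in self._chain(key):
            s = self.slots[i]
            if s is EMPTY:
                return False              # a truly empty bucket ends the search
            if s is not DELETED and s == key:
                return True               # tombstones are skipped, not stopped at
        return False

    def remove(self, key):
        for i in self._chain(key):
            s = self.slots[i]
            if s is EMPTY:
                return
            if s is not DELETED and s == key:
                self.slots[i] = DELETED   # mark instead of truncating the chain
                return

t = LinearProbingTable(M=5)
for k in (0, 5, 10):                      # all collide at bucket 0
    t.insert(k)
t.remove(0)                               # leaves a tombstone at bucket 0
assert t.contains(5) and t.contains(10)   # later entries remain reachable
assert not t.contains(0)
```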

  • Quadratic probing:

One problem with linear probing is that successive probe positions are too close together: most probes concentrate in a small region of the table. It is therefore worth spacing the trials further apart, and quadratic probing is a concrete embodiment of this idea.

Use squares as the offsets to determine the next bucket to probe:

[ hash(key) + 1^2 ] % M
[ hash(key) + 2^2 ] % M
[ hash(key) + 3^2 ] % M
[ hash(key) + 4^2 ] % M

Advantages:

Clustering is alleviated. Along the probing chain, the gap between successive buckets grows linearly, so once a collision occurs the probe can jump clear of the cluster.


Disadvantages:

1. In external memory, I/O may increase sharply. Quadratic probing damages the locality of data access to some extent, and can even defeat the system cache; but in practice this problem is usually not severe. Without loss of generality, suppose a cache page is 1 KB and each bucket stores only a 4-byte reference: each page then holds 1 KB / 4 B = 256 = 16² buckets. In other words, an extra I/O swap only happens after 16 consecutive collisions, whose probability is actually very small.

2. Empty buckets may become unreachable.

For example, {0, 1, 2, 3, 4, 5, …}² % 12 = {0, 1, 4, 9}

only ever reaches 4 of the 12 buckets; the remaining 2/3 of the buckets, even if empty, can never be found. Here M = 12 is composite, and it is not hard to prove with elementary number theory that whenever the table length M is composite this must happen, because n² % M takes fewer than ⌈M/2⌉ distinct values. (⌈ ⌉ denotes rounding up and ⌊ ⌋ rounding down; this convention is used from here on.)

Change the table length to a prime, e.g. {0, 1, 2, 3, 4, 5, …}² % 11 = {0, 1, 4, 9, 5, 3}.
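Both examples can be verified directly (a quick sketch):

```python
def quad_residues(M):
    """Distinct values of n^2 % M: buckets reachable by quadratic probing."""
    return {n * n % M for n in range(M)}

assert quad_residues(12) == {0, 1, 4, 9}        # composite M: 2/3 unreachable
assert quad_residues(11) == {0, 1, 4, 9, 5, 3}  # prime M: ceil(11/2) = 6 values
assert len(quad_residues(11)) == (11 + 1) // 2
```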

If M is prime, n² % M takes exactly ⌈M/2⌉ distinct values, and all of them appear within the first ⌈M/2⌉ items of the probing chain. Since a prime M is odd, this ratio is just over 50%, and this bound holds even in the worst case.

The precise conclusion is: if M is prime and the load factor λ < 0.5, a probe is guaranteed to find a free bucket; otherwise, it may or may not.

Proof of this conclusion by contradiction:

Suppose there exist 0 ≤ a < b < ⌈M/2⌉ such that the a-th and b-th items of the probing chain collide. Then a² and b² belong to the same congruence class mod M, i.e.

a² ≡ b² (mod M)

b² − a² = (b + a)(b − a) ≡ 0 (mod M)

However,

0 < b − a < b + a < M

so neither (b + a) nor (b − a) is itself divisible by M; M would have to split as a non-trivial product across the two factors, which contradicts M being prime!

  • Two-way quadratic probing:

Once a collision occurs, probe alternately forward and backward, at offsets of successively larger squares:

[ hash(key) + 1^2 ] % M
[ hash(key) - 1^2 ] % M
[ hash(key) + 2^2 ] % M
[ hash(key) - 2^2 ] % M
[ hash(key) + 3^2 ] % M
[ hash(key) - 3^2 ] % M
...

The forward and backward sub-chains each contain ⌈M/2⌉ pairwise distinct buckets, but for some primes the two sequences share buckets other than 0.

Conclusion: if the table length is a prime of the form M = 4K + 3, the first M items of the probing chain are guaranteed to be pairwise distinct.

Proof by contradiction:

Let M = 4K + 3. Suppose step a of the forward probing sequence collides with step b of the backward sequence, where 1 ≤ a, b ≤ ⌊M/2⌋.

A collision means −b² and a² lie in the same congruence class:

−b² ≡ a² (mod M)

Let n = a² + b². Then

0 ≡ a² + b² = n (mod M)

so M is a prime factor of n.

By the corollary of Fermat's two-square theorem, n is divisible not just by M but by M², so

M² ≤ a² + b²

which contradicts a, b ≤ ⌊M/2⌋ (since then a² + b² ≤ M²/2 < M²). So such a collision cannot occur.

Fermat's two-square theorem: a prime p can be expressed as the sum of two squares if and only if p % 4 = 1.

Corollary of Fermat's two-square theorem: a natural number n can be expressed as the sum of two squares if and only if, in its prime factorization, every prime factor of the form M = 4K + 3 appears to an even power.
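The M = 4K + 3 conclusion can be checked numerically (`biquad_chain` is an illustrative helper, not from the text):

```python
def biquad_chain(h, M):
    """First M positions of the two-way chain h, h+1^2, h-1^2, h+2^2, ..."""
    chain = [h % M]
    for k in range(1, M // 2 + 1):
        chain += [(h + k * k) % M, (h - k * k) % M]
    return chain[:M]

# M = 11 = 4*2 + 3: the first M probes are pairwise distinct (full coverage)
assert len(set(biquad_chain(0, 11))) == 11
# M = 13 = 4*3 + 1: forward and backward probes collide (here 3^2 ≡ -2^2 mod 13)
assert len(set(biquad_chain(0, 13))) < 13
```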

Extensions

>>>What makes a good hash function?

  1. Determinism: the same key is always mapped to the same address.
  2. Efficiency: expected O(1) to compute.
  3. Surjectivity: covers the whole hash space as fully as possible.
  4. Uniformity: keys map to each position of the table with probabilities as close as possible (which effectively avoids clustering).

>>>Closing thoughts on hashing

Hash-table lookup takes only O(1), but what is powerful is not just the hash table itself: it is the idea hashing gives us, as in the Karp-Rabin algorithm; bitmask DP ("state compression") carries the same spirit. These are all ideas about compressing information and using it to the fullest. I have attached two recent daily check-in problems from LeetCode that I think are quite good.

leetcode974

leetcode287

This article summarizes Chapter 9 (Dictionaries) of Professor Deng Junhui's online Data Structures course (Part 2, Spring 2020). (Professor Deng's class is really good.)


Origin blog.csdn.net/Tiooo111/article/details/106381217