Interesting Hash Table Optimization: From Avoiding Hash Collisions to Exploiting Hash Collisions

Introduction: Starting from the traditional design of hash tables and the ideas behind it, this article introduces, in plain language, a new design idea: moving from avoiding hash collisions as much as possible to exploiting an appropriate hash collision probability to optimize computing and storage efficiency. The new hash table design shows that effectively applying the parallel processing power of SIMD instructions can greatly improve a hash table's tolerance to hash collisions, which in turn improves query speed and enables extreme compression of the storage space.

1 Background

Hash tables are data structures with excellent lookup performance, and they are used widely throughout computer systems. Although the theoretical lookup time complexity of a hash table is O(1), different implementations still differ hugely in performance, so engineers never stop exploring better hash table structures.

1.1 The core of hash table design

In computer science, a hash table is a data structure that maps a Key to the storage location of its Value through a hash function. The core of hash table design therefore comes down to two questions:

1. How to map a Key to the storage location of its Value efficiently?

2. How to reduce the storage space overhead of the data structure?

Since storage space overhead is also a core design constraint, under limited space the hash function has a very high probability of mapping different Keys to the same storage location, which is a hash collision. Most hash table designs differ mainly in how they handle hash collisions.

When hash collisions occur, the common solutions are open addressing, chaining (the zipper method), and double hashing. Below, however, we introduce two interesting and less common approaches, and then present a new implementation of our own: the B16 hash table.

2 Avoiding Hash Collisions

A traditional hash table's handling of collisions adds extra branch jumps and memory accesses, which makes pipelined CPU instruction execution less efficient. So it is natural to ask: can hash collisions be avoided entirely? This is exactly what a perfect hash function does.

The design of perfect hash functions is often very delicate. For example, the BDZ perfect hash function provided by the CMPH (http://cmph.sourceforge.net/) library makes use of the mathematical concept of acyclic random 3-partite hypergraphs. BDZ maps each Key to a hyperedge of a 3-partite hypergraph via 3 different hash functions. If the hypergraph passes the acyclicity check, each Key is then mapped to one vertex of its hyperedge, and a carefully constructed auxiliary array, with as many entries as the hypergraph has vertices, yields the final storage index for the Key.
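As a quick illustration, a minimal sketch of building and querying such a function with CMPH might look like this (assuming the library is installed; error handling is mostly omitted):

```cpp
#include <cmph.h>
#include <cstdio>
#include <cstring>

int main() {
    // A fixed, fully known key set: the precondition for a perfect hash.
    const char* keys[] = {"apple", "banana", "lemon", "peach"};
    unsigned nkeys = 4;

    cmph_io_adapter_t* source =
        cmph_io_vector_adapter(const_cast<char**>(keys), nkeys);
    cmph_config_t* config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_BDZ);  // the hypergraph-based algorithm
    cmph_t* hash = cmph_new(config);         // construction may fail with some probability
    cmph_config_destroy(config);

    if (hash != nullptr) {
        // Each key maps to a distinct index in [0, nkeys).
        unsigned id = cmph_search(hash, "lemon", (cmph_uint32)strlen("lemon"));
        printf("lemon -> %u\n", id);
        cmph_destroy(hash);
    }
    cmph_io_vector_adapter_destroy(source);
    return 0;
}
```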

The perfect hash function sounds elegant, but it has several practical drawbacks:

  • A perfect hash function can usually only operate on a fixed set, that is, all possible Keys must belong to a known superset; it cannot handle Keys it has never seen before;

  • Constructing a perfect hash function has a certain complexity, and construction can fail with some probability;

  • A perfect hash function is unlike a hash function in cryptography: it is often not a simple mathematical function but a functional construct composed of data structures plus algorithms, so it has its own storage overhead, memory access overhead, and extra branch-jump overhead;

However, in particular scenarios, such as read-only scenarios over a fixed set (e.g., the set of Chinese characters), a perfect hash function can perform very well.

3 Exploiting hash collisions

Even without a perfect hash function, many hash tables deliberately control the probability of hash collisions. The simplest way is to control the space overhead of the hash table through the Load Factor, so that the bucket array keeps enough empty slots to accommodate newly added Keys. The Load Factor acts like a hyperparameter that tunes the hash table's efficiency: generally, the smaller the Load Factor, the more space is wasted and the better the hash table performs.

But some new technologies emerging in recent years show another way to deal with hash collisions: make full use of them.

3.1 SIMD Instructions

SIMD is the acronym of Single Instruction, Multiple Data. Such instructions operate on multiple pieces of data with one instruction; the currently popular GPUs, for example, accelerate neural network computation with very-large-scale SIMD compute engines.

Mainstream CPUs already come with rich SIMD instruction set support. Most of the x86 CPUs available to you support the SSE4.2 and AVX instruction sets, and ARM CPUs have the NEON instruction set. However, outside scientific computing, most applications do not make full use of SIMD instructions.
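As a minimal illustration (assuming an x86 compiler with SSE2 available), the snippet below compares 16 bytes against a target value with a single instruction; this is exactly the primitive the hash tables below build on:

```cpp
#include <immintrin.h>  // x86 SIMD intrinsics
#include <cstdint>
#include <cstdio>

int main() {
    // 16 one-byte values, e.g. small hash codes.
    alignas(16) uint8_t tags[16] = {3, 7, 42, 0, 9, 42, 5, 1,
                                    2, 8, 11, 6, 4, 13, 42, 10};
    __m128i block  = _mm_load_si128(reinterpret_cast<const __m128i*>(tags));
    __m128i needle = _mm_set1_epi8(42);              // broadcast 42 into 16 lanes
    __m128i eq     = _mm_cmpeq_epi8(block, needle);  // 16 byte comparisons at once
    int mask = _mm_movemask_epi8(eq);                // one bit per matching byte
    printf("match mask = 0x%04x\n", mask);           // prints 0x4024: bits 2, 5, 14
    return 0;
}
```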

3.2 F14 Hash Table

The F14 hash table, open-sourced by Facebook in the Folly library, has a very delicate design: it maps Keys to blocks and then uses SIMD instructions for efficient filtering within a block. Because the number of blocks is smaller than the number of buckets in a traditional table, this artificially increases hash collisions, and the collisions are then resolved inside the block with SIMD instructions.

The specific approach is as follows:

  • The hash function computes two hash codes for the Key: H1 and H2. H1 determines the block the Key maps to, and H2, which has only 8 bits, is used for filtering within the block;

  • Each block stores up to 14 elements and has a 16-byte block header. The first 14 bytes of the header store the H2 values of the 14 elements. The 15th byte is a control byte, which mainly records how many elements in this block overflowed from the previous block. The 16th byte is an overflow counter, which mainly records how many elements would have been placed in this block if it had enough space;

  • On insertion, if one of the 14 positions in the block the Key maps to is still free, the element is inserted directly; if the block is full, the overflow counter is increased and insertion is attempted in the next block;

  • On lookup, H1 and H2 are computed from the Key being searched. After H1 modulo the number of blocks determines the block, the block header is read first, and a SIMD instruction compares the query's H2 against the 14 stored H2 values in parallel. If any H2 matches, the corresponding Keys are compared to determine the final result; otherwise, the 16th byte of the header tells whether the next block needs to be checked.

To take full advantage of SIMD parallelism, F14 uses an 8-bit hash value H2 within the block, because a 128-bit SIMD instruction can perform up to 16 parallel comparisons of 8-bit integers. Although the 1/256 theoretical collision probability of an 8-bit hash is not low, it also means that with probability 255/256 the overhead of a full Key comparison is avoided, which lets the hash table tolerate a much higher collision probability.
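The in-block filtering step can be sketched as follows (illustrative only, not Folly's actual code; the layout mirrors the description above, and `find_in_chunk` is a hypothetical helper):

```cpp
#include <immintrin.h>
#include <cstdint>

// F14-style block: 14 tag bytes + 2 control bytes, followed by the keys.
struct Chunk {
    uint8_t  tags[14];
    uint8_t  overflowed_from_prev;  // 15th byte: control byte
    uint8_t  overflow_count;        // 16th byte: overflow counter
    uint64_t keys[14];
};

// Return the index of the matching key in the block, or -1 if absent.
int find_in_chunk(const Chunk& c, uint8_t h2, uint64_t key) {
    __m128i header = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&c));
    __m128i eq     = _mm_cmpeq_epi8(header, _mm_set1_epi8((char)h2));
    int mask = _mm_movemask_epi8(eq) & 0x3FFF;  // keep only the 14 tag lanes
    while (mask != 0) {                 // iterate over candidate positions
        int i = __builtin_ctz(mask);    // lowest set bit = next candidate
        if (c.keys[i] == key) return i; // confirm with a full key compare
        mask &= mask - 1;               // clear that bit and continue
    }
    return -1;                          // only H2 collisions, no real match
}
```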

4 B16 Hash Table

Regardless of the design inside the block, F14 is essentially an open-addressing hash table. The 15th and 16th bytes of each block header serve the control strategy of open addressing, leaving only 14 bytes for hash codes, hence the name F14.

So we considered whether we could start from a different angle and organize the blocks with the zipper method instead. Since the control information can then be omitted, each block can hold 16 elements, and we named the result B16.

4.1 B16 Hash Data Structure



△B16 Hash table data structure (3-element example)

The figure above shows the data structure of a B16 hash table with 3 elements per block. In the middle, in green, is the familiar BUCKET array, where each bucket stores the head pointer of its CHUNK zipper. Compared with F14, each CHUNK on the right drops the control bytes and gains a next pointer to the following CHUNK.

B16 also computes two hash codes for the Key through the hash function: H1 and H2. For example, the two hash codes of "Lemon" are 0x24EB and 0x24; using the high 8 bits of H1 as H2 is generally sufficient.
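A minimal sketch of this layout and hash split (field names are illustrative; we assume 16 elements per chunk and 64-bit hashes):

```cpp
#include <cstdint>
#include <utility>

// Hypothetical B16 chunk: a 16-byte tag header, a zipper pointer, 16 slots.
template <typename Key, typename Value>
struct B16Chunk {
    uint8_t  tags[16];         // H2 tag of each occupied slot
    uint32_t size = 0;         // slots 0..size-1 are occupied
    B16Chunk* next = nullptr;  // next chunk in this bucket's zipper
    std::pair<Key, Value> items[16];
};

// Split one 64-bit hash into H1 (bucket selection) and H2 (in-chunk tag).
inline void split_hash(uint64_t h, uint64_t& h1, uint8_t& h2) {
    h1 = h;
    h2 = static_cast<uint8_t>(h >> 56);  // high 8 bits, like 0x24 from 0x24EB
}
```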

On insertion, the bucket of a Key is computed by taking H1 modulo the number of buckets. For example, the bucket of "Lemon" is 0x24EB mod 3 = 1. Then find the first vacancy in the block zipper of bucket 1 and write the Key's H2 and the element into that block. If the block zipper does not exist yet, or is already full, create a new block for the zipper to hold the inserted element.

On lookup, first find the corresponding bucket zipper through H1, then run SIMD-based H2 comparisons block by block: load the 16-byte block header, which contains 16 H2 tags, into a 128-bit register, broadcast the query's H2 into another 128-bit register, and compare all 16 lanes simultaneously with SIMD instructions. If none match, move on to the next block; if some H2 matches, further compare whether the corresponding element's Key equals the searched Key. This continues until the whole zipper has been traversed or the element is found.
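Continuing the sketch above, a lookup over one bucket's zipper might look like this (again illustrative, not the production code):

```cpp
#include <immintrin.h>
#include <cstdint>

// Probe one zipper: SIMD-filter each chunk by H2, then confirm by Key.
template <typename Key, typename Value>
Value* b16_find(B16Chunk<Key, Value>* chunk, uint8_t h2, const Key& key) {
    const __m128i needle = _mm_set1_epi8(static_cast<char>(h2));
    for (; chunk != nullptr; chunk = chunk->next) {
        __m128i tags = _mm_loadu_si128(
            reinterpret_cast<const __m128i*>(chunk->tags));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(tags, needle));
        mask &= (1u << chunk->size) - 1;  // consider occupied slots only
        while (mask != 0) {
            int i = __builtin_ctz(mask);
            if (chunk->items[i].first == key) return &chunk->items[i].second;
            mask &= mask - 1;
        }
    }
    return nullptr;  // traversed the whole zipper without a hit
}
```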

On deletion, first find the corresponding element, then overwrite it with the element at the tail of the block zipper.

Of course, the number of elements per block in a B16 hash table can be adjusted flexibly to the width of the available SIMD instructions; for example, 256-bit instructions would allow a block size of 32. But the lookup algorithm is not the only thing that affects hash table performance: the speed and locality of memory accesses also matter a great deal. A block size of 16 or less keeps accesses within an x86 CPU cache line in most cases, which makes it the better choice.

In an ordinary zipper hash table, each node of the zipper holds exactly one element. B16's block zipper, with 16 elements per node, creates many holes. To keep the holes as few as possible, we must increase the probability of hash collisions, that is, make the BUCKET array as small as possible. Our experiments show that B16 performs best overall when the Load Factor is between 11 and 13. In effect, this moves the holes that used to sit in the BUCKET array into the CHUNK zippers, and it also saves the per-node next-pointer overhead of an ordinary zipper.

4.2 B16Compact Hash Data Structure

△B16Compact hash table data structure (3-element example)

B16Compact compresses the hash table structure to the extreme.

First, it omits the next pointer in each CHUNK, merges all CHUNKs into one array, and fills in all the holes. For example, the zipper of BUCKET[1] in [Figure 1] originally had 4 elements, including Banana and Lemon; its first two elements go into CHUNK[0] in [Figure 2]. And so on: every CHUNK in the CHUNK array is full except the last one.

Then it omits the zipper head pointer in each BUCKET and keeps only an array index: the index of the CHUNK that holds the first element of the original zipper. For example, the first element of BUCKET[1]'s zipper in [Figure 1] went into CHUNK[0] in [Figure 2], so the new BUCKET[1] stores just the index 0.

Finally, a tail BUCKET is appended, recording the index of the last CHUNK in the CHUNK array.

After this processing, the elements of each original BUCKET zipper remain contiguous in the new data structure; each BUCKET still points to the first CHUNK containing its elements, and the index stored in the next BUCKET still identifies the last CHUNK containing its elements. The difference is that one CHUNK may now contain elements from several BUCKET zippers. Although more CHUNKs may need to be searched, each CHUNK can be filtered quickly with SIMD instructions, so the impact on overall lookup performance is relatively small.

This read-only hash table supports only lookup, and the lookup process differs little from before. Taking Lemon as an example: H1 = 0x24EB leads to bucket 1, whose zipper starts at CHUNK index 0 and ends at CHUNK index 1. Using the same algorithm as B16, search CHUNK[0] first; if Lemon is not found there, continue to CHUNK[1] and find the element.
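A sketch of the compact lookup, under the same assumptions as the B16 sketches above (flat arrays, hypothetical field names; empty slots in the final chunk are assumed zero-filled):

```cpp
#include <immintrin.h>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical read-only layout: all chunks in one flat array; each bucket
// stores the index of the first chunk holding its elements.
template <typename Key, typename Value>
struct B16Compact {
    struct Chunk {
        uint8_t tags[16];
        std::pair<Key, Value> items[16];
    };
    std::vector<Chunk>    chunks;
    std::vector<uint32_t> buckets;  // n_buckets + 1 entries; the last is the tail BUCKET

    const Value* find(uint64_t h1, uint8_t h2, const Key& key) const {
        uint32_t b     = static_cast<uint32_t>(h1 % (buckets.size() - 1));
        uint32_t first = buckets[b];      // first chunk with this bucket's elements
        uint32_t last  = buckets[b + 1];  // next bucket's start bounds the scan
        const __m128i needle = _mm_set1_epi8(static_cast<char>(h2));
        for (uint32_t c = first; c <= last && c < chunks.size(); ++c) {
            __m128i tags = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(chunks[c].tags));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(tags, needle));
            while (mask != 0) {  // candidates are confirmed by a full Key compare
                int i = __builtin_ctz(mask);
                if (chunks[c].items[i].first == key)
                    return &chunks[c].items[i].second;
                mask &= mask - 1;
            }
        }
        return nullptr;
    }
};
```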

The theoretical additional storage overhead of B16Compact can be calculated by the following formula:

extra bits/key = 8 + (⌈n / LoadFactor⌉ + 1) × ⌈log₂(n / 16)⌉ / n

where n is the number of elements in the read-only hash table: each element stores one 8-bit H2, and the BUCKET array has ⌈n / LoadFactor⌉ + 1 entries (including the tail BUCKET), each an index of ⌈log₂(n / 16)⌉ bits into the array of roughly n / 16 CHUNKs.

When n is 1 million and the Load Factor is 13, the theoretical additional storage overhead of the B16Compact hash table is 9.23 bits/key (8 bits of H2 per key, plus 76,925 bucket entries of 16 bits each spread over 1 million keys), that is, the extra cost of storing each Key is only a little more than 1 byte. This is almost on par with some minimal perfect hash functions, and without the possibility of construction failure.

5 Experimental data

5.1 Experimental setup

The Key and Value types of the hash tables in the experiment are both uint64_t, and the input array of key-value pairs is pre-generated with a random number generator. Each hash table is initialized with the number of elements, so no rehash is needed during insertion.

  • Insertion performance: the total time to insert N elements one by one, divided by N, in ns/key;

  • Lookup performance: the total time of 200,000 random Key lookups (all hits) plus 200,000 random value lookups (which may miss), divided by 400,000, in ns/key;

  • Storage space: the total allocated space divided by the size of the hash table, in bytes/key. F14 and B16 both have interface functions that return the total allocated space directly; for unordered_map it is estimated from its node-based layout (one next pointer per node plus the bucket array) with the following formula:

total space ≈ bucket_count × sizeof(void*) + n × (sizeof(std::pair<const Key, Value>) + sizeof(void*))
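As an illustration of the measurement method, a minimal sketch (not the original harness) for unordered_map might look like this:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <unordered_map>
#include <utility>
#include <vector>

int main() {
    const size_t N = 1'000'000;
    std::mt19937_64 rng(42);
    std::vector<std::pair<uint64_t, uint64_t>> input(N);
    for (auto& kv : input) kv = {rng(), rng()};

    std::unordered_map<uint64_t, uint64_t> map(N);  // sized up front: no rehash
    auto t0 = std::chrono::steady_clock::now();
    for (const auto& kv : input) map.emplace(kv.first, kv.second);
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    printf("insert: %.1f ns/key\n", ns / N);

    // Storage estimate: one next pointer per node plus the bucket array.
    double bytes =
        map.bucket_count() * sizeof(void*) +
        map.size() * (sizeof(std::pair<const uint64_t, uint64_t>) + sizeof(void*));
    printf("space: %.2f bytes/key\n", bytes / map.size());
    return 0;
}
```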

The Folly library is compiled with -mavx -O2 and uses its default Load Factor; B16 is compiled with -mavx -O2 with the Load Factor set to 13; unordered_map is the version shipped with Ubuntu, using the default Load Factor.

The test server is a 4-core, 8 GB CentOS 7u5 virtual machine with an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz; the programs are compiled and executed in an Ubuntu 20.04.1 LTS Docker container.

5.2 Experimental data

△ Insertion performance comparison

The lines in the figure above compare the insertion performance of unordered_map, F14ValueMap and B16ValueMap; the columns show the storage cost of each hash table.

As the figure shows, the B16 hash table provides significantly better insertion performance than unordered_map while using significantly less storage.

Because the F14 hash table applies different automatic optimization strategies to the Load Factor, its storage overhead fluctuates somewhat across hash table sizes, but B16's overall storage overhead is still lower than F14's. B16's insertion performance is better than F14's below 1 million keys, but worse at 10 million keys. This may be because the memory locality of B16's zippers is worse than F14's when the data volume is large.

△Search performance comparison

The lines in the figure above compare the lookup performance of unordered_map, F14ValueMap, B16ValueMap and B16Compact; the columns show the storage cost of each hash table.

As the figure shows, the B16 and B16Compact hash tables provide significantly better lookup performance than unordered_map while using significantly less storage.

Relative to F14, B16's lookup performance mirrors its insertion performance: clearly better below 1 million keys, slightly worse at 10 million keys.

The performance of the B16Compact hash table is worth noting. Since the Key and Value types of the experimental hash table are both uint64_t, storing a Key, Value pair itself already consumes 16 bytes, while B16Compact stays at about 17.31 bytes/key across hash tables of different sizes. That means the hash structure costs only 1.31 extra bytes per key. The reason the theoretical overhead of 9.23 bits/key is not reached is that our BUCKET array is not compressed to the extreme with bit packing (which might hurt performance) but uses uint32_t entries.



6 Summary

Inspired by F14, we designed the B16 hash table with a more understandable data structure, which makes the logic of insertion, deletion and lookup simpler. Experiments show that B16's storage overhead and performance are better than F14's in some scenarios.

The new hash table design shows that effectively applying the parallel processing power of SIMD instructions can greatly improve a hash table's tolerance to hash collisions, thereby improving query speed and enabling extreme compression of the storage space. This shifts the design philosophy of hash tables from avoiding hash collisions as much as possible to using an appropriate hash collision probability to optimize computing and storage efficiency.



Original link: https://mp.weixin.qq.com/s/oeuExiW3DYQnBG8HvDgBAg


