Data structures and algorithms: string-to-number hash functions and Bloom filters

1. Bloom Filter

The Bloom filter was proposed by Burton Howard Bloom in 1970. It consists of a very long binary vector (a bit array) and a series of random mapping (hash) functions. A Bloom filter can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time are far better than those of general-purpose algorithms; its disadvantages are a certain false positive rate and the difficulty of deleting elements.

(1) Basic concepts
To judge whether an element is in a set, the usual idea is to store all the elements and then decide by comparison. Linked lists, trees, and similar data structures all work this way. But as the number of elements in the set grows, more storage space is needed and retrieval gets slower (O(n) for a linked list, O(log n) for a balanced tree). There is, however, another data structure called a hash table. It uses a hash function to map an element to a position in an array. Applied to a bit array, this means we only need to check whether that position is 1 to know whether the element may be in the set. This is the basic idea of the Bloom filter.
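
The basic idea can be sketched with a single hash function and a bit array. This is only an illustration: the names `BitSet`, `simpleHash`, `add`, and `mightContain`, the base-31 hash, and the array size of 1024 are assumptions made for the example, not something the article specifies.

```javascript
// Minimal sketch: one hash function mapping strings to positions in a bit array.
// All names and the size 1024 are illustrative assumptions.

class BitSet {
  constructor(size) {
    this.size = size;
    this.bytes = new Uint8Array(Math.ceil(size / 8)); // 8 bits per byte
  }
  set(i) {
    this.bytes[i >> 3] |= 1 << (i & 7);
  }
  get(i) {
    return (this.bytes[i >> 3] & (1 << (i & 7))) !== 0;
  }
}

// A simple multiplicative string hash (base 31), kept in unsigned 32-bit range.
function simpleHash(str) {
  let h = 0;
  for (let i = 0; i < str.length; i++) {
    h = (Math.imul(h, 31) + str.charCodeAt(i)) >>> 0;
  }
  return h;
}

const bits = new BitSet(1024);
const add = (s) => bits.set(simpleHash(s) % bits.size);
const mightContain = (s) => bits.get(simpleHash(s) % bits.size);

add("apple");
add("banana");

console.log(mightContain("apple"));  // true: add() set this bit
console.log(mightContain("cherry")); // usually false, but a collision could make it true
```

Note that the element itself is never stored; only one bit per inserted element is touched, which is why a positive answer can only ever mean "possibly present".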

(2) Advantages
Compared with other data structures, Bloom filters have large advantages in both space and time. The storage space is fixed when the filter is created, and insertion and query each take O(k) time for k hash functions, regardless of how many elements are already stored. In addition, the hash functions are independent of one another, which makes parallel hardware implementation convenient. A Bloom filter does not store the elements themselves, which is an advantage where confidentiality requirements are very strict. A Bloom filter can also represent the universal set (all bits set to 1), which data structures that store elements explicitly cannot do.
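
To make the space advantage concrete, here is a rough sizing sketch. It uses the standard textbook formulas for the bit count m and the number of hash functions k, given n expected elements and a target false positive rate p; the article does not state these formulas, and the function name `bloomSize` and the example figures (one million elements, 1% rate) are assumptions.

```javascript
// Standard sizing approximations for a Bloom filter (not taken from the article):
//   m = -n * ln(p) / (ln 2)^2   bits
//   k = (m / n) * ln 2          hash functions
function bloomSize(n, p) {
  const m = Math.ceil((-n * Math.log(p)) / (Math.LN2 * Math.LN2));
  const k = Math.round((m / n) * Math.LN2);
  return { bits: m, hashes: k, bytes: Math.ceil(m / 8) };
}

// Hypothetical workload: one million elements, 1% false positive target.
const plan = bloomSize(1000000, 0.01);
console.log(plan); // roughly 9.6 million bits (about 1.2 MB) and 7 hash functions
```

Under these assumptions the filter needs fewer than 10 bits per element no matter how long the elements are, whereas a structure that stores the elements must hold every one of them in full.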

(3) Disadvantages
The disadvantages of Bloom filters are as obvious as the advantages. The false positive rate is one of them: as the number of stored elements increases, the false positive rate increases. A common remedy is to keep a small whitelist of elements that may be misjudged. If the number of elements is small, however, an ordinary hash table is sufficient.
In addition, elements generally cannot be removed from a Bloom filter. An obvious idea is to turn the bit array into an array of integer counters, incrementing the counters at an element's positions on insertion and decrementing them on deletion. Deleting safely is not that simple, however. First, we must be sure that the element being deleted really is in the filter, and the filter alone cannot guarantee this. Second, counter overflow (wraparound) can also cause problems. Much work on reducing the false positive rate has produced many variants of the Bloom filter.
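
The growth of the false positive rate with the number of stored elements can be estimated with the usual approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions, and n the number of inserted elements. The formula is the standard textbook estimate rather than something given in the article, and the values m = 10000 and k = 7 below are arbitrary.

```javascript
// Textbook approximation of the Bloom filter false positive rate:
//   p ≈ (1 - e^(-k*n/m))^k
function falsePositiveRate(m, k, n) {
  return Math.pow(1 - Math.exp((-k * n) / m), k);
}

// Arbitrary example: fixed filter size and hash count, growing element count.
const numBits = 10000;
const numHashes = 7;
for (const n of [500, 1000, 2000, 4000]) {
  const rate = falsePositiveRate(numBits, numHashes, n);
  console.log("n=" + n + " estimated false positive rate=" + rate.toFixed(4));
}
```

With the filter size held fixed, each doubling of n pushes the estimate up sharply, which is why the bit array has to be sized for the expected number of elements in advance.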
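
The counter idea described above is usually called a counting Bloom filter. The sketch below is one possible shape for it, under several assumptions: the class and method names are invented for the example, the k positions come from a seeded FNV-1a-style hash, and the 8-bit counters saturate at 255 instead of wrapping.

```javascript
// Sketch of a counting Bloom filter. Names, hash choice, and counter width
// are assumptions for illustration only.
class CountingBloomFilter {
  constructor(size, numHashes) {
    this.size = size;
    this.numHashes = numHashes;
    this.counters = new Uint8Array(size); // one 8-bit counter per position
  }

  // FNV-1a-style 32-bit hash, varied by a seed to get several hash functions.
  _hash(str, seed) {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h % this.size;
  }

  _positions(str) {
    const out = [];
    for (let i = 0; i < this.numHashes; i++) out.push(this._hash(str, i));
    return out;
  }

  add(str) {
    for (const p of this._positions(str)) {
      // Saturate at 255 rather than wrapping back to 0.
      if (this.counters[p] < 255) this.counters[p]++;
    }
  }

  mightContain(str) {
    return this._positions(str).every((p) => this.counters[p] > 0);
  }

  remove(str) {
    // Only correct if str was really added earlier; the filter cannot verify
    // that, because mightContain() may itself be a false positive.
    if (!this.mightContain(str)) return false;
    for (const p of this._positions(str)) {
      // A saturated counter has lost count, so it is left stuck at 255.
      if (this.counters[p] > 0 && this.counters[p] < 255) this.counters[p]--;
    }
    return true;
  }
}

const cbf = new CountingBloomFilter(1024, 3);
cbf.add("alice");
cbf.add("bob");
const removed = cbf.remove("alice");
console.log(removed);                 // true
console.log(cbf.mightContain("bob")); // true: bob's counters are still above zero
```

The two caveats from the text show up directly in the code: `remove` has to trust that the element was inserted, and the overflow guard trades exact counts for the guarantee that no counter ever wraps around.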

2. Some hash functions for converting strings to numbers

Commonly used algorithms include:

  • BKDRHash
  • APHash
  • DJBHash
  • JSHash
  • RSHash
  • SDBMHash
  • PJWHash
  • ELFHash

With the bling-hashes package, they can be used as follows:

var bling = require("bling-hashes");
var hash = bling.bkdr("Hello world!"); 
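
For readers who want to see what such functions do internally, here is a sketch of three of the listed algorithms in their conventional forms: BKDRHash (multiply by a seed such as 131), DJBHash (start at 5381, multiply by 33), and SDBMHash (multiply by 65599). The function names are illustrative, and the results are kept as unsigned 32-bit integers computed over UTF-16 code units, so they are not guaranteed to match what the bling-hashes package returns.

```javascript
// Conventional definitions, written as a sketch. Output may differ from
// bling-hashes, which may mask or encode values differently.

function bkdrHash(str) {
  const seed = 131; // customary seeds: 31, 131, 1313, 13131, ...
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (Math.imul(hash, seed) + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

function djbHash(str) {
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    // hash * 33 + c, the same as ((hash << 5) + hash) + c
    hash = (Math.imul(hash, 33) + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

function sdbmHash(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    // hash * 65599 + c, the same as c + (hash << 6) + (hash << 16) - hash
    hash = (Math.imul(hash, 65599) + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

console.log(bkdrHash("Hello world!"));
console.log(djbHash("Hello world!"));
console.log(sdbmHash("Hello world!"));
```

`Math.imul` is used so that the multiplication wraps at 32 bits the way it would in C, instead of losing precision in JavaScript's floating-point numbers.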

3. Bloom filter and hash algorithms

A hash algorithm is also called a fingerprint or digest algorithm in applications. It maps a plaintext string of any length to a shorter, fixed-length data string (the hash value). The hash algorithms in common use today are mainly the MD5 family and the SHA family. A good hash algorithm needs four properties: fast forward computation, difficulty of reversal, input sensitivity, and collision avoidance.

  1. Fast forward computation: given the plaintext and the hash algorithm, the hash value can be computed in limited time with limited resources.
  2. Difficulty of reversal: given a hash value, it is hard to derive the plaintext within a limited time.
  3. Input sensitivity: any change to the original input, however small, should produce a very different hash value.
  4. Collision avoidance: it is hard to find two different plaintexts whose hash values are the same. Collision avoidance is also called collision resistance, and is divided into weak and strong collision resistance. If, given a plaintext, no other plaintext that collides with it can feasibly be found, the algorithm has weak collision resistance; if no pair of colliding plaintexts can feasibly be found at all, the algorithm has strong collision resistance.
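
Input sensitivity is easy to observe with a real digest algorithm. The sketch below uses SHA-256 from Node's built-in `crypto` module; the helper names `sha256Hex` and `bitDifference` and the two sample strings are assumptions made for the example.

```javascript
// Sketch: two inputs differing in one character give very different digests.
const nodeCrypto = require("crypto");

function sha256Hex(text) {
  return nodeCrypto.createHash("sha256").update(text, "utf8").digest("hex");
}

// Count how many bits differ between two equal-length hex digests.
function bitDifference(hexA, hexB) {
  const a = Buffer.from(hexA, "hex");
  const b = Buffer.from(hexB, "hex");
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) {
      diff += x & 1;
      x >>= 1;
    }
  }
  return diff;
}

const d1 = sha256Hex("Hello world!");
const d2 = sha256Hex("Hello world?");
console.log(d1);
console.log(d2);
console.log(bitDifference(d1, d2) + " of 256 bits differ");
```

For a well-behaved digest the number of differing bits tends to be around half of the output length, even though only the last character of the input changed.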

Because a hash maps arbitrary content to a fixed-length string, and the probability that different content maps to the same string is very low, it provides a good "content → index" relationship. For given content and a storage array, one can construct a suitable hash function whose values do not exceed the size of the array, which enables fast content-based lookup and answers the question of whether an element is in a set. However, restricting the hash values to the range of the array size causes a large number of hash collisions, and performance drops rapidly. For this reason, the Bloom filter was designed on top of hash algorithms.
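
The effect of squeezing hash values into the array size can be sketched as follows. The FNV-1a hash, the helper names `indexFor` and `countCollisions`, and the generated "user-N" strings are assumptions chosen for the illustration.

```javascript
// Sketch: map content to an array index with hash % size, then count how
// often two different items land on the same index.

function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function indexFor(content, arraySize) {
  return fnv1a(content) % arraySize;
}

function countCollisions(items, arraySize) {
  const used = new Set();
  let collisions = 0;
  for (const item of items) {
    const idx = indexFor(item, arraySize);
    if (used.has(idx)) collisions++;
    else used.add(idx);
  }
  return collisions;
}

// Hypothetical data set of 1000 distinct strings.
const items = Array.from({ length: 1000 }, (_, i) => "user-" + i);

console.log(countCollisions(items, 1000000)); // large array: few collisions expected
console.log(countCollisions(items, 500));     // small array: at least 500 are unavoidable
```

With 1000 items and only 500 slots, at least 500 items must share a slot with an earlier one regardless of how good the hash function is, which is the performance problem the paragraph above describes.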

A Bloom filter uses multiple hash functions to improve space utilization. For a given input, the hash functions compute several addresses, and each of these addresses in the array is set to 1. A lookup performs the same computation and checks the corresponding positions. If they are all 1, the input is likely to be present. As shown in the figure below, functions Hash1, Hash2, ..., HashK are applied to the content to compute positions h1, h2, ..., hk. If these positions are all 1, the content is present with high probability.
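
The procedure just described can be sketched as a small class. This is a minimal illustration, not a reference implementation: the class and method names are invented, and instead of k unrelated hash functions it derives the k positions from two base hashes (FNV-1a and DJB) as h1 + i * h2, a common shortcut that the article does not mention.

```javascript
// Minimal Bloom filter sketch. Names and the double-hashing scheme are
// assumptions for illustration.

function fnv1aHash(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function djb2Hash(str) {
  let h = 5381;
  for (let i = 0; i < str.length; i++) {
    h = (Math.imul(h, 33) + str.charCodeAt(i)) >>> 0;
  }
  return h;
}

class BloomFilter {
  constructor(numBits, numHashes) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new Uint8Array(Math.ceil(numBits / 8));
  }

  // Positions h1, h2, ..., hk for the given content.
  _positions(str) {
    const h1 = fnv1aHash(str);
    const h2 = (djb2Hash(str) | 1) >>> 0; // force an odd, non-zero step
    const out = [];
    for (let i = 0; i < this.numHashes; i++) {
      out.push((h1 + i * h2) % this.numBits);
    }
    return out;
  }

  add(str) {
    for (const p of this._positions(str)) {
      this.bits[p >> 3] |= 1 << (p & 7); // mark the address as 1
    }
  }

  mightContain(str) {
    // Present with high probability only if every position is 1.
    return this._positions(str).every(
      (p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0
    );
  }
}

const filter = new BloomFilter(10000, 7);
const words = ["apple", "banana", "cherry"];
for (const w of words) filter.add(w);

console.log(words.every((w) => filter.mightContain(w))); // true: no false negatives
console.log(filter.mightContain("durian")); // false in most cases, true on a false positive
```

A `false` answer is definitive because at least one required bit is still 0, while a `true` answer only says that all k bits happen to be set, possibly by other elements.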

[Figure: the content is passed through Hash1, Hash2, ..., HashK to positions h1, h2, ..., hk in the bit array]
The reason it is only a high probability is that a single hash algorithm and a Bloom filter rest on the same idea: both encode the content, and because storage is limited, collisions can occur. Both methods can therefore produce false positives, but neither produces false negatives. In practice, however, the false positive rate of a Bloom filter is much lower than that of a single hash algorithm.

Origin blog.csdn.net/m0_50662680/article/details/112539832