Hashing: Exploring Fast Data Storage and Search Methods

As an efficient data storage structure, the hash table establishes a one-to-one mapping between an element's storage location and its key, which greatly speeds up element lookup. However, hashing also faces the problem of hash collisions: different keys computed through the same hash function can yield the same hash address. How to handle hash collisions is therefore an important issue.

This blog will introduce the concept of hashing, how hash collisions are handled, common hash function design principles and concrete implementations, and related applications of the hashing idea. By studying this blog, you will gain a deep understanding of how hash tables work and be able to choose an appropriate hash function and collision-resolution strategy.

If you are interested in hashing, or want to know how to design a suitable hash function and resolve hash collisions, this blog will provide some valuable reference and guidance. Let's explore the mysteries of hashing together!

The concept of hashing

In sequential structures and balanced trees, there is no direct relationship between an element's key and its storage location, so finding an element requires multiple key comparisons. Sequential search has time complexity O(N); in a balanced tree the cost is the height of the tree, O(log₂N). In both cases, search efficiency depends on how many key comparisons are made along the way.
The ideal search method: obtain the element from the table in a single step, without any comparisons.
If we construct a storage structure in which some function (hashFunc) establishes a one-to-one mapping between an element's storage location and its key, then the element can be found quickly through this function.
With this structure:

  • Insert an element
    Apply the function to the key of the element to be inserted to compute its storage location, and store the element at that location
  • Search for an element
    Apply the same function to the key, treat the resulting value as the element's storage location, and compare at that position in the structure; if the keys are equal, the search succeeds

This approach is called hashing; the conversion function used is called the hash function, and the structure built this way is called a hash table.

For example: data set {1, 7, 6, 4, 5, 9};
the hash function is set to: hash(key) = key % capacity, where capacity is the total size of the underlying storage space.
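
With capacity = 10 this gives hash(1)=1, hash(7)=7, hash(6)=6, hash(4)=4, hash(5)=5, hash(9)=9, so each key sits in the slot equal to its own value and slots 0, 2, 3 and 8 stay empty (the original figure, reconstructed as text):

index:  0  1  2  3  4  5  6  7  8  9
value:     1        4  5  6  7     9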

Searching with this method does not require multiple comparisons of key codes, so the search speed is relatively fast.
Question: According to the above hash method, what problems will occur when element 44 is inserted into the collection?

Foreseeably, assuming no expansion, hash(44) = 44 % 10 = 4, but slot 4 is already occupied. Where should 44 be placed? This is a hash collision.

hash collision

For two data elements with keys k_i and k_j (i != j), we have k_i != k_j but Hash(k_i) == Hash(k_j).

That is, different keys produce the same hash address through the same hash function; this phenomenon is called a hash collision (or hash conflict).

Data elements with different keys but the same hash address are called "synonyms".
How to deal with hash collisions?

That depends on how the mapping is handled, i.e. on the design of the hash function.

hash function

One cause of hash collisions may be a hash function that is not designed well enough.
Hash function design principles:

  • The domain of the hash function must include all keys that need to be stored; if the hash table allows m addresses, its value range must lie between 0 and m-1
  • The addresses calculated by the hash function can be evenly distributed in the entire space
  • The hash function should be relatively simple

Common Hash Functions

  1. Direct addressing method – (commonly used)
    Take a linear function of the key as the hash address: Hash(Key) = A*Key + B
    Advantages: simple, uniform, no hash collisions
    Disadvantages: the distribution of the keys must be known in advance
    Usage scenario: suitable when the keys are relatively few and continuous (see the sketch after this list)

  2. Division-remainder method – (commonly used)
    Let the number of addresses allowed in the hash table be m. Take a prime number p that is not greater than m but closest to (or equal to) m as the divisor, and convert the key into a hash address with the hash function: Hash(key) = key % p (p <= m)

    Disadvantage: hash collisions occur, so the focus shifts to resolving them

  3. Mid-square method – (for understanding)
    Suppose the key is 1234; its square is 1522756, and the middle 3 digits, 227, are taken as the hash address. Likewise for key 4321: its square is 18671041, and the middle 3 digits 671 (or 710) are taken as the hash address.
    The mid-square method suits cases where the distribution of the keys is unknown and the number of digits is not very large

  4. Folding method – (for understanding)
    Divide the key from left to right into several parts with equal numbers of digits (the last part may be shorter), add these parts together, and, according to the hash table length, take the last few digits of the sum as the hash address.
    The folding method does not require knowing the key distribution in advance and suits keys with many digits

  5. Random number method – (for understanding)
    Choose a random function and take the random function value of the key as its hash address, i.e. H(key) = random(key), where random is a random number function.
    This method is usually used when key lengths differ

  6. Digit analysis method – (for understanding)
    Suppose there are n d-digit keys, where each digit may take r different symbols. The frequency with which these r symbols appear may differ from digit to digit: on some digits they are distributed evenly, each symbol appearing with equal probability, while on others the distribution is uneven and only certain symbols appear frequently. According to the size of the hash table, select several digits on which the symbols are evenly distributed as the hash address. For example:

Suppose you want to store a company's employee registration form with mobile phone numbers as keys. The first 7 digits are very likely to be identical, so we can choose the last four digits as the hash address. If that extraction still collides easily, you can also reverse the extracted digits (e.g. turn 1234 into 4321), rotate them right (1234 into 4123), rotate them left, add the first two digits to the last two (12 + 34 = 46), and so on.
The digit analysis method usually suits cases where the number of keys is relatively large and the distribution of certain digits is known in advance to be fairly uniform.

Note: the more sophisticated the hash function design, the lower the probability of hash collisions, but collisions cannot be avoided entirely.
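
As a small illustration of the direct addressing method above (item 1), here is a minimal sketch that counts letter occurrences, assuming all input is lowercase; each key maps straight to its own slot, so no collisions can occur:

// Direct addressing: Hash(ch) = ch - 'a' gives every lowercase letter
// its own slot, so lookups need no comparisons at all.
int count[26] = {0};
for (char ch : string("hello"))
    count[ch - 'a']++;       // count['l' - 'a'] ends up as 2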

Solution to hash collision

Two common methods for resolving hash collisions are: closed hashing and open hashing

closed hash

Closed hashing, also known as the open addressing method: when a hash collision occurs, as long as the hash table is not full, there must still be an empty slot somewhere in the table, so the key can be stored in the next empty slot after the conflicting position. So how do we find that next empty slot?

There are two ways: linear probing and quadratic probing

linear probing

The concept of linear probing: starting from the position where the collision occurred, probe the subsequent positions one by one until the next empty slot is found.

For example:

Take the earlier example: now we need to insert element 44. First compute the hash address via the hash function: hashAddr is 4, so in theory 44 should be inserted at position 4. But that position already holds element 4, i.e. a hash collision occurs. What does linear probing do? Probe forward from position 4 until the next empty slot is found.
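
Concretely, starting from the table above, probing visits slots 4, 5, 6 and 7 (all occupied) and places 44 in the first empty slot, index 8:

index:  0  1  2  3  4  5  6  7  8  9
value:     1        4  5  6  7  44 9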

  • Insert

    • Use the hash function to obtain the position of the element to be inserted in the hash table
    • If that position is empty, insert the new element directly. If that position is occupied, a hash collision has occurred; use linear probing to find the next empty slot and insert the new element there
  • Delete

    • When closed hashing is used to handle collisions, an existing element cannot simply be physically removed, since doing so would break the search for other elements. For example, if element 4 were deleted outright, the lookup of 44 could fail (the probe sequence would stop at the emptied slot). So linear probing deletes an element by marking it as pseudo-deleted.

    • // Give each slot of the hash table a state marker:
      // EMPTY: slot is empty; EXIST: slot holds an element; DELETE: element was deleted
      enum State{EMPTY, EXIST, DELETE};
      
  • Implementation of linear probing

    // Note: keys in this hash table are unique, i.e. an element whose key
    // already exists is not inserted again.
    // For simplicity, comparison is bound directly to the element here.
    template<class K, class V>
    class HashTable
    {
        struct Elem
        {
            pair<K, V> _val;
            State _state;
        };

    public:
        HashTable(size_t capacity = 3)
            : _ht(capacity), _size(0)
        {
            for(size_t i = 0; i < capacity; ++i)
                _ht[i]._state = EMPTY;
        }

        bool Insert(const pair<K, V>& val)
        {
            // Check whether the underlying space is sufficient
            // CheckCapacity();   // see the expansion code below
            size_t hashAddr = HashFunc(val.first);
            // size_t startAddr = hashAddr;
            while(_ht[hashAddr]._state != EMPTY)
            {
                if(_ht[hashAddr]._state == EXIST && _ht[hashAddr]._val.first == val.first)
                    return false;

                hashAddr++;
                if(hashAddr == _ht.size())
                    hashAddr = 0;
                /*
                // Came full circle without finding a slot. Note: in a dynamic
                // hash table this case need not be considered -- once the number
                // of elements reaches a threshold, the collision probability
                // rises and the table is grown, so it never becomes full.
                if(hashAddr == startAddr)
                    return false;
                */
            }

            // Insert the element
            _ht[hashAddr]._state = EXIST;
            _ht[hashAddr]._val = val;
            _size++;
            return true;
        }
        int Find(const K& key)
        {
            size_t hashAddr = HashFunc(key);
            size_t startAddr = hashAddr;
            while(_ht[hashAddr]._state != EMPTY)
            {
                if(_ht[hashAddr]._state == EXIST && _ht[hashAddr]._val.first == key)
                    return hashAddr;
                hashAddr++;
                if(hashAddr == _ht.size())
                    hashAddr = 0;
                if(hashAddr == startAddr)   // wrapped all the way around
                    break;
            }
            return -1;                      // not found
        }
        bool Erase(const K& key)
        {
            int index = Find(key);
            if(-1 != index)
            {
                _ht[index]._state = DELETE; // pseudo-delete: mark, don't remove
                _size--;
                return true;
            }
            return false;
        }
        size_t Size()const;
        bool Empty() const;
        void Swap(HashTable<K, V>& ht);
    private:
        size_t HashFunc(const K& key)
        {
            return key % _ht.size();
        }
    private:
        vector<Elem> _ht;
        size_t _size;
    };
    

    Thinking: under what circumstances should the hash table grow, and how?

    The load factor of a hash table is defined as: α = number of elements filled into the table / length of the hash table
    α indicates how full the hash table is. Since the table length is fixed, α is proportional to the number of elements filled in: the larger α is, the fuller the table and the likelier collisions become; conversely, the smaller α is, the emptier the table and the less likely collisions are. In fact, the average search length of a hash table is a function of the load factor α, though the function differs between collision-handling methods.

    For the open addressing method, the load factor is a particularly important parameter and should be strictly limited below 0.7-0.8. Above 0.8, CPU cache misses during table lookup rise along an exponential curve. Therefore some hash libraries that use open addressing, such as the Java system library, limit the load factor to 0.75 and resize the hash table once it is exceeded.

    void CheckCapacity()
    {
        if(_size * 10 / _ht.size() >= 7)   // grow once the load factor reaches 0.7
        {
            HashTable<K, V> newHt(GetNextPrime(_ht.size()));
            for(size_t i = 0; i < _ht.size(); ++i)
            {
                if(_ht[i]._state == EXIST)
                    newHt.Insert(_ht[i]._val);   // re-insert rehashes into the new table
            }

            Swap(newHt);
        }
    }
    

    Advantage of linear probing: the implementation is very simple.
    Disadvantage of linear probing: once hash collisions occur, the colliding elements pile up in a contiguous run, easily producing data "clustering": different keys occupy each other's free slots, so locating a given key takes many comparisons and search efficiency drops. How can this be alleviated?

quadratic probing

The defect of linear probing is that conflicting data pile up together, which comes from how it finds the next empty position: one slot after another. To avoid this, quadratic probing finds the next empty position with: H_i = (H_0 + i^2) % m, or H_i = (H_0 - i^2) % m, where i = 1, 2, 3, ..., H_0 is the position obtained by applying the hash function Hash(x) to the element's key, and m is the size of the table.
For the earlier example, inserting 44 again collides; with quadratic probing the situation becomes:
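H_0 = 44 % 10 = 4 is occupied by 4; H_1 = (4 + 1^2) % 10 = 5 is occupied by 5; H_2 = (4 + 2^2) % 10 = 8 is empty, so 44 lands at index 8, reaching an empty slot in two probes instead of linear probing's four:

index:  0  1  2  3  4  5  6  7  8  9
value:     1        4  5  6  7  44 9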

​Research shows that when the table length is a prime and the load factor α does not exceed 0.5, a new entry can definitely be inserted, and no position is probed twice. Therefore, as long as half of the positions in the table are empty, the table can never fill up. The fullness of the table may be ignored when searching, but when inserting the load factor α must be kept at or below 0.5; if it would exceed that, the table must be grown.
Hence the biggest drawback of closed hashing: its space utilization is relatively low, which is also the defect of hashing in general.

So is there a solution?

open hash

The concept of open hashing

The open hashing method is also called the chaining method (separate chaining, or the open chain method). First the hash function is applied to the key set to compute hash addresses; keys with the same address belong to the same sub-collection, and each sub-collection is called a bucket. The elements of a bucket are linked in a singly linked list, and the head node of each list is stored in the hash table.

Each bucket in open hashing contains exactly the elements that collided at the same hash address, as the reconstructed example below shows.
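
For the running example {1, 7, 6, 4, 5, 9} plus 44, with hash(key) = key % 10 and head insertion, the buckets would look like this (figure reconstructed as text):

_ht[1] -> 1
_ht[4] -> 44 -> 4
_ht[5] -> 5
_ht[6] -> 6
_ht[7] -> 7
_ht[9] -> 9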

open hash implementation

template<class V>
    struct HashBucketNode
    {
        HashBucketNode(const V& data)
            : _pNext(nullptr), _data(data)
            {}
        HashBucketNode<V>* _pNext;
        V _data;
    };
// Keys in the hash bucket implemented here are unique
template<class V>
    class HashBucket
    {
        typedef HashBucketNode<V> Node;
        typedef Node* PNode;
        public:
        HashBucket(size_t capacity = 3)
            : _size(0)
            { 
                _ht.resize(GetNextPrime(capacity), nullptr);
            }

        // Elements in the hash bucket must not repeat
        PNode Insert(const V& data)
        {
            // Check whether the table needs to grow
            // _CheckCapacity();   // see the expansion code below

            // 1. Compute the bucket number for the element
            size_t bucketNo = HashFunc(data);

            // 2. Check whether the element is already in the bucket
            PNode pCur = _ht[bucketNo];
            while(pCur)
            {
                if(pCur->_data == data)
                    return pCur;

                pCur = pCur->_pNext;
            }

            // 3. Insert the new element at the head of the bucket's list
            pCur = new Node(data);
            pCur->_pNext = _ht[bucketNo];
            _ht[bucketNo] = pCur;
            _size++;
            return pCur;
        }

        // Delete the element equal to data (data does not repeat);
        // return the node that followed the deleted one
        PNode Erase(const V& data)
        {
            size_t bucketNo = HashFunc(data);
            PNode pCur = _ht[bucketNo];
            PNode pPrev = nullptr, pRet = nullptr;

            while(pCur)
            {
                if(pCur->_data == data)
                {
                    if(pCur == _ht[bucketNo])
                        _ht[bucketNo] = pCur->_pNext;
                    else
                        pPrev->_pNext = pCur->_pNext;

                    pRet = pCur->_pNext;
                    delete pCur;
                    _size--;
                    return pRet;
                }
                pPrev = pCur;           // advance, or the loop never terminates
                pCur = pCur->_pNext;
            }

            return nullptr;
        }

        PNode Find(const V& data);
        size_t Size()const;
        bool Empty()const;
        void Clear();
        size_t BucketCount()const;
        void Swap(HashBucket<V>& ht);
        ~HashBucket();
        private:
        size_t HashFunc(const V& data)
        {
            return data % _ht.size();
        }
        private:
        vector<PNode> _ht;
        size_t _size;    // number of valid elements in the table
    };

open hash capacity increase

The number of buckets is fixed, so as elements keep being inserted, the number of elements per bucket keeps growing. In extreme cases some bucket may hold a great many list nodes, which hurts the hash table's performance. So under certain conditions the hash table's capacity must be increased. What should that condition be?

The best case for open hashing is exactly one node per bucket; beyond that point, every further insertion causes a hash collision. Therefore, when the number of elements equals the number of buckets, the hash table can be grown.

void _CheckCapacity()
{
    size_t bucketCount = BucketCount();
    if(_size == bucketCount)
    {
        HashBucket<V> newHt(bucketCount);
        for(size_t bucketIdx = 0; bucketIdx < bucketCount; ++bucketIdx)
        {
            PNode pCur = _ht[bucketIdx];
            while(pCur)
            {
                // Detach the node from the old table
                _ht[bucketIdx] = pCur->_pNext;

                // Relink the node into the new table (no reallocation needed)
                size_t bucketNo = newHt.HashFunc(pCur->_data);
                pCur->_pNext = newHt._ht[bucketNo];
                newHt._ht[bucketNo] = pCur;
                pCur = _ht[bucketIdx];
            }
        }

        newHt._size = _size;
        this->Swap(newHt);
    }
}

Open hash thinking

  1. Only elements whose keys are integers can be stored; how do we handle other key types?

    // The hash function uses the division-remainder method, so the key being
    // modded must be an integer; here we provide functors that convert a key
    // into an integer.
    // Integer data needs no conversion
    template<class T>
        class DefHashF
        {
            public:
            size_t operator()(const T& val)
            {
                return val;
            }
        };
    // String keys must be converted into an integer (BKDR string hash)
    class Str2Int
    {
        public:
        size_t operator()(const string& s)
        {
            const char* str = s.c_str();
            unsigned int seed = 131; // 31 131 1313 13131 131313
            unsigned int hash = 0;
            while (*str)
            {
                hash = hash * seed + (*str++);
            }
    
            return (hash & 0x7FFFFFFF);
        }
    };
    // For simplicity, comparison is bound directly to the element here
    template<class V, class HF>
        class HashBucket
        {
            // ……
            private:
            size_t HashFunc(const V& data)
            {
                return HF()(data.first) % _ht.size();
            }
        };
    
  2. In addition, for the division-remainder method it is best to mod by a prime. How can we quickly get, each time, a prime roughly double the previous one?

      size_t GetNextPrime(size_t prime)
      {
          const int PRIMECOUNT = 28;
          static const size_t primeList[PRIMECOUNT] =
          {
              53,         97,         193,        389,        769,
              1543,       3079,       6151,       12289,      24593,
              49157,      98317,      196613,     393241,     786433,
              1572869,    3145739,    6291469,    12582917,   25165843,
              50331653,   100663319,  201326611,  402653189,  805306457,
              1610612741, 3221225473, 4294967291
          };
          size_t i = 0;
          for (; i < PRIMECOUNT; ++i)
          {
              if (primeList[i] > prime)
                  return primeList[i];
          }
          // prime is already at or beyond the largest entry: return the largest
          return primeList[PRIMECOUNT - 1];
      }
      

    For a more specific description, please refer to this blog ☞String Hash Algorithm

Open and Closed Hash Comparison

Applying the chaining method to handle overflow requires adding a link pointer per node, which seems to increase storage overhead.

In fact:
since the open addressing method must keep a large amount of free space to guarantee search efficiency (e.g. the quadratic probing method requires load factor α <= 0.7), and a table entry is much larger than a pointer, the chaining method actually saves storage space compared with open addressing.

Application of hash

bitmap

A bitmap uses each bit to store a state. It suits scenarios with huge amounts of non-repeating data and is usually used to judge whether a given datum exists.

bitmap application

application one

Given 4 billion unique unsigned integers, not sorted, and given an unsigned integer, how do you quickly determine whether that number is among the 4 billion? [Tencent]

what to do?

First analyze the problem: 1G = 1024MB = 1024*1024KB = 1024*1024*1024 bytes ≈ 1 billion bytes, so 4 billion non-repeating unsigned integers occupy about 16G (4 billion × 4 bytes).

Method 1: a balanced search tree or a hash table. But the amount of data is far too large: even though lookups are fast, the data cannot be held in memory

Method 2: external sorting, then binary search. As before, the data is too large and can only sit on disk, where binary search is hard to support and inefficient

Method 3: a bitmap, i.e. direct addressing: each value maps to one bit, 1 meaning present and 0 meaning absent
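
A quick check of the arithmetic: storing 4 billion 4-byte integers directly takes 4,000,000,000 × 4 B ≈ 16G, which does not fit in memory, while a bitmap covering the entire unsigned-int range needs only 2^32 bits = 2^29 bytes = 512MB, which fits easily.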

We use the smallest addressable unit, char (one byte), as the building block.

How do we map a datum to its position?

Design a function.

Idea:

Divide by 8 to find which char the bit lives in; take the remainder mod 8 to find which bit within that char.

To set that bit to 1, OR the char with (1 << j), i.e. 1 shifted left by j positions; by the nature of OR, any bit OR'd with 1 becomes 1.

void set(size_t x)
{
    size_t i = x / 8;   // which char the bit lives in
    size_t j = x % 8;   // which bit within that char

    _byt[i] |= (1 << j);
}

How do we delete?

Find the same position and clear it to 0: shift 1 left by j, negate it, then AND, using the nature of AND (any bit AND 0 becomes 0).

void reset(size_t x)
{
    size_t i = x / 8;
    size_t j = x % 8;

    _byt[i] &= ~(1 << j);
}

How do we check whether a datum is present?

Find the position and AND the char with (1 << j).

bool test(size_t x)
{
    size_t i = x / 8;
    size_t j = x % 8;

    return _byt[i] & (1 << j);
}
Application two
  1. Given 10 billion integers, design an algorithm to find the integers that appear only once

  2. A file holds 10 billion ints and we have 1G of memory; design an algorithm to find all integers that appear no more than 2 times

We can modify the previous solution and use two bits to represent the state.

You could use a single array:

0 times -> 00

1 time -> 01

2 times -> 10

3 times and above -> 11

But that is fiddly; it is simpler to use two bit arrays in one-to-one correspondence.

Code:

template<size_t N>
class bitset
{
    public:
    bitset()
    {
        _bits.resize(N/8+1, 0);
    }

    void set(size_t x)
    {
        size_t i = x / 8;
        size_t j = x % 8;

        _bits[i] |= (1 << j);
    }

    void reset(size_t x)
    {
        size_t i = x / 8;
        size_t j = x % 8;

        _bits[i] &= ~(1 << j);
    }

    bool test(size_t x)
    {
        size_t i = x / 8;
        size_t j = x % 8;


        return _bits[i] & (1 << j);
    }

    private:
    vector<char> _bits;
};
template<size_t N>
class twobitset
{
    public:
    void set(size_t x)
    {
        bool inset1 = _bs1.test(x);   // query bit 1 for x
        bool inset2 = _bs2.test(x);   // query bit 2 for x

        // 00
        if (inset1 == false && inset2 == false)
        {
            // -> 01
            _bs2.set(x);
        }
        else if (inset1 == false && inset2 == true)
        {
            // ->10
            _bs1.set(x);
            _bs2.reset(x);
        }
        // The second problem works the same way: also enable the 10 -> 11
        // transition below; integers appearing no more than twice are then
        // exactly those left in state 00, 01 or 10.
        /*else if (inset1 == true && inset2 == false)
        {
            // ->11
            _bs1.set(x);
            _bs2.set(x);
        }*/
    }

    void print_once_num()
    {
        for (size_t i = 0; i < N; ++i)
        {
            if (_bs1.test(i) == false && _bs2.test(i) == true)
            {
                // state 01: the number appears exactly once
                cout << i << endl;
            }
        }
    }

    private:
    bitset<N> _bs1;
    bitset<N> _bs2;
};
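
A minimal usage sketch (with made-up sample data) to show the idea:

int main()
{
    int a[] = { 3, 4, 5, 3, 4, 3 };   // 3 appears 3 times, 4 twice, 5 once
    twobitset<100> tbs;               // values assumed to lie in [0, 100)
    for (auto e : a)
        tbs.set(e);
    tbs.print_once_num();             // prints 5, the only value seen once
    return 0;
}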
Application three

Given two files, each containing 10 billion integers, and only 1G of memory, how do we find the intersection of the two files? (approximately)

Same idea as before: build two bit arrays, one per file; the positions where both are 1 form the intersection, as sketched below
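
A minimal sketch of that idea, reusing the bitset above, assuming all values fit in [0, N) and that the hypothetical ranges fileA and fileB yield each file's integers:

bitset<N> bsA, bsB;
for (auto x : fileA) bsA.set(x);      // mark everything in file A
for (auto x : fileB) bsB.set(x);      // mark everything in file B
for (size_t i = 0; i < N; ++i)
    if (bsA.test(i) && bsB.test(i))   // set in both => in the intersection
        cout << i << endl;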

Bitmap Features
  1. Fast and space-saving; with direct addressing there are no collisions
  2. Relatively limited: it can only map integers

bloom filter

The motivation for Bloom filters

When we read news in a client, it continuously recommends new content and deduplicates on every push, removing content we have already seen. Here comes the question: how does the news client's recommendation system implement push deduplication? The server records all the history a user has seen; when recommending news, it filters each user's history and drops records that already exist. How can those records be looked up quickly?

  1. Use a hash table to store user records. Disadvantage: wastes space
  2. Use a bitmap to store user records. Disadvantage: bitmaps generally handle only integers; if the content ID is a string, they cannot be used directly
  3. Combine hashing with the bitmap, i.e. the Bloom filter

Bloom filter concept

The Bloom filter is a compact and clever probabilistic data structure proposed by Burton Howard Bloom in 1970. It features efficient insertion and query, and it can tell you that something "definitely does not exist, or may exist". It uses multiple hash functions to map one datum into a bitmap structure. This approach not only improves query efficiency but also saves a great deal of memory space.

Bloom filter insertion

Insert into the Bloom filter: "baidu"

Mapping one value to several positions like this lowers the probability of a collision-caused misjudgment.

But how are these positions computed?

There are many options; listed here are only some common string hash functions:

struct BKDRHash
{
    size_t operator()(const string& s)
    {
        // BKDR
        size_t value = 0;
        for (auto ch : s)
        {
            value *= 31;
            value += ch;
        }
        return value;
    }
};
struct APHash
{
    size_t operator()(const string& s)
    {
        size_t hash = 0;
        for (size_t i = 0; i < s.size(); i++)
        {
            if ((i & 1) == 0)
            {
                hash ^= ((hash << 7) ^ s[i] ^ (hash >> 3));
            }
            else
            {
                hash ^= (~((hash << 11) ^ s[i] ^ (hash >> 5)));
            }
        }
        return hash;
    }
};
struct DJBHash
{
    size_t operator()(const string& s)
    {
        size_t hash = 5381;
        for (auto ch : s)
        {
            hash += (hash << 5) + ch;
        }
        return hash;
    }
};
template<size_t N,
size_t X = 5,
class K = string,
class HashFunc1 = BKDRHash,
class HashFunc2 = APHash,
class HashFunc3 = DJBHash>
    class BloomFilter
    {
        public:
        void Set(const K& key)
        {
            size_t len = X*N;
            size_t index1 = HashFunc1()(key) % len;
            size_t index2 = HashFunc2()(key) % len;
            size_t index3 = HashFunc3()(key) % len;
            /* 	cout << index1 << endl;
				cout << index2 << endl;
				cout << index3 << endl<<endl;*/
            _bs.set(index1);
            _bs.set(index2);
            _bs.set(index3);
        }
        bool Test(const K& key)
        {
            size_t len = X*N;
            size_t index1 = HashFunc1()(key) % len;
            if (_bs.test(index1) == false)
                return false;
            size_t index2 = HashFunc2()(key) % len;
            if (_bs.test(index2) == false)
                return false;
            size_t index3 = HashFunc3()(key) % len;
            if (_bs.test(index3) == false)
                return false;
            return true;  // may be a false positive
        }
        // Deletion is not supported: clearing one key's bits may affect other keys.
        void Reset(const K& key);
        private:
        bitset<X*N> _bs;
    };
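
A brief usage sketch (hypothetical keys) of the filter above:

BloomFilter<100> bf;                  // expects ~100 keys, X*N = 500 bits
bf.Set("baidu");
bf.Set("tencent");
cout << bf.Test("baidu") << endl;     // 1: may exist (and in fact does)
cout << bf.Test("alibaba") << endl;   // usually 0: definitely absent;
                                      // a 1 here would be a false positive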

Bloom filter lookup

The idea of the Bloom filter is to map one element into the bitmap with multiple hash functions, so every mapped bit must be 1. Searching therefore works as follows: compute each hash value and check whether the corresponding bit is zero. If any one of them is zero, the element is definitely not in the table; otherwise it may be.
Note: if the Bloom filter says an element does not exist, then it definitely does not exist. If it says the element exists, the element only may exist, because the hash functions can misjudge.
For example: when looking up "alibaba" in the Bloom filter, suppose the three hash functions yield the positions 1, 3, 7, and these happen to overlap with bits set by other elements. The Bloom filter then reports that the element exists even though it does not.

Bloom filter removal

Bloom filters cannot directly support deletion, because deleting one element may affect others.

For example: to delete the element "tencent" in the figure above, directly zeroing its bits would also delete "baidu", because the two elements happen to overlap in bits computed by several hash functions.

A deletion-capable variant: expand each bit of the Bloom filter into a small counter. When inserting an element, increment the k counters (the addresses computed by the k hash functions) by one; when deleting, decrement those k counters by one. This adds a delete operation at the cost of several times more storage.
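
A minimal sketch of that counting variant (not the author's code), assuming 8-bit counters and reusing the three hash functors defined above:

template<size_t N, size_t X = 5>
class CountingBloomFilter
{
    public:
    CountingBloomFilter() : _counts(X * N, 0) {}

    void Set(const string& key)
    {
        // increment the counter at every hashed position
        ++_counts[BKDRHash()(key) % (X * N)];
        ++_counts[APHash()(key) % (X * N)];
        ++_counts[DJBHash()(key) % (X * N)];
    }

    void Reset(const string& key)
    {
        // decrement instead of clearing, so positions shared
        // with other keys keep a positive count
        --_counts[BKDRHash()(key) % (X * N)];
        --_counts[APHash()(key) % (X * N)];
        --_counts[DJBHash()(key) % (X * N)];
    }

    bool Test(const string& key)
    {
        return _counts[BKDRHash()(key) % (X * N)] > 0
            && _counts[APHash()(key) % (X * N)] > 0
            && _counts[DJBHash()(key) % (X * N)] > 0;
    }

    private:
    vector<unsigned char> _counts;   // 8-bit counters: may wrap if a slot
                                     // is incremented more than 255 times
};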

Defects of the counting approach:

  1. It still cannot confirm whether an element is actually in the Bloom filter
  2. Counters may wrap around (overflow)

Advantages of Bloom filters

  1. The time complexity of adding and querying elements is: O(K), (K is the number of hash functions, generally relatively small), regardless of the size of the data
  2. Hash functions have nothing to do with each other, which is convenient for hardware parallel operation
  3. Bloom filter does not need to store the element itself, which has great advantages in some occasions with strict confidentiality requirements
  4. When able to withstand a certain amount of misjudgment, the Bloom filter has a great space advantage over other data structures
  5. When the amount of data is large, the Bloom filter can represent the full set, and other data structures cannot
  6. Bloom filters using the same set of hash functions can perform intersection, union, and difference operations

Bloom filter defect

  1. There is a false positive rate: the filter cannot determine with certainty that an element is in the set (remedy: keep a whitelist storing data that might be misjudged)
  2. cannot get the element itself
  3. Elements cannot be removed from bloom filters in general
  4. If counting is used to delete, there may be a problem of counting wrapping

Application of Bloom filter

Given two files, each with 10 billion queries, and only 1G of memory, how do we find the intersection of the two files? Give an exact algorithm.

As computed before, 1G is about 1 billion bytes.

Assuming each query is 30 bytes, 10 billion queries need about 300 billion bytes, roughly 300G.

Call the two files A and B.

Read the queries of file A (and likewise B) sequentially, compute i = Hash(query) % 1000, and write the query into small file A_i (resp. B_i).

Then find the intersection pairwise, loading matching pairs of small files into two in-memory sets.

Why intersect A0 with B0, A1 with B1, rather than cutting the files evenly?

Because the same hash function is used, identical queries in A and B necessarily land in small files with the same number.

Could a small file still be too large for memory? Yes, possibly. In that case repeat the steps on that small file as a sub-problem, but with a different hash function.
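
A rough sketch of the splitting step under these assumptions: queries arrive one per line, AppendTo(path, line) is a hypothetical helper that appends a line to the named file, and the BKDRHash functor from above is reused:

// Split one big file into 1000 small files by hash bucket.
// Identical queries always get the same bucket number, so A_i
// only ever needs to be intersected with B_i.
void Split(istream& in, const string& prefix)
{
    string query;
    while (getline(in, query))
    {
        size_t i = BKDRHash()(query) % 1000;
        AppendTo(prefix + "_" + to_string(i), query);   // hypothetical helper
    }
}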

hash splitting

Given a log file larger than 100G in which IP addresses are stored, design an algorithm to find the IP address with the most occurrences. Under the same constraints as the previous question, how do we find the top-K IPs?

Finding the most frequent IP address follows the same idea as before.

Read each IP and cut the file into 500 parts: compute each IP's hash value i = Hash(ip) % 500, and that IP goes into the i-th small file.

Key point: all occurrences of the same IP land in the same small file.

For each small file in turn, use a map<string, int> to count occurrences.

For top-K, build a min-heap holding K <ip, count> pairs.
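
A minimal sketch of that last step, assuming countMap holds one small file's <ip, count> tallies and K is the number of results wanted:

// Keep the K most frequent IPs: a size-K min-heap ordered by count,
// so the least frequent of the current top K sits on top and is
// evicted whenever a larger count arrives.
priority_queue<pair<int, string>,
               vector<pair<int, string>>,
               greater<pair<int, string>>> minHeap;
for (const auto& kv : countMap)       // kv.first = ip, kv.second = count
{
    minHeap.push({ kv.second, kv.first });
    if (minHeap.size() > K)
        minHeap.pop();                // drop the least frequent
}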
