C++ | Bitmaps and Bloom Filters

Table of contents

Preface

1. Bitmap

1. Introduction of bitmaps

2. Implementation of bitmap

(1) Basic structure

(2) Constructor

(3) Insert data

(4) Delete data

(5) Whether it exists

3. Advantages and disadvantages of bitmaps

4. Application of bitmap

2. Bloom filter

1. Introduction of Bloom filter

2. Implementation of Bloom filter

(1) Framework of Bloom filter

(2) Insertion of Bloom filter

(3) Searching for Bloom filters

(4) About deletion of Bloom filter

3. Advantages and disadvantages of Bloom filter

3. Hash cutting


Preface

        This article explains bitmaps and their applications, Bloom filters and their applications, and techniques for processing massive data sets, with a focus on recognizing the scenarios in which bitmaps and Bloom filters are the right tool.

1. Bitmap

1. Introduction of bitmaps

        First, let's look at an interview question from Tencent:

        1. Given 4 billion distinct, unsorted unsigned integers, how can we quickly determine whether a given unsigned integer is among them?

If you encounter this problem, how would you deal with it?

Idea 1: Put the data into a set/unordered_set.

        Let's analyze it. There are 4 billion integers at 4 bytes each, 16 billion bytes in total. Since 1 KB = 1024 bytes, 1 MB = 1024 KB, and 1 GB = 1024 MB, we get 1 GB = 1024 * 1024 * 1024 bytes, which is roughly 1 billion bytes, so 16 billion bytes is roughly 16 GB. And that is only the raw data: the container also needs memory for its node pointers and bookkeeping. This solution is therefore impractical.

Idea 2: Each value previously occupied 4 bytes. Can we use less memory? The minimum is a single bit per value, because we only need to record whether the value is present. Using the idea of hash mapping, we map each value directly to a bit position, where 1 means the value is present and 0 means it is absent. To store 7, for example, we set bit 7 to 1. Each value now costs 1 bit instead of 32, so the memory drops to about 16 GB / 32 = 0.5 GB (exactly 2^32 bits = 512 MB to cover every possible 32-bit unsigned value). This is the bitmap structure we implement next.
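The arithmetic behind the two ideas can be checked with a few lines of C++. This is only a sketch: the function names are made up for illustration and do not appear in the article's code.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold `count` values as plain 32-bit integers.
constexpr std::uint64_t bytes_as_ints(std::uint64_t count) {
    return count * sizeof(std::uint32_t);
}

// Bytes needed for a bitmap with one bit per value in [0, range).
constexpr std::uint64_t bytes_as_bitmap(std::uint64_t range) {
    return (range + 7) / 8;  // round up to whole bytes
}

// Self-check of the numbers used in the text.
inline void check_memory_estimates() {
    // 4 billion ints -> 16 billion bytes.
    assert(bytes_as_ints(4000000000ULL) == 16000000000ULL);
    // One bit for each of the 2^32 possible values -> 512 MiB = 0.5 GiB.
    assert(bytes_as_bitmap(1ULL << 32) == 512ULL * 1024 * 1024);
}
```

Note that the bitmap size depends on the range of possible values, not on how many values are actually stored.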

2. Implementation of bitmap

(1) Basic structure

First we write the basic framework, as shown below:

	template<size_t N>
	class bitset
	{
	public:

	private:
		std::vector<char> _bits;
	};
(2) Constructor

The constructor's main job is to initialize the vector. How much space does a bitmap for N values need? Each byte holds 8 bits, so N / 8 bytes is not enough when N is not a multiple of 8: if N is 14, N / 8 gives 1 byte (8 bits), but we need 2 bytes. We therefore allocate N / 8 + 1 bytes:

		// Constructor
		bitset()
		{
			size_t len = N / 8 + 1;
			_bits.resize(len, 0);
		}
(3) Insert data

Inserting a value into the bitmap means setting its bit to 1. val / 8 gives the index of the byte that holds the bit, and val % 8 gives the position of the bit within that byte. We then set the bit with a left shift and a bitwise OR:

		// Insert
		void set(size_t val)
		{
			// Index of the byte that holds the bit
			size_t x = val / 8;
			// Position of the bit inside that byte
			size_t y = val % 8;
			// Set the bit to 1
			_bits[x] |= (1 << y);
		}
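To make the index arithmetic concrete, here is a small sketch that separates the two steps: locating the bit and setting it. The names BitPos, locate, and with_bit_set are hypothetical helpers introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Where a value lives inside a byte-array bitmap.
struct BitPos {
    std::size_t byte;  // index of the byte: val / 8
    unsigned bit;      // bit inside that byte: val % 8
};

inline BitPos locate(std::size_t val) {
    return BitPos{val / 8, static_cast<unsigned>(val % 8)};
}

// Returns `b` with the given bit (0..7) set to 1; other bits are unchanged.
inline unsigned char with_bit_set(unsigned char b, unsigned bit) {
    assert(bit < 8);
    return static_cast<unsigned char>(b | (1u << bit));
}
```

For example, the value 13 is located in byte 1 (13 / 8) at bit 5 (13 % 8), and setting an already-set bit leaves the byte unchanged.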
(4) Delete data

Deleting a value means clearing its bit to 0. The byte index and the bit position are computed exactly as for insertion; the bit is then cleared with a left shift, a bitwise NOT, and a bitwise AND:

		// Delete
		void reset(size_t val)
		{
			// Index of the byte that holds the bit
			size_t x = val / 8;
			// Position of the bit inside that byte
			size_t y = val % 8;
			// Clear the bit to 0
			_bits[x] &= ~(1 << y);
		}
(5) Whether it exists

Besides insertion and deletion, we need an interface to test whether a value is present; it isolates the value's bit with a bitwise AND:

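		// Test whether a value is present
		bool test(size_t val)
		{
			// Index of the byte that holds the bit
			size_t x = val / 8;
			// Position of the bit inside that byte
			size_t y = val % 8;
			// 0 means absent, non-zero means present
			return _bits[x] & (1 << y);
		}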
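Putting the pieces together, the following is a minimal self-contained sketch of the same idea. It is not the article's exact class: the name Bitmap is hypothetical, the storage uses unsigned char rather than char, and bounds-checking asserts are added as an assumption about how one might guard against out-of-range values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-size bitmap over the values [0, N).
template <std::size_t N>
class Bitmap {
public:
    Bitmap() : bits_(N / 8 + 1, 0) {}

    void set(std::size_t val) {
        assert(val < N);
        bits_[val / 8] |= static_cast<unsigned char>(1u << (val % 8));
    }

    void reset(std::size_t val) {
        assert(val < N);
        bits_[val / 8] &= static_cast<unsigned char>(~(1u << (val % 8)));
    }

    bool test(std::size_t val) const {
        assert(val < N);
        return (bits_[val / 8] & (1u << (val % 8))) != 0;
    }

private:
    std::vector<unsigned char> bits_;
};

// Usage: insert two values, check them, then delete one.
inline bool bitmap_demo() {
    Bitmap<100> bm;
    bm.set(7);
    bm.set(13);
    bool ok = bm.test(7) && bm.test(13) && !bm.test(8);
    bm.reset(7);
    return ok && !bm.test(7) && bm.test(13);
}
```

Using unsigned char avoids any question about the sign of bit 7, and marking test as const lets it be called on a const bitmap.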

3. Advantages and disadvantages of bitmaps

The advantages and disadvantages of bitmaps are as follows:

Advantages:

1. Lookups are fast: whether a value is present can be determined in O(1) time;

2. It saves memory: 4 billion values take only about 0.5 GB.

Disadvantages:

1. It only works for integers.

4. Application of bitmap

Bitmaps are mainly used in the following scenarios:

1. Quickly checking whether a value is in a collection

2. Sorting + deduplication

3. Computing the intersection and union of two sets

4. Marking disk blocks in an operating system
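As a small illustration of the sorting + deduplication scenario, the sketch below marks every input value in a bit sequence and then reads the marks back in ascending order. It uses std::vector<bool> as a stand-in for a bitmap, and the function name sort_unique is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the distinct values of `input` in ascending order.
// Assumes every value is smaller than `range`.
inline std::vector<std::size_t> sort_unique(const std::vector<std::size_t>& input,
                                            std::size_t range) {
    std::vector<bool> seen(range, false);
    for (std::size_t v : input) {
        assert(v < range);
        seen[v] = true;  // duplicates set the same bit again
    }
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < range; ++i) {
        if (seen[i]) out.push_back(i);  // ascending scan yields sorted output
    }
    return out;
}
```

The cost is one pass over the input plus one pass over the whole value range, so this pays off when the range is not much larger than the amount of data.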

Next, we use some interview questions to illustrate the application of bitmaps:

1. Given 10 billion integers, design an algorithm to find the integers that appear only once.

Idea: The question mentions 10 billion values, but there must be many duplicates, because a 32-bit unsigned integer has only about 4.29 billion distinct values. We can design a class with two bitmaps so that each value owns two bits: 00 means the value has not appeared, 01 means it has appeared exactly once, and 10 means it has appeared twice or more. Finally, we traverse the bitmaps and output every value whose state is 01; these are the values that appear exactly once.

	template<size_t N>
	class twobits1
	{
	public:
		void set(size_t val)
		{
			// 00 -> 01
			if (_bits1.test(val) == false && _bits2.test(val) == false)
			{
				_bits2.set(val);
			}// 01 -> 10
			else if (_bits1.test(val) == false && _bits2.test(val) == true)
			{
				_bits1.set(val);
				_bits2.reset(val);
			}
			// State 10 (twice or more) needs no further handling
		}

		void print()
		{
			for (size_t i = 0; i < N; i++)
			{
				if (_bits1.test(i) == false && _bits2.test(i) == true)
					std::cout << i << std::endl;
			}
		}
	private:
		bitset<N> _bits1;
		bitset<N> _bits2;
	};
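The same 00 -> 01 -> 10 state machine can be tried on a small input. The sketch below is a simplified stand-in for twobits1: it uses two std::vector<bool> objects instead of the article's bitset, and the function name appear_once is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns, in ascending order, the values that occur exactly once in `data`.
// Assumes every value is smaller than `range`.
inline std::vector<std::size_t> appear_once(const std::vector<std::size_t>& data,
                                            std::size_t range) {
    std::vector<bool> hi(range, false);  // high bit of the 2-bit state
    std::vector<bool> lo(range, false);  // low bit of the 2-bit state
    for (std::size_t v : data) {
        assert(v < range);
        if (!hi[v] && !lo[v]) {          // 00 -> 01: first occurrence
            lo[v] = true;
        } else if (!hi[v] && lo[v]) {    // 01 -> 10: second occurrence
            hi[v] = true;
            lo[v] = false;
        }                                // 10 stays 10
    }
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < range; ++i) {
        if (!hi[i] && lo[i]) out.push_back(i);  // state 01
    }
    return out;
}
```

For the input 1 2 2 3 3 3 7, only 1 and 7 end in state 01, so they are the values returned.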

2. Given two files, each with 10 billion integers, and only 1 GB of memory, how do we find the intersection of the two files?

Idea 1: Create one bitmap. Traverse file 1 and set the bit for each value, then traverse file 2 and test each value against the bitmap; values that are present belong to the intersection. There is one problem: if file 1 contains 1 2 4 and file 2 contains 1 2 1 1 1, traversing file 2 outputs 1 repeatedly, so the result must be deduplicated. We can do that during the traversal of file 2: the first time a value is found in the bitmap, we output it and reset its bit to 0. Later occurrences of the same value then fail the test and are not output again.

Idea 2: Create two bitmaps. Store file 1 in bitmap 1 and file 2 in bitmap 2, then iterate from 0 to the maximum unsigned integer value; every position whose bit is 1 in both bitmaps belongs to the intersection. Two bitmaps covering all 32-bit values take 2 * 512 MB = 1 GB, so this idea uses the entire memory budget, whereas Idea 1 needs only 512 MB.
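Both ideas can be sketched on in-memory vectors that stand in for the two files. The function names are hypothetical, std::vector<bool> replaces the bitmap, and the small `range` parameter replaces the full unsigned integer range.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Idea 1: one bitmap; clearing the bit on the first hit removes duplicates.
inline std::vector<std::size_t> intersect_one_bitmap(
        const std::vector<std::size_t>& file1,
        const std::vector<std::size_t>& file2, std::size_t range) {
    std::vector<bool> seen(range, false);
    for (std::size_t v : file1) { assert(v < range); seen[v] = true; }
    std::vector<std::size_t> out;
    for (std::size_t v : file2) {
        assert(v < range);
        if (seen[v]) {
            out.push_back(v);
            seen[v] = false;  // later copies of v are no longer reported
        }
    }
    return out;
}

// Idea 2: two bitmaps; a value is common when both bits are set.
inline std::vector<std::size_t> intersect_two_bitmaps(
        const std::vector<std::size_t>& file1,
        const std::vector<std::size_t>& file2, std::size_t range) {
    std::vector<bool> a(range, false), b(range, false);
    for (std::size_t v : file1) { assert(v < range); a[v] = true; }
    for (std::size_t v : file2) { assert(v < range); b[v] = true; }
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < range; ++i) {
        if (a[i] && b[i]) out.push_back(i);
    }
    return out;
}
```

With file 1 = 1 2 4 and file 2 = 1 2 1 1 1, both functions return 1 2. Idea 1 reports values in the order they first appear in file 2, while Idea 2 reports them in ascending order.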

3. A file has 10 billion ints and we have 1 GB of memory. Design an algorithm to find all integers that appear no more than 2 times.

Idea: As in question 1, we design a class with two bitmaps and define four states: 00 means the value has appeared 0 times, 01 once, 10 twice, and 11 three or more times. After traversing the file and updating the bitmaps, we iterate from 0 to the maximum unsigned integer value and output every value whose state is 01 or 10. This needs only a small modification of the code for question 1; the code is as follows:

	template<size_t N>
	class twobits2
	{
	public:
		void set(size_t val)
		{
			// 00 -> 01
			if (_bits1.test(val) == false && _bits2.test(val) == false)
			{
				_bits2.set(val);
			} // 01 -> 10
			else if (_bits1.test(val) == false && _bits2.test(val) == true)
			{
				_bits1.set(val);
				_bits2.reset(val);
			} // 10 -> 11
			else if(_bits1.test(val) == true && _bits2.test(val) == false)
			{
				_bits1.set(val);
				_bits2.set(val);
			}
		}

		void print()
		{
			for (size_t i = 0; i < N; i++)
			{
				if ((_bits1.test(i) == false && _bits2.test(i) == true)
					|| (_bits1.test(i) == true && _bits2.test(i) == false))
					std::cout << i << std::endl;
			}
		}
	private:
		bitset<N> _bits1;
		bitset<N> _bits2;
	};
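As an alternative layout (not the article's two separate bitmaps), the two state bits of each value can be stored side by side, four values per byte, and treated as a counter that saturates at 3. The names TwoBitCounters and at_most_twice are hypothetical; this is a sketch of the same state machine under that assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Packed array of 2-bit counters that saturate at 3 ("three or more").
class TwoBitCounters {
public:
    explicit TwoBitCounters(std::size_t n) : cells_((n + 3) / 4, 0), n_(n) {}

    unsigned get(std::size_t i) const {
        assert(i < n_);
        return (cells_[i / 4] >> ((i % 4) * 2)) & 3u;
    }

    void increment(std::size_t i) {
        unsigned c = get(i);
        if (c == 3) return;  // already "three or more": stay there
        unsigned shift = static_cast<unsigned>((i % 4) * 2);
        unsigned cleared = cells_[i / 4] & ~(3u << shift);
        cells_[i / 4] = static_cast<unsigned char>(cleared | ((c + 1) << shift));
    }

private:
    std::vector<unsigned char> cells_;
    std::size_t n_;
};

// Values that occur once or twice in `data`, in ascending order.
inline std::vector<std::size_t> at_most_twice(const std::vector<std::size_t>& data,
                                              std::size_t range) {
    TwoBitCounters counts(range);
    for (std::size_t v : data) counts.increment(v);
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < range; ++i) {
        unsigned c = counts.get(i);
        if (c == 1 || c == 2) out.push_back(i);  // states 01 and 10
    }
    return out;
}
```

The memory use is the same as two bitmaps (2 bits per value); the packed form only changes where the bits live.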

2. Bloom filter

1. Introduction of Bloom filter

        First, let's look at a classic interview question:

1. Given two files, each with 10 billion queries, and only 1 GB of memory, how do we find the intersection of the two files?
Give both an exact algorithm and an approximate algorithm.

Idea: A bitmap cannot solve this problem directly, because it only works for integer data and a query is a string. We therefore introduce a new data structure, the Bloom filter. The idea is to convert each string into an integer with a hash function, insert every query of one file into the bitmap, and then look up each query of the other file. In short, a Bloom filter is hash function + bitmap. Because different strings can hash to the same position, the result is approximate, which makes this the approximate algorithm asked for in the question.

String to integer hash function

2. Implementation of Bloom filter

Hash functions map strings to integers, and a good hash function reduces collisions but cannot avoid them entirely. To reduce the impact of collisions further, we use several hash functions to map each string to several positions.

With two hash functions, for example, each string is mapped to two positions. When we search, both positions must be set for the string to be reported as possibly present.

(1) Framework of Bloom filter

        There is another question about the Bloom filter: how many hash functions should be used? This is a mathematical problem; interested readers can consult the article linked below.

An in-depth study of bloom filter hash functions

        The result derived there is k = (m / n) * ln 2, where k is the number of hash functions, m is the length of the Bloom filter in bits, and n is the number of inserted elements. Rearranging gives m = (k / ln 2) * n. With this in mind, we write the following code structure:

	struct BKDRHash
	{
		size_t operator()(const std::string& s)
		{
			size_t hash = 0;
			for(auto ch : s)
			{
				hash = hash * 131 + ch;      
			}
			return hash;
		}
	};

	struct APHash
	{
		size_t operator()(const std::string& s)
		{
			size_t hash = 0;
			for (size_t i = 0; i < s.size(); i++)
			{
				size_t ch = s[i];
				if ((i & 1) == 0)
				{
					hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
				}
				else
				{
					hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
				}
			}
			return hash;
		}
	};

	struct DJBHash
	{
		size_t operator()(const std::string& s)
		{
			if (s.size() == 0)
				return 0;
			size_t hash = 5381;
			for(auto ch : s)
			{
				hash += (hash << 5) + ch;
			}
			return hash;
		}
	};

	template<size_t N, 
		class K = std::string, 
		class HashFunc1 = BKDRHash,
		class HashFunc2 = APHash,
		class HashFunc3 = DJBHash>
	class BloomFilter
	{
	public:

	private:
		static const int _a = 4;  // k / ln 2
		bitset<N * _a> _bits;
	};

        This article uses three hash functions. Readers do not have to use these three; any of the string-to-integer hash functions referenced earlier will do. With k = 3, k / ln 2 is about 4.33; the code rounds it to 4 and names it _a.
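The sizing rule m = (k / ln 2) * n can be checked numerically. The sketch below also evaluates the commonly used approximation for the false positive rate, p = (1 - e^(-k * n / m))^k, which assumes independent, uniformly distributed hash functions; that formula and the function names are additions for illustration and are not part of the article's code.

```cpp
#include <cassert>
#include <cmath>

// Bits per inserted element suggested by m = (k / ln 2) * n.
inline double bits_per_element(double k) {
    return k / std::log(2.0);
}

// Approximate false positive rate for m bits, n elements, k hash functions.
inline double false_positive_rate(double m, double n, double k) {
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Self-check of the numbers for k = 3.
inline void check_bloom_parameters() {
    // k = 3 gives about 4.33 bits per element.
    assert(std::fabs(bits_per_element(3.0) - 4.328) < 0.001);
    // At exactly k / ln 2 bits per element the estimate is (1/2)^3 = 0.125.
    double p_exact = false_positive_rate(bits_per_element(3.0) * 1e6, 1e6, 3.0);
    assert(std::fabs(p_exact - 0.125) < 1e-9);
    // Rounding down to 4 bits per element raises the estimate to about 0.147.
    double p_rounded = false_positive_rate(4.0 * 1e6, 1e6, 3.0);
    assert(p_rounded > 0.14 && p_rounded < 0.15);
}
```

Under this approximation, three hash functions with 4 bits per element give a false positive rate of roughly 15%; spending more bits per element lowers the estimate.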

(2) Insertion of Bloom filter

        We compute each hash of the key, take it modulo the bitmap length to get a position, and set the bit at each position:

		void set(const K& key)
		{
			size_t len = _a * N;
			size_t hashi1 = HashFunc1()(key) % len;
			size_t hashi2 = HashFunc2()(key) % len;
			size_t hashi3 = HashFunc3()(key) % len;

			_bits.set(hashi1);
			_bits.set(hashi2);
			_bits.set(hashi3);
		}
(3) Searching for Bloom filters

        A lookup checks whether every position the key hashes to is set in the bitmap; the code is as follows:

		bool test(const K& key)
		{
			size_t len = _a * N;
			size_t hashi1 = HashFunc1()(key) % len;
			size_t hashi2 = HashFunc2()(key) % len;
			size_t hashi3 = HashFunc3()(key) % len;

			if (_bits.test(hashi1) && _bits.test(hashi2) && _bits.test(hashi3))
			{
				return true;
			}
			else
			{
				return false;
			}
		}

        Note also that a Bloom filter lookup can be wrong when it answers "present", but it is always right when it answers "absent". Why?

        Suppose "biology" was never inserted, but every position it maps to was already set by other strings (hash functions cannot rule out collisions). The lookup then reports that "biology" is in the Bloom filter, which is a false positive, so a "present" answer is not necessarily accurate. On the other hand, any string that was inserted has all of its mapped positions set to 1, so if any one of those positions is 0, the string was definitely never inserted.
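The asymmetry between the two answers can be demonstrated with a deliberately tiny filter. The sketch below is a simplified stand-in for the article's BloomFilter: the class name SmallBloom is hypothetical, it uses two hash functions (BKDR-style and DJB-style multipliers) instead of three, and std::vector<bool> instead of the article's bitset.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class SmallBloom {
public:
    explicit SmallBloom(std::size_t bits) : bits_(bits, false) { assert(bits > 0); }

    void set(const std::string& key) {
        for (std::size_t p : positions(key)) bits_[p] = true;
    }

    // true means "possibly present", false means "definitely absent".
    bool test(const std::string& key) const {
        for (std::size_t p : positions(key)) {
            if (!bits_[p]) return false;
        }
        return true;
    }

private:
    std::array<std::size_t, 2> positions(const std::string& key) const {
        std::size_t h1 = 0, h2 = 5381;
        for (unsigned char ch : key) {
            h1 = h1 * 131 + ch;  // BKDR-style hash
            h2 = h2 * 33 + ch;   // DJB-style hash
        }
        return {{h1 % bits_.size(), h2 % bits_.size()}};
    }

    std::vector<bool> bits_;
};

inline bool bloom_demo() {
    SmallBloom bf(1024);
    bool empty_says_no = !bf.test("apple");  // nothing set yet: definitely absent
    bf.set("apple");
    bf.set("banana");
    bool inserted_found = bf.test("apple") && bf.test("banana");  // no false negatives
    SmallBloom tiny(1);  // with a single bit, every key maps to position 0
    tiny.set("apple");
    bool false_positive = tiny.test("cherry");  // never inserted, yet reported present
    return empty_says_no && inserted_found && false_positive;
}
```

The one-bit filter is an extreme case chosen so that the collision is guaranteed; in a realistically sized filter, false positives are possible but much rarer.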

(4) About deletion of Bloom filter

        Bloom filters generally do not provide a deletion interface, because bit positions are shared between elements.

        If the Bloom filter provided a deletion interface and we deleted "language", the bits that "language" maps to would be set to 0. Any of those bits that is shared with "English" or "mathematics" would be cleared too, so a later search for "English" or "mathematics" would fail even though they were inserted.

        If a deletion interface must be provided, what can we do?

Idea: Use several bits for each mapped position instead of one, and apply the idea of reference counting. Suppose four bits represent one position, so each position can count up to 15 (binary 1111). Every time an element is inserted, the counter at each of its mapped positions is incremented; when an element is deleted, the counter at each of its mapped positions is decremented. This gives the Bloom filter a deletion interface, and deleting one element no longer affects the others. The method has two disadvantages, however: it takes more space, and the range of each counter is limited.
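A minimal sketch of the counting idea follows. Several details are assumptions made for illustration: the class name CountingBloom is hypothetical, each counter is stored in a whole byte but capped at 15 to mimic a four-bit counter, two hash functions are used, and a counter that reaches the cap is frozen rather than allowed to wrap.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

class CountingBloom {
public:
    explicit CountingBloom(std::size_t slots) : counts_(slots, 0) { assert(slots > 0); }

    void insert(const std::string& key) {
        for (std::size_t p : positions(key)) {
            if (counts_[p] < kMax) ++counts_[p];  // saturate instead of wrapping
        }
    }

    bool test(const std::string& key) const {
        for (std::size_t p : positions(key)) {
            if (counts_[p] == 0) return false;
        }
        return true;
    }

    // Intended only for keys that were really inserted; removing a key that
    // merely tests as a false positive would corrupt other keys' counters.
    void remove(const std::string& key) {
        if (!test(key)) return;
        for (std::size_t p : positions(key)) {
            // A saturated counter has lost its true value, so leave it alone.
            if (counts_[p] > 0 && counts_[p] < kMax) --counts_[p];
        }
    }

private:
    static constexpr std::uint8_t kMax = 15;  // largest value of a 4-bit counter

    std::array<std::size_t, 2> positions(const std::string& key) const {
        std::size_t h1 = 0, h2 = 5381;
        for (unsigned char ch : key) {
            h1 = h1 * 131 + ch;
            h2 = h2 * 33 + ch;
        }
        return {{h1 % counts_.size(), h2 % counts_.size()}};
    }

    std::vector<std::uint8_t> counts_;
};

inline bool counting_bloom_demo() {
    CountingBloom bf(1024);
    bf.insert("language");
    bf.insert("English");
    bf.remove("language");
    bool english_kept = bf.test("English");  // shared positions keep a count >= 1

    CountingBloom only(64);
    only.insert("language");
    only.remove("language");
    bool removed = !only.test("language");   // all counters are back to 0
    return english_kept && removed;
}
```

Freezing saturated counters trades correctness of later deletions for safety: such positions can never be cleared again, which is one face of the limited counter range mentioned above.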

3. Advantages and disadvantages of Bloom filter

The advantages and disadvantages of Bloom filters are as follows:

Advantages:

1. Insertion and lookup take O(K) time, where K is the number of hash functions, which is generally very small;

2. A Bloom filter does not store the elements themselves, which is an advantage in scenarios with strict confidentiality requirements;

3. It can represent very large data sets;

4. When a small rate of misjudgment is acceptable, a Bloom filter has a great space advantage.

Disadvantages:

1. There is a misjudgment rate: a "present" answer may be a false positive.

2. The element itself cannot be retrieved.

3. Elements generally cannot be deleted from a Bloom filter.

4. Even if deletion is implemented with reference counting, the counters can overflow and wrap around, since they are small unsigned integers.

        Below is a test routine that measures the false positive rate of the Bloom filter; the rate can be tuned by adjusting _a:

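#include <cstdlib>   // srand, rand
#include <ctime>     // time
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// MySpace::BloomFilter is assumed to be the BloomFilter class template above,
// placed in the author's MySpace namespace.
void test_bloomfilter()
{
	srand((unsigned int)time(0));
	const int N = 1000000;
	MySpace::BloomFilter<N> bf;
	string str1 = "https://mp.csdn.net/mp_blog/creation/editor/132078512?spm=1001.2014.3001.5352";
	// Source strings: these are inserted into the filter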
	vector<string> v1;
	for (int i = 0; i < N; i++)
	{
		v1.push_back(str1 + to_string(i));
	}
	for (auto& e : v1)
	{
		bf.set(e);
	}

	// Similar strings: same prefix as the source strings, different suffix
	vector<string> v2;
	for (int i = 0; i < N; i++)
	{
		v2.push_back(str1 + to_string(999999 + rand()));
	}
	size_t count1 = 0;
	for (auto& e : v2)
	{
		if (bf.test(e))
		{
			count1++;
		}
	}
	// Dissimilar strings: a different prefix altogether
	vector<string> v3;
	string str2 = "www.baidu.com";
	for (int i = 0; i < N; i++)
	{
		v3.push_back(str2 + to_string(rand() + i));
	}
	size_t count2 = 0;
	for (auto& e : v3)
	{
		if (bf.test(e))
		{
			count2++;
		}
	}
	cout << count1 << endl;
	cout << count2 << endl;
	cout << "False positive rate, similar strings: " << (double)count1 / (double)N << endl;
	cout << "False positive rate, dissimilar strings: " << (double)count2 / (double)N << endl;
}

3. Hash cutting

        Hash cutting is a supplementary topic in this chapter. Consider the following interview question:

1. Given a log file larger than 100 GB that stores IP addresses, design an algorithm to find the IP address that appears most often.
Under the same conditions, how would you find the top K IPs? How could this be done directly with Linux system commands?

Idea: Finding the most frequent IP is a classic top-K problem, normally solved with a heap together with a count per IP, but this much data does not fit in memory. Instead, we use a hash function to split the data into, say, 500 small files, so that all occurrences of the same IP land in the same file. Then, for each small file, we use map/unordered_map to count the occurrences of each IP and record the most frequent one. Because the split is driven by a hash function, however, the small files are not guaranteed to be the same size, and some may still be too large for memory.

When a small file is too large, there are two cases:

1. The small file contains many duplicate IPs;

2. The small file contains few duplicate IPs; many distinct IPs simply happened to hash into the same file.

How do we tell the two cases apart? We simply try to count the small file with map/unordered_map. In case 1 the counting succeeds: most lines hit an IP that is already in the map and only increase its count, so the map stays small. In case 2 the map keeps growing until memory runs out, and the insertion throws a std::bad_alloc exception; when that happens, we split this small file again with a different hash function. In the end every small file yields its most frequent IP. We put these as pair<IP, times> into a max-heap (compared by count), and the top of the heap is the IP address that appears most often.
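The steps above can be sketched in memory, with vectors standing in for the small files on disk. Several choices here are assumptions made for the sketch: the function name top_k_ips is hypothetical, std::hash<std::string> serves as the splitting hash, and a min-heap of size k is used so that the same code answers the top-K variant (the article's description uses a max-heap of per-file winners). The re-split on std::bad_alloc is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using IpCount = std::pair<std::size_t, std::string>;  // (times, IP)

// Returns up to k (times, IP) pairs, most frequent first.
inline std::vector<IpCount> top_k_ips(const std::vector<std::string>& log,
                                      std::size_t parts, std::size_t k) {
    assert(parts > 0);
    // Step 1: hash cutting. Equal IPs hash equally, so every occurrence of
    // an IP lands in the same partition.
    std::vector<std::vector<std::string>> files(parts);
    for (const auto& ip : log) {
        files[std::hash<std::string>{}(ip) % parts].push_back(ip);
    }
    // Step 2: count each partition on its own and keep the k best seen so far
    // in a min-heap (the smallest count sits on top and is evicted first).
    std::priority_queue<IpCount, std::vector<IpCount>, std::greater<IpCount>> heap;
    for (const auto& file : files) {
        std::unordered_map<std::string, std::size_t> counts;
        for (const auto& ip : file) ++counts[ip];
        for (const auto& kv : counts) {
            heap.emplace(kv.second, kv.first);
            if (heap.size() > k) heap.pop();
        }
    }
    // Step 3: unload the heap (ascending) and reverse to get descending order.
    std::vector<IpCount> result;
    while (!heap.empty()) {
        result.push_back(heap.top());
        heap.pop();
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```

Because each IP is confined to one partition, the count obtained inside a partition is the IP's global count, which is what makes merging the per-partition results through a heap valid.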


Origin blog.csdn.net/Nice_W/article/details/132078512