How do you deal with large data files? Let's take a look at bitmaps and Bloom filters (Part 1)

Table of contents

Preamble

First, why do bitmaps and Bloom filters appear?

Two, bitmap/BitSet

2.1 What is a bitmap

2.2 Bitmap implementation

Three, the advantages and disadvantages of bitmap

3.1 Advantages of bitmaps

3.2 Disadvantages of bitmaps

Summary


Preamble

Processing large amounts of data is part of everyday software. For example, when you register a game name, it has to be compared against the database to confirm whether it already exists. Java has paging and database caching mechanisms, among others, while the answer we hand in for C++ is to use bitmaps and Bloom filters, both based on the idea of hashing, to handle lookups over big data. Of course, these are only some of the available tools. This time we will mainly explain the concept and implementation of bitmaps. We will talk about Bloom filters later.

First, why do bitmaps and Bloom filters appear?

In practice, the processing of big data is unavoidable. Among our data structures, the hash-table-based unordered_set/unordered_map are extremely efficient, but they fall short when faced with large files of several GB, or more than ten GB, that cannot be loaded into memory. To deal with this situation, the bitmap and the Bloom filter, which inherit the idea of hashing, were born.

Two, bitmap/BitSet

2.1 What is a bitmap

Before officially introducing the bitmap, let's take a look at a classic interview question from Tencent

 

As shown in the picture above, the task is to determine whether a given integer is among 4 billion integers. What would you do when you see it? There are two common ideas:

1. Traverse, time complexity O(N)

2. Sort, then use binary search, time complexity O(N log N) for the sort plus O(log N) per lookup
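On a small array that fits in memory, these two ideas might be sketched as follows. The function names `contains_linear` and `contains_sorted` are made up for this illustration and do not come from the original question.

```cpp
#include <algorithm>
#include <vector>

// Idea 1: scan every element, O(N) per lookup.
bool contains_linear(const std::vector<unsigned>& data, unsigned target)
{
	return std::find(data.begin(), data.end(), target) != data.end();
}

// Idea 2: binary search, O(log N) per lookup.
// Assumes `sorted` was already sorted once with std::sort, which costs O(N log N).
bool contains_sorted(const std::vector<unsigned>& sorted, unsigned target)
{
	return std::binary_search(sorted.begin(), sorted.end(), target);
}
```

Both functions assume the whole data set is already sitting in a `std::vector`, which is exactly the assumption that stops holding once the data grows large enough.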

The two methods above work well for ordinary existence checks, but this question involves 4 billion integers, which is 16 billion bytes, or roughly 16 GB. Generally speaking, that cannot be loaded into memory at all, so how do we deal with it? Here we can solve it like this:

3. Bitmap solution

We can use 0 and 1 to represent whether an integer exists, so each integer can be mapped to a single bit, as shown in the figure below

 So the concept of our bitmap is as follows

A bitmap uses each bit to store a state. It suits scenarios where there is a large amount of data and the data contains no duplicates, and it is usually used to judge whether a given value exists.

2.2 Bitmap implementation

Here we will implement a bitmap structure with three basic functions: set (insert), reset (delete), test (find)

First of all, our bitmap maps one value to one bit, so the space we allocate depends on the range of values: covering N possible values takes N bits. Reading 4 billion integers directly would need 16 GB of memory. With roughly 4 billion bits instead, we only need about 500 MB, which saves a great deal of space.
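The arithmetic behind these numbers can be checked with a few lines of C++. This is only a rough estimate: the helper names `bytes_as_ints` and `bytes_as_bits` are hypothetical, and the code assumes 4-byte integers.

```cpp
#include <cstdint>

// Memory needed to store `count` integers directly, assuming 4 bytes each.
constexpr std::uint64_t bytes_as_ints(std::uint64_t count)
{
	return count * 4;
}

// Memory needed for a bitmap with `count` bits, rounded up to whole bytes.
constexpr std::uint64_t bytes_as_bits(std::uint64_t count)
{
	return (count + 7) / 8;
}

// 4 billion integers: 16,000,000,000 bytes (about 16 GB) stored directly,
// but only 500,000,000 bytes (about 500 MB) at one bit each.
static_assert(bytes_as_ints(4000000000ULL) == 16000000000ULL, "about 16 GB");
static_assert(bytes_as_bits(4000000000ULL) == 500000000ULL, "about 500 MB");
```

The bitmap is 32 times smaller because each value costs one bit instead of 32.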

Secondly, among the built-in types, char is 1 byte (8 bits) and int is typically 4 bytes (32 bits). Here we choose char as the basic unit, which makes it more convenient to locate the data

Finally, to find the location where the data is mapped, we can use the following formula

// locate the i-th char
size_t i = n / 8;
// locate the j-th bit inside the i-th char
size_t j = n % 8;

After locating the position, we use bitwise & (a result bit is 1 only if both corresponding bits are 1, otherwise it is 0), bitwise | (a result bit is 1 if either corresponding bit is 1, and 0 only if both are 0), and similar operations to complete data insertion, deletion, and search
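As a worked example of the formula and operators above, take the number 10: it lands in char 1 (10 / 8) at bit 2 (10 % 8), so the mask is 1 << 2. Here is a minimal sketch on a raw byte array, where the names `set_bit`, `clear_bit`, and `has_bit` are chosen only for illustration:

```cpp
#include <cstddef>

// Set bit n: OR with a mask that has only bit (n % 8) set.
inline void set_bit(unsigned char* bytes, std::size_t n)
{
	bytes[n / 8] |= static_cast<unsigned char>(1u << (n % 8));
}

// Clear bit n: AND with the inverted mask, which keeps every other bit unchanged.
inline void clear_bit(unsigned char* bytes, std::size_t n)
{
	bytes[n / 8] &= static_cast<unsigned char>(~(1u << (n % 8)));
}

// Test bit n: AND with the mask; a non-zero result means the bit is set.
inline bool has_bit(const unsigned char* bytes, std::size_t n)
{
	return (bytes[n / 8] & (1u << (n % 8))) != 0;
}
```

Starting from `unsigned char bytes[2] = {0, 0}`, calling `set_bit(bytes, 10)` leaves `bytes[1] == 4` (binary 00000100), and `clear_bit(bytes, 10)` returns it to 0.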

Based on the above ideas, we drew a picture as follows

 

The implementation code is as follows

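	#include <cstddef>
	#include <vector>

	template <size_t N> // N is the data range: one bit for each value 0 .. N-1
	class BitSet
	{
	public:
		// Constructor
		BitSet()
		{
			// Allocate one extra char because integer division drops the remainder:
			// for example 10 / 8 == 1, but 2 bits are left over and need another char.
			_bs.resize(N / 8 + 1, 0);
		}

		// Insert: set the bit for number to 1
		void set(size_t number)
		{
			size_t i = number / 8; // which char
			size_t j = number % 8; // which bit inside that char
			_bs[i] |= (1 << j);
		}

		// Delete: clear the bit for number to 0
		void reset(size_t number)
		{
			size_t i = number / 8;
			size_t j = number % 8;
			_bs[i] &= ~(1 << j);
		}

		// Find: report whether the bit for number is 1
		bool test(size_t number) const
		{
			size_t i = number / 8;
			size_t j = number % 8;
			return (_bs[i] & (1 << j)) != 0;
		}

	private:
		std::vector<char> _bs;
	};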

We can test it with a test example

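	#include <iostream>

	void test_bitset1()
	{
		BitSet<100> bs;
		bs.set(10);
		bs.set(11);
		bs.set(15);
		std::cout << bs.test(10) << std::endl; // prints 1
		std::cout << bs.test(15) << std::endl; // prints 1

		bs.reset(10);

		std::cout << bs.test(10) << std::endl; // prints 0
		std::cout << bs.test(15) << std::endl; // prints 1

		bs.reset(10); // resetting a bit that is already 0 is harmless
		bs.reset(15);

		std::cout << bs.test(10) << std::endl; // prints 0
		std::cout << bs.test(15) << std::endl; // prints 0
	}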

The test behaves as expected.
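To tie this back to the interview question, the same idea can be applied to membership lookups. The sketch below is an illustration under stated assumptions rather than the original author's code: it uses a small runtime-sized bitmap (the hypothetical `ValueBitmap`) covering the values 0 .. range-1 instead of the full 32-bit range, and the data is a short in-memory array rather than a large file.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical runtime-sized bitmap covering the values 0 .. range-1.
class ValueBitmap
{
public:
	explicit ValueBitmap(std::size_t range) : _bits(range / 8 + 1, 0) {}

	void set(std::size_t value)
	{
		_bits[value / 8] |= static_cast<unsigned char>(1u << (value % 8));
	}

	bool test(std::size_t value) const
	{
		return (_bits[value / 8] & (1u << (value % 8))) != 0;
	}

private:
	std::vector<unsigned char> _bits;
};

// Mark every value that occurs in data. Afterwards each lookup is a single bit test.
ValueBitmap build_bitmap(const std::vector<unsigned>& data, std::size_t range)
{
	ValueBitmap bm(range);
	for (unsigned value : data)
	{
		if (value < range) // ignore values outside the assumed range
			bm.set(value);
	}
	return bm;
}
```

For the real question, `range` would have to cover every 32-bit value (2^32 bits, about 512 MB), and the integers would presumably be read from the file in chunks instead of being held in one vector.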

Three, the advantages and disadvantages of bitmap

3.1 Advantages of bitmaps

1. Lookups follow the mapping directly, so the time complexity is O(1) and the efficiency is high

2. Compared with the space used for directly loading integer data, the bitmap greatly saves space
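Both advantages can also be observed with the standard library's `std::bitset`, which offers the same `set`, `reset`, and `test` operations. A small sketch follows; the size 1024 and the function name `bitset_demo` are arbitrary choices for illustration:

```cpp
#include <bitset>
#include <cstddef>

// Returns true if a 1024-bit std::bitset behaves as described above
// and occupies less memory than an array of 1024 ints.
bool bitset_demo()
{
	std::bitset<1024> bits; // all bits start at 0
	bits.set(10);           // insert 10
	bits.set(15);           // insert 15
	bits.reset(10);         // delete 10

	// Each lookup is one indexed bit test, independent of how many values are stored.
	bool lookups_ok = !bits.test(10) && bits.test(15);

	// One bit per value instead of one int per value.
	bool smaller = sizeof(bits) < 1024 * sizeof(int);

	return lookups_ok && smaller;
}
```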

3.2 Disadvantages of bitmaps

1. Only integer data can be mapped. Other types, such as strings, cannot be handled directly

Summary

That is all the content on bitmaps for this section. The advantages of the bitmap are very powerful, but the disadvantage is also obvious: it cannot be applied to other kinds of data, such as strings. So is there a solution for this? Of course there is. This is where the Bloom filter makes its debut, so stay tuned for the next part.

Origin blog.csdn.net/zcxmjw/article/details/131002473