Detailed explanation of C++: simulating a bitmap implementation

Why are there bitmaps?

From the previous study, you should already understand how a hash table works. We implemented the hash table in two different ways. The first is closed hashing: we create an array and store the data in it, and besides its data each node also holds an enumeration variable that represents the node's current state. The second is open hashing, also called hash buckets: this method combines a vector container with linked lists, and besides its data each node also holds a pointer that stores the address of the next node. Both methods can store data, but both have the same problem: they consume a lot of space. In addition to the data itself, extra variables must be stored to maintain the structure's rules. So when a hash table is used on a huge amount of data with a very simple requirement, it feels like overkill. In such cases, where the data is very large but the requirement is simple, we can use a bitmap instead.

A question to understand bitmaps

Let's learn the principle of bitmaps through an interview question: given 4 billion distinct unsigned integers, unsorted, and one more unsigned integer, how do you quickly determine whether that number is among the 4 billion? Many readers will immediately think of a hash table: store all 4 billion values in the hash table, then call find when a lookup is needed. But is that really feasible? Let's do a simple calculation. Assume the hash table is implemented with hash buckets. Storing an unsigned integer takes 4 bytes and storing a pointer takes 4 bytes (on a 32-bit platform), so each value in a hash bucket needs 8 bytes. With 4 billion values, that is 32 billion bytes. Dividing by 1024 three times, 32 billion bytes comes to roughly 30 GB, while the computers we normally use have only 8 GB or 16 GB of memory. Even if we wrote the program, our computer probably could not run it properly.

Although the amount of data in this question is huge, the goal is very simple: we only want to know whether a value exists. So is it meaningful to store the value itself? A value has only two possible states, present or absent, so can we use something else to represent that state? Of course. Computers are binary, and one bit has exactly two states, 1 or 0, so we can use one bit to represent the state of each value: 1 means the value exists and 0 means it does not. To judge whether a value appears among 4 billion unsigned integers we need about 4 billion bits, and 4 billion bits is only about 0.5 GB.

So for the problem above we allocate about 4.2 billion bits, one for every possible unsigned integer. Bit 0 indicates whether the value 0 exists, bit 1 indicates whether the value 1 exists, and bit 100 indicates whether the value 100 exists. To record which values exist, we only need to traverse the 4 billion values once and change the corresponding bit from 0 to 1. When making the final judgment, a simple mapping from value to bit tells us whether that bit is 0 or 1. This is the principle of the bitmap.
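The arithmetic above can be checked at compile time. This is only a sketch under the same assumptions as the estimate in the text (a 4-byte value plus a 4-byte pointer per node, and 32-bit unsigned integers); the constant names are made up for illustration.

```cpp
#include <cstdint>

// Assumed sizes, matching the estimate in the text: 4-byte value + 4-byte "next" pointer.
// (A 64-bit build would use 8-byte pointers and need even more memory.)
constexpr std::uint64_t kCount     = 4000000000ULL;        // 4 billion values
constexpr std::uint64_t kNodeBytes = 4 + 4;                // bytes per hash-bucket node
constexpr std::uint64_t kMiB       = 1024ULL * 1024;
constexpr std::uint64_t kGiB       = 1024ULL * 1024 * 1024;

constexpr std::uint64_t hash_bucket_bytes = kCount * kNodeBytes; // 32 billion bytes
constexpr std::uint64_t bitmap_bits       = 1ULL << 32;          // one bit per possible 32-bit value
constexpr std::uint64_t bitmap_bytes      = bitmap_bits / 8;

static_assert(hash_bucket_bytes == 32000000000ULL, "32 billion bytes for the hash buckets");
static_assert(hash_bucket_bytes / kGiB == 29, "a little under 30 GB (integer division)");
static_assert(bitmap_bytes == 512 * kMiB, "the bitmap needs 512 MB");
```

If the file compiles, all three checks hold: roughly 30 GB for the hash buckets against 512 MB for the bitmap.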

Simulating a bitmap implementation

From the explanation above, we know that a bitmap allocates about 4.2 billion bits and uses direct addressing to set the bit that corresponds to a value to 1 when that value exists. The memory used is 512 MB. But there is a problem: C++ cannot allocate memory in units of individual bits, so we can only allocate an array whose elements are char or int. That raises two questions: in which char or int unit does the bit for x live, and which bit inside that unit is it? With char units the solution is to divide by 8 and take the remainder modulo 8: x / 8 gives the unit, and x % 8 gives the bit inside it. One more question remains: how do we change that bit from 0 to 1 while leaving the other bits unchanged? The answer is to bitwise-OR the unit with a value that has a 1 only at that position. For example, to change the fifth bit from left to right from 0 to 1, we OR the unit with a value whose only 1 is in that fifth position. This changes the specified bit to 1 and keeps every other bit as it was, and such a value can be produced by shifting 1 to the left by the right distance. With that, we have a plan for simulating a bitmap. First, the bitmap is a class template with one non-type template parameter. The class contains three functions, set, reset, and test, which respectively change a given bit to 1, change a given bit to 0, and check whether a given bit is 1. Underneath, the class uses a vector whose element type is char. The skeleton looks like this:

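#include<iostream>
#include<vector>
using namespace std;
// n is the number of bits the bitmap must be able to hold
template<size_t n>
class bitset
{
public:
	// change the bit for x to 1
	void set(size_t x)
	{
	}
	// change the bit for x to 0
	void reset(size_t x)
	{
	}
	// check whether the bit for x is 1
	bool test(size_t x)
	{
		return false; // placeholder until test is implemented below
	}
private:
	vector<char> ch; // each char unit holds 8 bits
};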

set

First, implement the set function. To change a bit to 1, we first have to determine which char unit the bit is in. Dividing x by 8 gives the index of the char unit, and x modulo 8 gives the bit inside that unit, as in the following code:

void set(size_t x)
{
	size_t i = x / 8; // index of the char unit
	size_t j = x % 8; // bit position inside that unit
}

Once we have these, we index the array to get the corresponding char unit, shift 1 left by j to build the mask, and finally apply a bitwise OR. The code is as follows:

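void set(size_t x)
{
	size_t i = x / 8; // index of the char unit
	size_t j = x % 8; // bit position inside that unit
	ch[i] |= (1 << j); // OR with the mask: bit j becomes 1, the others are unchanged
}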

Some readers may wonder why we shift left, and whether we need to take into account if the machine uses big-endian or little-endian storage. The answer is no. Think back to when we learned the left-shift and right-shift operators: did we ever treat big-endian and little-endian separately? We did not. We only said that a left shift moves the data toward the higher bits and a right shift moves it toward the lower bits, which has nothing to do with whether the machine is big-endian or little-endian. So rather than saying that the first bit of the first char unit represents the first value, it is more accurate to say that the lowest bit of the first char unit represents the first value, and the lowest bit of the second char unit represents the 9th value. Writing the code as above achieves exactly that.
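To make the mapping concrete, here is a small sketch that can be compiled on its own. The names BitPos, locate, and mask are hypothetical helpers introduced only for this illustration; they are not part of the bitmap class.

```cpp
#include <cstddef>

// Hypothetical helper type: where the bit for a value x lives.
struct BitPos {
    std::size_t unit; // index of the char unit
    std::size_t bit;  // bit inside that unit, 0 = lowest bit
};

// Same mapping as in set: divide by 8 for the unit, modulo 8 for the bit.
constexpr BitPos locate(std::size_t x) { return BitPos{x / 8, x % 8}; }

// 1 << j works on the value, not on the bytes in memory,
// so the result does not depend on the machine's endianness.
constexpr unsigned char mask(std::size_t j) {
    return static_cast<unsigned char>(1u << j);
}

static_assert(locate(0).unit == 0 && locate(0).bit == 0, "first value: lowest bit of unit 0");
static_assert(locate(8).unit == 1 && locate(8).bit == 0, "9th value: lowest bit of unit 1");
static_assert(locate(85).unit == 10 && locate(85).bit == 5, "85 = 10 * 8 + 5");
static_assert(mask(0) == 1 && mask(7) == 128, "lowest and highest bit of a char unit");
```

The checks for the values 0 and 8 match the statement above: the first value lands on the lowest bit of the first unit, and the 9th value on the lowest bit of the second unit.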

reset

reset changes the 1 at the corresponding position to 0. Can it be implemented with the same idea as above? Yes: here we bitwise-AND every other bit with 1, which leaves it unchanged, and bitwise-AND the specified bit with 0. The mask is obtained by inverting 1 << j, so the code is as follows:

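void reset(size_t x)
{
	size_t i = x / 8; // index of the char unit
	size_t j = x % 8; // bit position inside that unit
	ch[i] &= ~(1 << j); // AND with the inverted mask: bit j becomes 0, the others are unchanged
}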

test

The test function returns whether the bit at the specified position is 1. The same idea applies: extract the bit at that position and return the result. The code is as follows:

bool test(size_t x)
{
	size_t i = x / 8; // index of the char unit
	size_t j = x % 8; // bit position inside that unit
	return ch[i] & (1 << j); // non-zero (true) only when bit j is 1
}
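The masks used by reset and test can also be tried on a single unit in isolation. The following is a sketch with hypothetical free functions (clear_bit and test_bit are names invented here), using unsigned char so that the arithmetic is easy to follow.

```cpp
// Hypothetical helpers that apply the reset and test masks to one unit.
constexpr unsigned char clear_bit(unsigned char unit, unsigned j) {
    // ~(1u << j) has a 0 only at position j, so AND clears that bit alone.
    return static_cast<unsigned char>(unit & ~(1u << j));
}

constexpr bool test_bit(unsigned char unit, unsigned j) {
    // The AND result is non-zero exactly when bit j is 1.
    return (unit & (1u << j)) != 0;
}

static_assert(clear_bit(0xFF, 4) == 0xEF, "1111'1111 with bit 4 cleared is 1110'1111");
static_assert(clear_bit(0xEF, 4) == 0xEF, "clearing a bit that is already 0 changes nothing");
static_assert(test_bit(0x10, 4), "bit 4 is set in 0001'0000");
static_assert(!test_bit(0xEF, 4), "bit 4 is clear in 1110'1111");
```

Comparing the AND result with 0 explicitly is a stylistic choice; the implicit conversion to bool in the code above gives the same answer.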

Constructor

The bitmap is a class template whose parameter is a non-type template parameter, so we can use this parameter to decide how much space to allocate. In the constructor, we use it to resize the vector. For example, if the data range is 0 to 99, we need 100 bits, which is 100/8+1 char units (allocating slightly more than needed does no harm). The code is as follows:

bitset()
{
	ch.resize(n / 8 + 1); // n bits need n / 8 + 1 char units; new units start at 0
}
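As a quick check of the sizing formula, the sketch below compares n / 8 + 1 with a round-up variant. The function names are hypothetical, and the round-up form is only an alternative, not what the article's constructor uses.

```cpp
#include <cstddef>

// The formula used in the constructor above.
constexpr std::size_t units_simple(std::size_t n) { return n / 8 + 1; }

// An alternative that rounds up exactly (assumption: n + 7 does not overflow).
constexpr std::size_t units_round_up(std::size_t n) { return (n + 7) / 8; }

static_assert(units_simple(100) == 13, "100 bits -> 13 char units");
static_assert(units_round_up(100) == 13, "same result when n is not a multiple of 8");
static_assert(units_simple(64) == 9, "one spare unit when n is a multiple of 8");
static_assert(units_round_up(64) == 8, "exact fit");
```

The two formulas differ only when n is a multiple of 8, where n / 8 + 1 allocates one extra char unit. That is harmless, as noted above.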

code testing

We can use the following code to test it:

#include"bitset.h"
int main()
{
	bitset<100> bs;
	bs.set(5);
	bs.set(10);
	bs.set(15);
	bs.set(85);
	if (bs.test(85))
	{
		cout << "exists" << endl;
	}
	else
	{
		cout << "does not exist" << endl;
	}
	bs.reset(85);
	if (bs.test(85))
	{
		cout << "exists" << endl;
	}
	else
	{
		cout << "does not exist" << endl;
	}

	return 0;
}

When the code runs, the first check reports that 85 exists, and after reset(85) the second check reports that it does not. This is in line with our expectations, so there is no problem with the implementation.
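For reference, the pieces can be assembled into one self-contained class. This is a sketch rather than the article's exact code: the name bit_map and the namespace sketch are made up to avoid clashing with std::bitset, unsigned char replaces char to sidestep sign issues, and the assert bounds checks are an addition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical assembled version of the bitmap; N is the number of bits.
template <std::size_t N>
class bit_map {
public:
    bit_map() : units_(N / 8 + 1, 0) {} // all bits start at 0

    void set(std::size_t x) {
        assert(x < N); // added bounds check, not in the original code
        units_[x / 8] |= static_cast<unsigned char>(1u << (x % 8));
    }

    void reset(std::size_t x) {
        assert(x < N);
        units_[x / 8] &= static_cast<unsigned char>(~(1u << (x % 8)));
    }

    bool test(std::size_t x) const {
        assert(x < N);
        return (units_[x / 8] & (1u << (x % 8))) != 0;
    }

private:
    std::vector<unsigned char> units_; // 8 bits per unit
};

} // namespace sketch
```

Used the same way as the test above, `sketch::bit_map<100> bs; bs.set(85);` makes `bs.test(85)` true, and `bs.reset(85)` makes it false again.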

Some questions about bitmaps

first question


Interview question: given 10 billion integers, find an integer that appears only once. In this problem a single bit cannot fully represent the state of a value: a value may not appear at all, may appear exactly once, or may appear more than once. That is three cases, and one bit cannot express them. Some readers may think: can't we use two adjacent bits to represent one value? For example, if the data range is 0~99, we allocate 200 bits, and each value gets two bits that can represent 4 cases. However, this approach requires troublesome changes to the structure of the bitmap, so we solve the problem another way: create two bitmaps, where the same position in both bitmaps represents the same value. 00 means the value has not appeared, 01 means it has appeared once, and 10 means it has appeared more than once. This idea does not require changing the original bitmap structure. The code is as follows:

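template<size_t N>
class twobitset
{
public:
	void set(size_t x)
	{
		if (!_bs1.test(x) && !_bs2.test(x)) // 00: not seen yet
		{
			_bs2.set(x); // 01: seen once
		}
		else if (!_bs1.test(x) && _bs2.test(x)) // 01: seen once
		{
			_bs1.set(x);
			_bs2.reset(x); // 10: seen more than once
		}

		// 10: already seen more than once, nothing to change
	}

	// print every value whose state is 01, i.e. that appeared exactly once
	void PirntOnce()
	{
		for (size_t i = 0; i < N; ++i)
		{
			if (!_bs1.test(i) && _bs2.test(i))
			{
				cout << i << endl;
			}
		}
		cout << endl;
	}

private:
	bitset<N> _bs1; // high bit of the state
	bitset<N> _bs2; // low bit of the state
};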

We can use the following code to do a test:

void test_twobitset()
{
	twobitset<100> tbs;
	int a[] = { 3, 5, 6, 7, 8, 9, 33, 55, 67, 3, 3, 3, 5, 9, 33 };
	for (auto e : a)
	{
		tbs.set(e);
	}
	tbs.PirntOnce();
}

When the code runs, it prints 6, 7, 8, 55, and 67, each on its own line, which are exactly the values that appear once in the array. This is in line with our expectations, which shows that our code is implemented correctly.
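For comparison, the other idea mentioned above, two adjacent bits per value inside a single bitmap, might look roughly like this. It is only a sketch: the class packed_counter, its member names, and the runtime range parameter are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical class: every value owns 2 adjacent bits, so 4 values share one byte.
// State 0 = not seen, 1 = seen once, 2 = seen more than once.
class packed_counter {
public:
    explicit packed_counter(std::size_t range)
        : range_(range), units_(range / 4 + 1, 0) {}

    unsigned state(std::size_t x) const {
        assert(x < range_);
        const unsigned shift = static_cast<unsigned>(x % 4) * 2;
        return (units_[x / 4] >> shift) & 0x3u;
    }

    void add(std::size_t x) {
        const unsigned s = state(x);
        if (s >= 2) return; // already "more than once", nothing to change
        const unsigned shift = static_cast<unsigned>(x % 4) * 2;
        unsigned unit = units_[x / 4];
        unit &= ~(0x3u << shift);  // clear the two bits of x
        unit |= (s + 1) << shift;  // write the next state
        units_[x / 4] = static_cast<unsigned char>(unit);
    }

    bool once(std::size_t x) const { return state(x) == 1; }

private:
    std::size_t range_;
    std::vector<unsigned char> units_;
};

} // namespace sketch
```

The extra shifting and masking is the "troublesome" part: every operation has to handle a two-bit field, whereas the two-bitmap version reuses set, reset, and test unchanged.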

second question

Interview question: given two files, each containing 10 billion integers, and only 1 GB of memory, how do we find the intersection of the two files?
Our solution to this kind of problem is to create two bitmaps, one for each file. Whenever a value appears in a file, we set the bit at the corresponding position in that file's bitmap to 1. For 32-bit integers each bitmap takes 512 MB, so the two together take about 1 GB. After reading both files, we traverse the two bitmaps at the same time: if the same position is 1 in both bitmaps, the value exists in both files. That is the solution to this kind of problem.
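A small in-memory sketch of this idea follows. The function intersect and its range parameter are hypothetical, two vectors stand in for the two files, and std::vector<bool> (usually stored as one bit per element) stands in for the bitmap class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: values must be smaller than `range`.
// For full 32-bit integers the range would be 2^32; it is kept small here.
std::vector<std::uint32_t> intersect(const std::vector<std::uint32_t>& a,
                                     const std::vector<std::uint32_t>& b,
                                     std::size_t range) {
    std::vector<bool> in_a(range, false); // bitmap for the first file
    std::vector<bool> in_b(range, false); // bitmap for the second file

    for (std::uint32_t v : a) { assert(v < range); in_a[v] = true; }
    for (std::uint32_t v : b) { assert(v < range); in_b[v] = true; }

    // Walk both bitmaps together: a position set in both is in the intersection.
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < range; ++i) {
        if (in_a[i] && in_b[i]) out.push_back(static_cast<std::uint32_t>(i));
    }
    return out;
}
```

Duplicates inside a file do not matter, because setting a bit that is already 1 changes nothing, and the result comes out in ascending order as a side effect of walking the bitmaps.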

third question

Interview question: given a log file larger than 100 GB that stores IP addresses, design an algorithm to find the IP address that occurs most often.
This question requires counting how many times each IP address occurs, so a bitmap cannot solve it: a bitmap can only tell whether a value is present, it cannot count occurrences. From previous study we know that map can be used for counting, so we use map here. But the log is far too large to fit in memory, because besides the original data the map has overhead of its own. We cannot count everything at once, but we can cut the data evenly into 100 small files, count the occurrences within each file, and then compare the most frequent entries of the 100 files. However, the counts obtained this way are inaccurate: the same IP can be scattered across different files, and its counts in those files are never added together. So instead of even splitting we use hash segmentation. Then the entries in one small file are either the same IP or IPs whose hashes collide, and every occurrence of a given IP lands in the same file, so counting one file gives that IP's total across the whole 100 GB.
A problem can still arise because the map container has extra overhead: when a small file exceeds 1 GB, the map may not be able to count it. There are two cases to discuss. In the first, the file exceeds 1 GB but the map can still count it, because most of the entries in the file are the same IP. In the second, the file exceeds 1 GB and the map cannot count it. Then the map's insert fails because there is no memory left: creating the new node fails, and new throws an exception. When that happens, we use another hash function to split that file again, recursively, cutting any file larger than 1 GB into smaller files until they can be counted. That is the idea behind this question. I hope everyone can understand it.
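The hash segmentation step can be sketched in memory as follows. Everything here is illustrative: most_frequent_ip is a hypothetical function, vectors stand in for the small files on disk, and the recursive re-split of oversized files is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the most frequent IP and its count.
std::pair<std::string, std::size_t> most_frequent_ip(const std::vector<std::string>& log,
                                                     std::size_t parts) {
    assert(parts > 0);

    // Step 1: hash segmentation. Equal IPs always hash to the same part.
    std::vector<std::vector<std::string>> files(parts);
    std::hash<std::string> hasher;
    for (const std::string& ip : log) {
        files[hasher(ip) % parts].push_back(ip);
    }

    // Step 2: count one part at a time, so only one map is alive at once,
    // and keep the best count seen across all parts.
    std::pair<std::string, std::size_t> best{"", 0};
    for (const std::vector<std::string>& file : files) {
        std::map<std::string, std::size_t> counts;
        for (const std::string& ip : file) {
            const std::size_t c = ++counts[ip];
            if (c > best.second) best = {ip, c};
        }
    }
    return best;
}
```

Because all occurrences of one IP land in the same part, the count found inside that part is already its total, which is exactly what even splitting could not guarantee.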

Origin blog.csdn.net/qq_68695298/article/details/131464736