[C++] Application of hash -- bitmap

1. Introduction of bitmap

We introduce bitmaps through an interview question:

Given 4 billion distinct, unsorted unsigned integers, and then one more unsigned integer, how can you quickly determine whether that integer is among the 4 billion?

The conventional approaches are sorting followed by binary search, or inserting the data into an unordered_map/unordered_set and then looking the value up. Neither works here, because the data set is too large to hold in memory.

1G of space is about 1 billion bytes. Here there are 4 billion integers of 4 bytes each, 16 billion bytes in total, or about 16G, while the memory available is generally around 4G. Sorting plus binary search would require a 16G integer array, which is clearly not feasible. A hash table fares even worse: each node in a bucket also stores a pointer to the next node, so it consumes more space than the plain array.

Since the conventional approaches do not work, let's change perspective. The question only asks whether a number exists and nothing else, so we do not need to store the numbers at all; we only need to mark them, and marking a number takes just one bit. If the bit is 1 the number exists, and if it is 0 the number does not exist.

A bitmap uses individual bits to store a state, which makes it well suited to judging whether a value is present in a very large data set. It is in effect a variant of the direct-addressing method of a hash table.


2. Implementing the bitmap

With the idea in place, implementing the bitmap is straightforward. In general a bitmap only needs to provide the following three interfaces:

  • set: sets the bit corresponding to a value to 1, that is, marks (inserts) the value;
  • reset: sets the bit corresponding to a value to 0, that is, unmarks (deletes) the value;
  • test: checks whether the bit corresponding to a value is 1, that is, looks the value up.

The code is implemented as follows:

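#pragma once
#include <cstddef>
#include <vector>

namespace thj {
    // N is the size of the value range to cover, not the number of values stored.
    template<size_t N>
    class bitset {
    public:
        bitset() {
            // One bit per value; the extra byte covers the remainder
            // discarded by integer division.
            _bs.resize(N / 8 + 1, 0);
        }

        // Mark x as present.
        void set(size_t x) {
            size_t i = x / 8;   // index of the byte that holds x
            size_t j = x % 8;   // bit position of x inside that byte
            _bs[i] |= (1 << j);
        }

        // Mark x as absent.
        void reset(size_t x) {
            size_t i = x / 8;
            size_t j = x % 8;
            _bs[i] &= ~(1 << j);
        }

        // Return true if x is marked as present.
        bool test(size_t x) const {
            size_t i = x / 8;
            size_t j = x % 8;
            return _bs[i] & (1 << j);
        }

    private:
        std::vector<char> _bs;
    };
}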

Here the template parameter N is the range of the data (note that N is not the number of data items). The smallest data type in C++ is char, which occupies one byte; a byte has 8 bits and can therefore mark 8 elements, so the constructor resizes the vector to N/8+1 bytes. The 1 is added because division in C++ is integer division, which discards the remainder, so one more byte is needed to cover the leftover bits.

Note: The element type of the vector could also be a 32-bit integer type such as unsigned int, in which case the constructor would resize to N/32+1 and the functions would use x/32 and x%32.

In the set, reset and test functions, x/8 gives the subscript that x maps to, that is, which char holds it, and x%8 gives the bit that x maps to within that char. The corresponding bit is then set to 1, cleared to 0, or read. For example, x = 19 maps to subscript 19/8 = 2 and bit 19%8 = 3.

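With the bitmap we can solve the interview question above. Since the question only says the data are unsigned integers and gives no specific range, the bitmap has to cover the whole range of a 32-bit unsigned integer. One way is to define N as -1, because converting -1 to an unsigned type yields that type's maximum value (compare npos in string). Note that N is a size_t, so this gives the intended 2^32 - 1 only where size_t is 32 bits wide; on a 64-bit platform the range should be written explicitly, for example as 0xFFFFFFFF. Then we only need to set each of the 4 billion elements in turn and finally test the target element.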
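The procedure just described can be sketched as follows. This is a minimal sketch, not the article's class: DynamicBitmap and contains are hypothetical names, the range is passed at run time so that the storage lives on the heap, and the input is assumed to be an in-memory vector rather than a file.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: a bitmap whose range is chosen at run time.
class DynamicBitmap {
public:
    // range = number of distinct values to cover; valid values are [0, range).
    explicit DynamicBitmap(std::uint64_t range)
        : bytes_(static_cast<std::size_t>(range / 8 + 1), 0) {}

    // Mark x as present.
    void set(std::uint32_t x) {
        bytes_[x / 8] |= static_cast<std::uint8_t>(1u << (x % 8));
    }

    // Return true if x was marked.
    bool test(std::uint32_t x) const {
        return (bytes_[x / 8] >> (x % 8)) & 1u;
    }

private:
    std::vector<std::uint8_t> bytes_;
};

// Marks every value in data, then answers one membership query.
// For the real problem, range would be 1ULL << 32 and the values would be
// streamed from a file instead of being held in a vector.
inline bool contains(const std::vector<std::uint32_t>& data,
                     std::uint64_t range, std::uint32_t target) {
    DynamicBitmap bm(range);
    for (std::uint32_t v : data) bm.set(v);
    return bm.test(target);
}
```

For instance, contains({3, 17, 1000}, 1024, 17) marks three values in a 1024-value range and then reports that 17 is present. All values, including the target, must be smaller than range.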

Note: The maximum value of a 32-bit unsigned integer is about 4.29 billion, so that many bits are needed for marking. That converts to about 500 million bytes, and since 1G of memory is about 1 billion bytes, the bitmap takes up about 512M of memory at most, which an ordinary computer can handle today.
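The arithmetic can be checked at compile time. The sketch below assumes 32-bit unsigned integers; the constant names are made up for illustration.

```cpp
#include <cstdint>

// Plain array: 4 billion values of 4 bytes each.
constexpr std::uint64_t kValueCount = 4000000000ULL;
constexpr std::uint64_t kArrayBytes = kValueCount * sizeof(std::uint32_t);

// Bitmap: one bit for each of the 2^32 possible 32-bit values.
constexpr std::uint64_t kRangeBits   = 1ULL << 32;      // 4,294,967,296 bits
constexpr std::uint64_t kBitmapBytes = kRangeBits / 8;  // 536,870,912 bytes

static_assert(kArrayBytes == 16000000000ULL, "about 16G as a plain array");
static_assert(kBitmapBytes == 512ULL * 1024 * 1024, "exactly 512M as a bitmap");
```

The bitmap is about 30 times smaller than the array, and its size depends only on the value range, not on how many values are stored.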


3. bitset

In fact, the C++ standard library also provides a bitmap, which it calls a set of bits: bitset (std::bitset, declared in the <bitset> header). It offers more functionality than our own implementation, but the main operations set, reset and test work the same way.
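A short usage sketch of std::bitset follows. The function name bitset_demo and the size 100 are chosen only for illustration.

```cpp
#include <bitset>
#include <cstddef>

// Marks a few values in a small std::bitset and returns how many bits are set.
inline std::size_t bitset_demo() {
    std::bitset<100> bs;        // 100 bits, all initially 0
    bs.set(3);                  // mark 3
    bs.set(42);                 // mark 42
    bs.set(99);                 // mark 99
    bs.reset(42);               // unmark 42
    bool has3  = bs.test(3);    // true: 3 is still marked
    bool has42 = bs.test(42);   // false: 42 was reset
    return (has3 && !has42) ? bs.count() : 0;   // count() is the number of 1 bits
}
```

Unlike the hand-written version, test, set and reset on std::bitset throw std::out_of_range when the position is not smaller than N. In common implementations the bits are stored inside the object itself, so a bitset covering the full 32-bit range is usually allocated on the heap rather than declared as a local variable.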



4. Applications of bitmaps

Bitmaps are mainly used for the following:

  1. Quickly finding out whether a value is in a set;
  2. Sorting and deduplication;
  3. Finding the intersection and union of two sets;
  4. Disk block marking in the operating system.

For quickly finding whether a value is in a set, we have given an example above; here is a variant of it:

Given 10 billion integers, design an algorithm to find the integers that occur only once.

A plain bitmap cannot solve this problem, because it can only indicate presence or absence, not how many times a number appears. The reason is that a bitmap represents each value with a single bit, and one bit can only distinguish two states. So we combine two bitmaps and use two bits per value. Two bits can distinguish four states, of which we use three:

  • 00: absent;
  • 01: appears once;
  • 10: appears twice or more.

The code is implemented as follows:

#include <bitset>
#include <cstddef>

namespace thj {
    template<size_t N>
    class two_bitset {
    public:
        void set(size_t x) {
            bool flag1 = _bs1.test(x);
            bool flag2 = _bs2.test(x);

            // 00: first insertion, change to 01
            if (!flag1 && !flag2) {
                _bs2.set(x);
            }
            // 01: second insertion, change to 10
            else if (!flag1 && flag2) {
                _bs1.set(x);
                _bs2.reset(x);
            }
            // 10: third and later insertions, leave unchanged
        }

        bool test(size_t x) const {
            // Only 01 means the value appears exactly once
            return !_bs1.test(x) && _bs2.test(x);
        }

    private:
        std::bitset<N> _bs1;   // high bit of the two-bit state
        std::bitset<N> _bs2;   // low bit of the two-bit state
    };
}


Note: The question only says that 10 billion integers are given, without stating their range, so the range of the bitmap still has to be the maximum value of an unsigned integer. For a quick test, a small N such as 100 is more convenient.

A further variation on the bitmap:

A file contains 10 billion ints and we have 1G of memory. Design an algorithm to find all integers that appear no more than 2 times.

The idea is the same as for finding the numbers that appear once, but here the occurrence counts to distinguish are 0, 1, 2, and 3 or more, so the state 11 is needed as well. The code is not given here; you can try to implement it yourself.
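For readers who want to compare against their own attempt, one possible sketch is shown below. It is not the author's solution: occurrence_counter and seen_at_most_twice are hypothetical names, values are assumed to be smaller than 100, and "no more than 2 times" is read as "once or twice".

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Hypothetical two-bit saturating counter per value, built from two bitsets.
// (hi, lo): 00 = never seen, 01 = once, 10 = twice, 11 = three times or more.
template <std::size_t N>
class occurrence_counter {
public:
    void add(std::size_t x) {
        bool hi = hi_.test(x), lo = lo_.test(x);
        if (!hi && !lo)     { lo_.set(x); }                // 00 -> 01
        else if (!hi && lo) { hi_.set(x); lo_.reset(x); }  // 01 -> 10
        else if (hi && !lo) { lo_.set(x); }                // 10 -> 11
        // 11 stays 11
    }

    // True when x was seen exactly once or exactly twice (state 01 or 10).
    bool at_most_twice(std::size_t x) const {
        return hi_.test(x) != lo_.test(x);
    }

private:
    std::bitset<N> hi_, lo_;
};

// Returns, in ascending order, the values that occur once or twice in data.
inline std::vector<std::size_t>
seen_at_most_twice(const std::vector<std::size_t>& data) {
    occurrence_counter<100> counter;   // assumption: every value is < 100
    for (std::size_t v : data) counter.add(v);
    std::vector<std::size_t> out;
    for (std::size_t v = 0; v < 100; ++v) {
        if (counter.at_most_twice(v)) out.push_back(v);
    }
    return out;
}
```

With the input {5, 5, 5, 7, 9, 9}, the value 5 reaches state 11 and is excluded, while 7 (once) and 9 (twice) are reported.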


For sorting and deduplication, we map each value to be sorted to a bit in some fixed way (such as the division and modulus above) and mark it in a bitmap. We then traverse the bitmap from the lowest bit to the highest and convert every bit that is 1 back into its value in the same way, so the values come out in ascending order. And because a bit can only represent presence or absence, the result is inherently deduplicated.

Note: Bitmaps are suitable for scenarios where the data range is relatively concentrated. If the data range is scattered, you should consider using other data structures to implement sorting and deduplication functions, such as set and map.
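The sorting and deduplication idea can be sketched with std::bitset. The function name bitmap_sort_unique is hypothetical, and every input value is assumed to be smaller than N.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Returns the distinct values of data in ascending order.
// Assumption: every value in data is smaller than N.
template <std::size_t N>
std::vector<std::size_t> bitmap_sort_unique(const std::vector<std::size_t>& data) {
    std::bitset<N> seen;
    for (std::size_t v : data) seen.set(v);   // duplicates collapse into one bit

    std::vector<std::size_t> out;
    for (std::size_t v = 0; v < N; ++v) {     // ascending scan yields sorted order
        if (seen.test(v)) out.push_back(v);
    }
    return out;
}
```

The scan always visits all N bits, which is why this approach pays off only when the values are dense within the range, as the note above says.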


For the intersection and union of two sets, we still use interview questions as an example:

Given two files, each with 10 billion integers, we only have 1G memory, how to find the intersection of the two files?

The idea is simple. We can first map all the data in the first file to a bitmap and then read the data in the second file one by one and test each value against it, but this may output the same value many times when the second file contains duplicates. So instead we can map the data of the two files to two bitmaps, then traverse one bitmap and test each set bit against the other (equivalently, AND the two bitmaps).
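The two-bitmap version can be sketched as follows. The function name bitmap_intersection is hypothetical, the two files are modelled as in-memory vectors, and every value is assumed to be smaller than N.

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Returns the values present in both a and b, each reported once, in ascending order.
// Assumption: every value is smaller than N.
template <std::size_t N>
std::vector<std::size_t> bitmap_intersection(const std::vector<std::size_t>& a,
                                             const std::vector<std::size_t>& b) {
    std::bitset<N> in_a, in_b;
    for (std::size_t v : a) in_a.set(v);
    for (std::size_t v : b) in_b.set(v);

    std::bitset<N> both = in_a & in_b;   // a bit survives only if set in both maps

    std::vector<std::size_t> out;
    for (std::size_t v = 0; v < N; ++v) {
        if (both.test(v)) out.push_back(v);
    }
    return out;
}
```

Replacing & with | in the same sketch would give the union instead of the intersection.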


For disk block marking: a file system divides the space on the disk into blocks of a fixed size and keeps one bit in a bitmap for each block. A bit of 0 indicates that the block is free, and a bit of 1 indicates that the block has been allocated to a file or directory.

When the file system needs to allocate a new block, it finds the first 0 bit in the bitmap, sets it to 1, and allocates that block to the file. When the file system needs to release a block, it sets the bit corresponding to the block to 0, indicating that the block is free again and can be reallocated to other files or directories.
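The allocate and release steps can be sketched as a small class. This is an illustration only, not how any particular file system is implemented; block_bitmap and its member names are hypothetical.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical allocator over N fixed-size blocks: bit 0 = free, bit 1 = allocated.
template <std::size_t N>
class block_bitmap {
public:
    // Finds the first free block, marks it allocated and returns its index.
    // Returns N when every block is already allocated.
    std::size_t allocate() {
        for (std::size_t i = 0; i < N; ++i) {
            if (!used_.test(i)) {
                used_.set(i);
                return i;
            }
        }
        return N;
    }

    // Marks a block as free again so that a later allocate() can reuse it.
    void release(std::size_t block) { used_.reset(block); }

    bool is_allocated(std::size_t block) const { return used_.test(block); }

private:
    std::bitset<N> used_;
};
```

Because allocate() always scans from bit 0, a released block with a low index is reused before any higher free block.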


5. Hash cutting

In addition to bitmaps, the following questions are often tested during interviews:

Given a log file of more than 100G that stores IP addresses, design an algorithm to find the IP address that occurs most often.

Unlike the questions above, this one cannot be solved with a bitmap: we do not know the maximum number of times the same IP may appear, so we cannot decide how many bits to use for each item.

Since 100G is too big to hold in memory, can we split the file evenly into 100 small files of about 1G each and then load them into a map one at a time for counting? The answer is no. Before counting the next small file we would have to discard the results of the previous one, that is, the data in the map, because otherwise the map could still grow too large for memory. But discarding them makes the counts inaccurate, because an even split cannot guarantee that all occurrences of the same IP end up in the same sub-file.

The correct solution is to perform hash cutting – first use the string hash function to convert the IP address into an integer, and then use the division and remainder method to divide the IP address in the 100G file into different small files:

size_t Ai = HashFunc(IP) % 100;  // 100 is the number of small files

After hash cutting, all occurrences of the same IP are placed in the same small file, because the string hash function converts the same IP to the same integer, so the index obtained after the modulus is also the same. Different IPs may also be placed in the same file, because hash collisions occur. The cutting can lead to two outcomes:

  1. The sub-file is about 1G or smaller, so it fits in memory. We can directly use a map to count the occurrences of each IP address in it (all occurrences of the same IP address are guaranteed to be in the same sub-file).
  2. The sub-file is still very large. If that is because one or a few IP addresses occur a great many times, the map stays small (one entry per distinct IP), so the sub-file can still be counted by reading it sequentially. If it is because many different IP addresses collided into it, the map may not fit in memory; in that case we switch to a different string hash function and hash-cut this sub-file again, that is, solve it recursively as a sub-problem.

All occurrences of the most frequent IP address are mapped to a single sub-file, so counting each sub-file with a map and keeping the largest count seen so far gives both the IP address and its number of occurrences.
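The whole procedure can be sketched in memory. This is a simplified illustration under several assumptions: the sub-files are modelled as vectors rather than files on disk, std::hash stands in for the article's HashFunc, std::unordered_map is used for counting, the function name most_frequent_ip is hypothetical, and the recursive re-cutting of oversized sub-files is left out.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns the most frequent IP in log together with its count.
// Assumption: bucket_count is greater than 0.
inline std::pair<std::string, std::size_t>
most_frequent_ip(const std::vector<std::string>& log, std::size_t bucket_count) {
    // Step 1: hash cutting. Equal IPs always land in the same bucket.
    std::vector<std::vector<std::string>> buckets(bucket_count);
    std::hash<std::string> hash_func;   // stands in for HashFunc
    for (const std::string& ip : log) {
        buckets[hash_func(ip) % bucket_count].push_back(ip);
    }

    // Step 2: count each bucket on its own; only one map is alive at a time.
    std::pair<std::string, std::size_t> best{"", 0};
    for (const auto& bucket : buckets) {
        std::unordered_map<std::string, std::size_t> counts;
        for (const std::string& ip : bucket) {
            std::size_t c = ++counts[ip];
            if (c > best.second) best = {ip, c};   // remember the overall maximum
        }
    }
    return best;
}
```

Discarding each bucket's map after it has been scanned is safe here precisely because no IP can appear in two buckets, which is the property an even split by position does not have.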



Origin blog.csdn.net/m0_62391199/article/details/130094146