Simulation implementation and interview questions of bitmap and Bloom filter

Bitmap

Simulation implementation

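#include <cstddef>
#include <vector>

namespace yyq
{
    template<size_t N>
    class bitset
    {
    public:
        bitset()
        {
            // One char holds 8 bits; the +1 covers the remainder when N is not a multiple of 8.
            _bits.resize(N / 8 + 1, 0);
            //_bits.resize((N >> 3) + 1, 0);
        }

        // Mark bit x as present.
        void set(size_t x)
        {
            size_t i = x / 8; // index of the char that holds bit x
            size_t j = x % 8; // position of the bit inside that char

            _bits[i] |= (1 << j);
        }

        // Clear the mark on bit x.
        void reset(size_t x)
        {
            size_t i = x / 8;
            size_t j = x % 8;

            _bits[i] &= (~(1 << j));
        }

        // Test whether bit x is marked.
        bool test(size_t x) const
        {
            size_t i = x / 8;
            size_t j = x % 8;

            // Integer promotion: the 1-byte char is widened to int (sign-extended if char
            // is signed) before the &, and any nonzero result converts to true.
            return _bits[i] & (1 << j);
        }

    private:
        std::vector<char> _bits;
    };
}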

Of course, a bitmap also has a disadvantage: it can only handle integer data.

Application

  1. Quickly check whether a value is in a collection
  2. Sort + deduplicate
  3. Find the intersection, union, etc. of two sets
  4. Disk block marking in the operating system

A bitmap uses one bit to record the presence or absence of a value (the direct addressing method of hashing). The advantage is that it saves space and is efficient. The disadvantage is that it can only process integer data and needs the data to be relatively concentrated. Combining hashing with a bitmap gives the Bloom filter.
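The "sort + deduplicate" use can be sketched with std::bitset. This is only an illustration: the helper name sort_unique and the Range parameter are made up here, and every input value is assumed to lie in [0, Range).

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the article): sorts and deduplicates
// values that are known to lie in [0, Range).
template <std::size_t Range>
std::vector<std::size_t> sort_unique(const std::vector<std::size_t>& input)
{
    std::bitset<Range> seen;        // one bit per possible value
    for (std::size_t x : input)
    {
        assert(x < Range);          // values outside the range cannot be mapped
        seen.set(x);                // duplicates collapse onto the same bit
    }

    std::vector<std::size_t> out;
    for (std::size_t v = 0; v < Range; ++v)
    {
        if (seen.test(v))
        {
            out.push_back(v);       // scanning in index order yields sorted output
        }
    }
    return out;
}
```

Calling sort_unique<16> on 5, 3, 5, 1, 3 returns 1, 3, 5. The cost is proportional to the range rather than to the number of inputs, which is why the data needs to be relatively concentrated.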

A bitmap maps each value to a single position through one hash function to decide whether it is present; a Bloom filter maps each value to multiple positions through multiple hash functions to reduce the false positive rate, and its answer is either "definitely absent" or "possibly present".

Bloom filter

The Bloom filter is a compact and clever probabilistic data structure proposed by Burton Howard Bloom in 1970. It is characterized by efficient insertion and query, and it can tell you that something "definitely does not exist" or "may exist". It uses multiple hash functions to map a piece of data into a bitmap structure. This not only improves query efficiency but also saves a lot of memory.

Simulation implementation

The choice of the number of hash functions

For a filter of fixed length, the more hash functions there are, the faster its bits are set to 1 and the more hashing each insert and query has to do, so efficiency drops; keeping the false positive rate down then takes more bits and more memory. If there are too few hash functions, the false positive rate becomes higher.

k is the number of hash functions, m is the length of the bloom filter, n is the number of inserted elements, p is the false positive rate

The formulas are $k = \frac{m}{n} \ln 2$ and $m = -\frac{n \ln p}{(\ln 2)^2}$.

The first formula gives $m = \frac{k n}{\ln 2}$. When we use 3 hash functions, the length of the Bloom filter is $\frac{3n}{\ln 2} \approx 4.33n$.

In the code we simply round this up to 5n, which is the template parameter X = 5; it can be changed.
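The sizing formulas can be checked numerically with a small sketch. The function names are made up for this illustration, and the false positive estimate p ≈ (1 - e^(-kn/m))^k is the standard textbook approximation, which the article itself does not state.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Bits needed for n elements at target false positive rate p:
// m = -n * ln(p) / (ln 2)^2
double optimal_bits(double n, double p)
{
    const double ln2 = std::log(2.0);
    return -n * std::log(p) / (ln2 * ln2);
}

// Number of hash functions that suits m bits and n elements:
// k = (m / n) * ln 2
double optimal_hashes(double m, double n)
{
    return m / n * std::log(2.0);
}

// Standard approximation (an assumption here, not taken from the article):
// p ≈ (1 - e^(-k * n / m))^k
double false_positive_rate(double k, double m, double n)
{
    return std::pow(1.0 - std::exp(-k * n / m), k);
}
```

With k = 3 and m = 5n, false_positive_rate(3, 5, 1) comes out at roughly 0.092, so about 9% of absent keys would be expected to test as "possibly present" when the hash functions behave well.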

struct BKDRHashFunc
{
    size_t operator()(const std::string& key)
    {
        size_t hash = 0;
        for (auto ch : key)
        {
            hash *= 131;
            hash += ch;
        }
        return hash;
    }
};

struct APHashFunc
{
    size_t operator()(const std::string& key)
    {
        size_t hash = 0;
        const char* str = key.c_str();

        for (int i = 0; *str; i++)
        {
            if ((i & 1) == 0)
            {
                hash ^= ((hash << 7) ^ (*str++) ^ (hash >> 3));
            }
            else
            {
                hash ^= (~(hash << 11) ^ (*str++) ^ (hash >> 5));
            }
        }
        return hash;
    }
};

struct DJBHashFunc
{
    size_t operator()(const std::string& key)
    {
        size_t hash = 5381;
        const char* str = key.c_str();
        while (*str)
        {
            hash += (hash << 5) + (*str++);
        }
        return hash;
    }
};

#include <bitset>
#include <string>

// N is the maximum number of elements to store.
// X is the number of bits allocated per stored element on average.
template<size_t N, size_t X = 5, class K = std::string, class HashFunc1 = BKDRHashFunc, class HashFunc2 = APHashFunc, class HashFunc3 = DJBHashFunc>
class BloomFilter
{
public:
    void set(const K& key)
    {
        // Map the key to three positions with three hash functions.
        size_t hashi1 = HashFunc1()(key) % (X * N);
        size_t hashi2 = HashFunc2()(key) % (X * N);
        size_t hashi3 = HashFunc3()(key) % (X * N);

        _bs.set(hashi1);
        _bs.set(hashi2);
        _bs.set(hashi3);
    }

    bool test(const K& key)
    {
        // If any one of the three mapped bits is 0, the key is definitely absent.
        size_t hashi1 = HashFunc1()(key) % (X * N);
        if (!_bs.test(hashi1))
        {
            return false;
        }

        size_t hashi2 = HashFunc2()(key) % (X * N);
        if (!_bs.test(hashi2))
        {
            return false;
        }

        size_t hashi3 = HashFunc3()(key) % (X * N);
        if (!_bs.test(hashi3))
        {
            return false;
        }

        // All three bits are set, so the key may be present
        // (all three positions could be collisions with other keys).
        return true;
    }

private:
    std::bitset<N * X> _bs;
};

Test false positive rate

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

void test_bloomfilter2()
{
    srand(time(0));
    const size_t N = 100000;
    BloomFilter<N> bf;

    std::vector<std::string> v1;
    std::string url = "https://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html";

    for (size_t i = 0; i < N; ++i)
    {
        v1.push_back(url + std::to_string(i));
    }

    for (auto& str : v1)
    {
        bf.set(str);
    }

    // v2 is a set of strings similar to those in v1, but not identical
    std::vector<std::string> v2;
    for (size_t i = 0; i < N; ++i)
    {
        std::string url = "https://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html";
        url += std::to_string(999999 + i);
        v2.push_back(url);
    }

    size_t n2 = 0;
    for (auto& str : v2)
    {
        if (bf.test(str))
        {
            ++n2;
        }
    }
    std::cout << "False positive rate for similar strings: " << (double)n2 / (double)N << std::endl;

    // Set of dissimilar strings
    std::vector<std::string> v3;
    for (size_t i = 0; i < N; ++i)
    {
        std::string url = "zhihu.com";
        url += std::to_string(i + rand());
        v3.push_back(url);
    }

    size_t n3 = 0;
    for (auto& str : v3)
    {
        if (bf.test(str))
        {
            ++n3;
        }
    }
    std::cout << "False positive rate for dissimilar strings: " << (double)n3 / (double)N << std::endl;
}

Does not support reset

A single bit may be mapped by several keys, so keys share bits. Resetting that bit to delete one key may make another key that really is present test as absent.

Interview questions

1. Given 10 billion integers, design an algorithm to find an integer that appears only once

A plain bitmap only records presence or absence, which is 2 states ==> 1 bit per number, so the 8 bits of a char cover 8 numbers. This question needs 3 states (0 occurrences: 00, 1 occurrence: 01, 2 or more occurrences: 10) ==> 2 bits per number, so the 8 bits of a char cover 4 numbers.

Open two bitmaps. The bits at the same position in the two bitmaps together form a 2-bit state. The first time a number appears, its state goes from 00 to 01; the second time, it goes from 01 to 10; after that it stays at 10. The integers that appear only once are the ones left in state 01.

In general, k bitmaps can distinguish 2^k states per number, so two bitmaps are already enough to tell apart 0, 1, 2, and 3 or more occurrences.

namespace yyq
{
    template<size_t N>
    class twobitset
    {
    public:
        // Record one occurrence of x.
        void set(size_t x)
        {
            if (!_bits1.test(x) && !_bits2.test(x)) // 00: not seen yet
            {
                _bits2.set(x); // 01: seen once
            }
            else if (!_bits1.test(x) && _bits2.test(x)) // 01: seen once
            {
                _bits2.reset(x);
                _bits1.set(x); // 10: seen two or more times
            }
            else // 10: seen two or more times
            {
                // nothing to do
            }
        }

    private:
        std::bitset<N> _bits1;
        std::bitset<N> _bits2;
    };
}
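The two-bitmap scheme still needs a way to read the result back. The following is a minimal sketch of one way to do that; the names TwoBitStates, seen_once and values_seen_once are hypothetical, and a small vector stands in for the 10 billion integers.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical variant of the two-bitmap idea with a query method added.
// State per value is (hi, lo): 00 = never seen, 01 = once, 10 = two or more.
template <std::size_t N>
class TwoBitStates
{
public:
    void add(std::size_t x)
    {
        assert(x < N);
        if (!_hi.test(x) && !_lo.test(x))       // 00 -> 01
        {
            _lo.set(x);
        }
        else if (!_hi.test(x) && _lo.test(x))   // 01 -> 10
        {
            _lo.reset(x);
            _hi.set(x);
        }
        // 10 stays 10
    }

    bool seen_once(std::size_t x) const
    {
        return !_hi.test(x) && _lo.test(x);     // state 01
    }

private:
    std::bitset<N> _hi;
    std::bitset<N> _lo;
};

// Returns, in increasing order, the values that occur exactly once in data.
template <std::size_t N>
std::vector<std::size_t> values_seen_once(const std::vector<std::size_t>& data)
{
    TwoBitStates<N> states;
    for (std::size_t x : data)
    {
        states.add(x);
    }

    std::vector<std::size_t> out;
    for (std::size_t v = 0; v < N; ++v)
    {
        if (states.seen_once(v))
        {
            out.push_back(v);
        }
    }
    return out;
}
```

For the input 3, 7, 3, 9, 7, 7, 1 with N = 16, values_seen_once returns 1 and 9.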

2. Given two files, each with 10 billion integers, and only 1G of memory, how do we find the intersection of the two files?

Use one bitmap for each file. A bitmap deduplicates on its own, and a value belongs to the intersection when its bit is 1 in both bitmaps. For 32-bit integers each bitmap needs 2^32 bits = 512MB, so the two bitmaps together take 1GB.
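The two-bitmap intersection can be sketched at small scale as follows. The helper name bitmap_intersection is made up, two vectors stand in for the two files, and values are assumed to lie in [0, Range); at the full 32-bit range the bitmaps would have to live on the heap rather than in a local std::bitset.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the two vectors stand in for the two files.
template <std::size_t Range>
std::vector<std::size_t> bitmap_intersection(const std::vector<std::size_t>& file_a,
                                             const std::vector<std::size_t>& file_b)
{
    std::bitset<Range> in_a;
    std::bitset<Range> in_b;
    for (std::size_t x : file_a)
    {
        assert(x < Range);
        in_a.set(x);                // repeated values set the same bit: deduplication
    }
    for (std::size_t x : file_b)
    {
        assert(x < Range);
        in_b.set(x);
    }

    // A value is in the intersection when its bit is 1 in both bitmaps.
    const std::bitset<Range> both = in_a & in_b;

    std::vector<std::size_t> result;
    for (std::size_t v = 0; v < Range; ++v)
    {
        if (both.test(v))
        {
            result.push_back(v);
        }
    }
    return result;
}
```

With file_a = 1, 2, 2, 5, 9 and file_b = 2, 5, 5, 7, the result is 2, 5.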

3. Bitmap application variant: a file has 10 billion ints and we have 1G of memory. Design an algorithm to find all integers that appear no more than 2 times.

A 32-bit int has about 4.29 billion (2^32) possible values, so each bitmap needs one position per possible value rather than one per input number. To find the integers that appear no more than twice, use two bitmaps and four states (00: absent, 01: once, 10: twice, 11: three or more times), then filter out the values whose state is 00 or 11.
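The four-state version can be sketched as a 2-bit saturating counter spread over two bitmaps. The names SaturatingCounter2 and at_most_twice are hypothetical, and a small range replaces the full int range.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-bit counter per value: 0, 1, 2, or 3 meaning "three or more".
template <std::size_t N>
class SaturatingCounter2
{
public:
    void add(std::size_t x)
    {
        assert(x < N);
        const unsigned c = count(x);
        if (c < 3)
        {
            store(x, c + 1);        // 00 -> 01 -> 10 -> 11, then stay at 11
        }
    }

    unsigned count(std::size_t x) const
    {
        return (_hi.test(x) ? 2u : 0u) + (_lo.test(x) ? 1u : 0u);
    }

private:
    void store(std::size_t x, unsigned c)
    {
        _hi.set(x, (c & 2u) != 0);  // high bit of the state
        _lo.set(x, (c & 1u) != 0);  // low bit of the state
    }

    std::bitset<N> _hi;
    std::bitset<N> _lo;
};

// Returns the values that occur once or twice, i.e. states 01 and 10.
template <std::size_t N>
std::vector<std::size_t> at_most_twice(const std::vector<std::size_t>& data)
{
    SaturatingCounter2<N> counter;
    for (std::size_t x : data)
    {
        counter.add(x);
    }

    std::vector<std::size_t> out;
    for (std::size_t v = 0; v < N; ++v)
    {
        const unsigned c = counter.count(v);
        if (c == 1 || c == 2)
        {
            out.push_back(v);
        }
    }
    return out;
}
```

For the input 4, 4, 4, 2, 2, 8, 0, 0, 0, 0 with N = 16, the values 0 and 4 end in state 11 and are filtered out, leaving 2 and 8.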

4. Given a log file with a size of more than 100G, where IP addresses are stored in the log, design an algorithm to find the IP address with the most occurrences?

An IP is a string such as 127.0.0.1. A bitmap can only answer the K question (present or not), not the KV question (how many times), so finding the most frequent IP requires counting with a map. 100G will certainly not fit in memory, so we use hash partitioning to split the file into 100 small files (note that this is not an even split). Each small file acts as a hash bucket: convert the IP to an integer with a hash function, i = HashFunc(ip) % 100, and write the IP to file i. The same IP always lands in the same file (different IPs whose hashes collide share a file as well), so we can count occurrences with a map one file at a time and keep the overall maximum.

What if a single small file exceeds 1G? That means many IPs collided into this file, and there are two cases: a. most of them are different IPs; b. most of them are the same IP. How do we handle each?

a. Most are different IPs: a map cannot hold the complete counts in memory. Switch to a different string hash function and split this file again, recursively.

b. Most are the same IP: the map stays small, so it can still be used for counting. In the worst case, fall back to external sorting.

If an insert into the map fails, memory has run out, which means allocating the new node failed. When new fails it throws an exception; catch it and handle the file as in case a.
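The hash-partition-then-count flow can be sketched in memory. This is a simplification under stated assumptions: the function name most_frequent_ip is made up, a vector of buckets stands in for the 100 small files on disk, std::hash replaces the article's HashFunc, and the oversized-bucket cases a and b are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: returns the most frequent IP and its count.
std::pair<std::string, std::size_t> most_frequent_ip(const std::vector<std::string>& log,
                                                     std::size_t bucket_count = 100)
{
    // Step 1: hash partitioning. The same IP always lands in the same bucket.
    std::vector<std::vector<std::string>> buckets(bucket_count);
    std::hash<std::string> hasher;
    for (const std::string& ip : log)
    {
        buckets[hasher(ip) % bucket_count].push_back(ip);
    }

    // Step 2: count one bucket at a time, keeping only the running maximum.
    std::pair<std::string, std::size_t> best("", 0);
    for (const std::vector<std::string>& bucket : buckets)
    {
        std::map<std::string, std::size_t> counts;  // holds a single bucket's counts
        for (const std::string& ip : bucket)
        {
            ++counts[ip];
        }
        for (const auto& kv : counts)
        {
            if (kv.second > best.second)
            {
                best.first = kv.first;
                best.second = kv.second;
            }
        }
    }
    return best;
}
```

Because every occurrence of an IP is in one bucket, the per-bucket count is already that IP's global count, so no merging across buckets is needed.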

5. Given two files (A, B), each with 10 billion queries, we only have 1G memory, how to find the intersection of the two files? The exact algorithm and the approximate algorithm are given respectively.

A query is a query command; for example, it may be a web page request or a database SQL statement.

Exact algorithm: assuming each query is 50 bytes, 10 billion queries take about 500GB. Split each file into 1000 small files (Axx, Bxx) of about 0.5GB each, using the same hash function for both files so that identical queries end up in small files with the same number. Deduplicate each small file, then use a hash table to intersect each Ai with the Bi of the same number, from the first pair to the last. If a small file exceeds 1GB, switch to another hash function and split it again.

Approximate algorithm: use a Bloom filter. Insert every query of one file into the Bloom filter, then test each query of the other file against it; the queries reported as present form the approximate intersection, which may contain false positives.
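The exact algorithm can be sketched in memory as follows. The function name intersect_queries is made up, vectors of strings stand in for the files A and B and for the small files Ai and Bi, and std::hash is used as the shared hash function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical sketch of the exact algorithm at small scale.
std::set<std::string> intersect_queries(const std::vector<std::string>& a,
                                        const std::vector<std::string>& b,
                                        std::size_t parts = 1000)
{
    // Split both inputs with the SAME hash function, so equal queries
    // end up in parts with the same index.
    std::hash<std::string> hasher;
    std::vector<std::vector<std::string>> parts_a(parts);
    std::vector<std::vector<std::string>> parts_b(parts);
    for (const std::string& q : a)
    {
        parts_a[hasher(q) % parts].push_back(q);
    }
    for (const std::string& q : b)
    {
        parts_b[hasher(q) % parts].push_back(q);
    }

    std::set<std::string> result;
    for (std::size_t i = 0; i < parts; ++i)
    {
        // Loading Ai into a hash set deduplicates it.
        std::unordered_set<std::string> in_a(parts_a[i].begin(), parts_a[i].end());
        for (const std::string& q : parts_b[i])
        {
            if (in_a.count(q) != 0)
            {
                result.insert(q);   // present in both Ai and Bi
            }
        }
    }
    return result;
}
```

Only one pair (Ai, Bi) has to be resident at a time, which is what keeps the real, file-based version within the memory limit.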

6. How to extend BloomFilter so that it supports the operation of deleting elements

Use counters: each position stores how many keys are mapped to it, and a reset decrements (--) the counter at each position the key maps to. But a counter cannot be stored in a single bit; each position needs several bits to hold the count, so the space consumption is multiplied several times over. For that reason this scheme is generally not used in practice, and it is better to use a hash table.
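A minimal sketch of the counter idea follows. Everything here is an assumption made for illustration: the class name CountingBloomFilter, the 8-bit counters, and the three BKDR-style hashes with multipliers 31, 131 and 1313.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical counting Bloom filter with one 8-bit counter per position.
class CountingBloomFilter
{
public:
    explicit CountingBloomFilter(std::size_t slots) : _counts(slots, 0) {}

    void insert(const std::string& key)
    {
        for (std::size_t i : positions(key))
        {
            if (_counts[i] < 255) ++_counts[i];     // saturate instead of overflowing
        }
    }

    // Should only be called for keys that were really inserted; erasing a
    // false positive would corrupt the counters of other keys.
    void erase(const std::string& key)
    {
        if (!maybe_contains(key)) return;
        for (std::size_t i : positions(key))
        {
            if (_counts[i] > 0 && _counts[i] < 255) --_counts[i];
        }
    }

    bool maybe_contains(const std::string& key) const
    {
        for (std::size_t i : positions(key))
        {
            if (_counts[i] == 0) return false;      // definitely absent
        }
        return true;                                // possibly present
    }

private:
    static std::size_t bkdr(const std::string& key, std::size_t seed)
    {
        std::size_t h = 0;
        for (unsigned char ch : key) h = h * seed + ch;
        return h;
    }

    std::array<std::size_t, 3> positions(const std::string& key) const
    {
        std::array<std::size_t, 3> p = {{ bkdr(key, 31) % _counts.size(),
                                          bkdr(key, 131) % _counts.size(),
                                          bkdr(key, 1313) % _counts.size() }};
        return p;
    }

    std::vector<std::uint8_t> _counts;
};
```

Each position now costs 8 bits instead of 1, which is the space penalty described above, and erase is only safe for keys that were genuinely inserted.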

Origin blog.csdn.net/m0_61780496/article/details/129766307