Bloom filter and the inverted index

Bloom filter

Why the Bloom filter was proposed:

Consider the problem of duplicate pushes in a news recommendation system. The server records the browsing history of every user, and when recommending news it must filter out, from each user's history, the records the user has already seen. How can we quickly check whether a record already exists?
Here are three ways:

  1. Store the user records in a hash table. Disadvantage: it wastes a lot of space.
  2. Store the user records in a bitmap. Disadvantage: hash collisions cannot be handled.
  3. Combine the bitmap with hash functions, i.e., a Bloom filter.
Bloom filter concept:

It is a compact and rather clever probabilistic data structure.
Features:
Efficient insertion and query. Because it is probabilistic, it can only tell you that an element "definitely does not exist" or "may exist."
Essence:
It uses multiple hash functions to map a piece of data into a bitmap structure.
Benefits:
It improves query efficiency and saves a lot of memory space.


Bloom filter insertion, lookup, and their defect:

Insertion: an element is mapped by several hash functions into the bitmap, and each bit position it maps to is set to 1.
Lookup:
compute each hash value and check whether the corresponding bit is 0. As long as any one of them is 0, the element is definitely not in the filter; otherwise it may be in the filter.

Insert baidu and Tencent:
baidu sets bits 1, 4, 7 to 1.
Tencent sets bits 3, 4, 8 to 1.

If we now insert alibaba:
alibaba sets bits 1, 3, 8 to 1.
At this point the new data overlaps with bits already set by the previously inserted data.

If we now look up baidu, we only need to find that bits 1, 4, 7 are all 1.
If we now look up toutiao, suppose toutiao corresponds to bits 4, 7, 8.
Because bits 4, 7, 8 have all been set to 1 by other elements, the Bloom filter reports that toutiao is in the filter, even though it was never actually inserted.
This is the defect of the Bloom filter.

Note:
Because several elements share bits and the hash functions collide, false positives occur. The guarantee is: if the Bloom filter says an element does not exist, the element definitely does not exist.
If it says an element exists, the element only may exist.

Deletion in a Bloom filter:

Using the example above:
If we are asked to delete the element tencent and we directly set bits 3, 4, 8 to 0, then because baidu overlaps with tencent on bit 4, the element baidu would effectively be deleted as well. Direct deletion can therefore cause accidental deletion of other elements.

A deletion method that works:
Expand each bit of the Bloom filter into a small counter. When inserting an element, increment the k counters (the k positions computed by the k hash functions); when deleting an element, decrement those k counters. This supports deletion at the cost of several times more storage space.
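
To make the counting idea concrete, here is a minimal counting Bloom filter sketch. It is an illustration under assumptions, not code from this article: the class name, the 8-bit counters, and the three hash functor parameters HF1..HF3 are all made up for the example.

#include <array>
#include <cstdint>

// Minimal counting Bloom filter sketch: each "bit" becomes a small counter.
// HF1..HF3 stand for any hash functor types (e.g. wrappers around the hash
// functions listed later in this article).
template<size_t N, class T, class HF1, class HF2, class HF3>
class CountingBloomFilter
{
public:
    void Insert(const T& data)
    {
        // Increment the k counters instead of setting k bits.
        ++_counters[HF1()(data) % N];
        ++_counters[HF2()(data) % N];
        ++_counters[HF3()(data) % N];
    }

    void Erase(const T& data)
    {
        // Decrement the k counters; this is only safe if the element was
        // really inserted, otherwise other elements' counters get corrupted.
        size_t i1 = HF1()(data) % N, i2 = HF2()(data) % N, i3 = HF3()(data) % N;
        if (_counters[i1]) --_counters[i1];
        if (_counters[i2]) --_counters[i2];
        if (_counters[i3]) --_counters[i3];
    }

    bool IsIn(const T& data) const
    {
        // All k counters non-zero -> "may exist"; any zero counter -> "definitely not".
        return _counters[HF1()(data) % N] != 0
            && _counters[HF2()(data) % N] != 0
            && _counters[HF3()(data) % N] != 0;
    }

private:
    std::array<uint8_t, N> _counters{};   // 8-bit counters, so they can wrap around
};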

Defects of counting deletion:
1. It still cannot confirm whether the element is really in the Bloom filter before deleting it.
2. Counter wrap-around may occur.

Bloom filter advantages:
  1. The time complexity of inserting and querying an element is O(K) (K is the number of hash functions, usually small), independent of the amount of data.
  2. The hash functions are independent of each other, which makes parallel computation in hardware convenient.
  3. The Bloom filter does not store the elements themselves, which is an advantage in situations with strict confidentiality requirements.
  4. When a certain rate of misjudgment can be tolerated, a Bloom filter has a large space advantage over other data structures.
  5. When the amount of data is very large, a Bloom filter can represent the full set, which other data structures cannot.
  6. Bloom filters built with the same set of hash functions support intersection, union, and difference operations (see the sketch after this list).
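
To illustrate point 6, here is a minimal sketch under the assumption that both filters use the same size and the same hash functions: union and intersection then reduce to bitwise OR and AND on the underlying bitmaps (the intersection result is itself still only approximate). The function names are made up for the example.

#include <bitset>

// Union of two Bloom filters built with the same hash functions and size:
// anything inserted into either filter is reported as present by the result.
template<size_t N>
std::bitset<N> BloomUnion(const std::bitset<N>& a, const std::bitset<N>& b)
{
    return a | b;
}

// Approximate intersection: anything inserted into both filters is reported,
// but the result can have more false positives than either input.
template<size_t N>
std::bitset<N> BloomIntersection(const std::bitset<N>& a, const std::bitset<N>& b)
{
    return a & b;
}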
Bloom filter defects:
  1. There are false positives, i.e., it cannot accurately determine whether an element is in the set (remedy: additionally maintain a whitelist storing the data likely to be misjudged).
  2. The element itself cannot be retrieved.
  3. In general, elements cannot be deleted from a Bloom filter.
  4. If counting deletion is used, counter wrap-around may occur.
#pragma once

#include <bitset>

namespace bite
{
	// Bloom filter over N bits; T is the element type, HF1..HF5 are five
	// hash functor types used to map an element to five bit positions.
	template<size_t N, class T,
	         class HF1,
	         class HF2,
	         class HF3,
	         class HF4,
	         class HF5>
	class BloomFilter
	{
	public:
		BloomFilter()
			: _size(0)
		{}

		void Insert(const T& data)
		{
			// Compute five positions and set the corresponding bits to 1.
			size_t index1 = HF1()(data) % N;
			size_t index2 = HF2()(data) % N;
			size_t index3 = HF3()(data) % N;
			size_t index4 = HF4()(data) % N;
			size_t index5 = HF5()(data) % N;

			_bs.set(index1);
			_bs.set(index2);
			_bs.set(index3);
			_bs.set(index4);
			_bs.set(index5);

			++_size;
		}

		bool IsIn(const T& data)
		{
			// If any one of the five bits is 0, the element is definitely absent.
			if (!_bs.test(HF1()(data) % N))
				return false;
			if (!_bs.test(HF2()(data) % N))
				return false;
			if (!_bs.test(HF3()(data) % N))
				return false;
			if (!_bs.test(HF4()(data) % N))
				return false;
			if (!_bs.test(HF5()(data) % N))
				return false;

			// All five bits are 1: the element may exist (possible false positive).
			return true;
		}

	private:
		std::bitset<N> _bs;
		size_t _size;   // number of inserted elements
	};
}
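
A minimal usage sketch of the class above, assuming small functor wrappers around the string hash functions listed further down (the wrapper names, the bit count of 100000, and main itself are made up for the example, and the hash functions are assumed to be declared before this point).

#include <iostream>
#include <string>

// Hypothetical functor wrappers so the free hash functions below can be
// passed as the HF1..HF5 template parameters.
struct BKDR { size_t operator()(const std::string& s) { return BKDRHash(s.c_str()); } };
struct SDBM { size_t operator()(const std::string& s) { return SDBMHash(s.c_str()); } };
struct RS   { size_t operator()(const std::string& s) { return RSHash(s.c_str()); } };
struct AP   { size_t operator()(const std::string& s) { return APHash(s.c_str()); } };
struct JS   { size_t operator()(const std::string& s) { return JSHash(s.c_str()); } };

int main()
{
    bite::BloomFilter<100000, std::string, BKDR, SDBM, RS, AP, JS> bf;
    bf.Insert("baidu");
    bf.Insert("tencent");

    std::cout << bf.IsIn("baidu") << std::endl;     // 1: may exist
    std::cout << bf.IsIn("toutiao") << std::endl;   // usually 0, but a false positive is possible
    return 0;
}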
Questions:

1. Given two files, each containing 10 billion queries, and only 1G of memory, how do you find the intersection of the two files? Give both an exact algorithm and an approximate algorithm.
10,000,000,000 queries * 4 bytes = 40G per file, far more than fits in memory.

Exact algorithm: hash splitting
Split each of the two files into sub-files: use the same hash function (for example one of the five hash functions attached below) to convert every query into an integer hash value, then compute index (the sub-file index) = hash % 1000, so that identical queries are always assigned to sub-files with the same index.
Compare the pairs of small files that share the same index and find their intersections.
The union of these intersections is the intersection of the two whole files.
The complexity of the algorithm is O(N).
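
A minimal sketch of the splitting step, under the assumptions that each query is one line of text, that BKDRHash from below is used as the common hash function, and that the sub-file naming is as shown; none of these details come from the article.

#include <fstream>
#include <string>

// Split one large query file into 1000 sub-files so that identical queries
// always land in the sub-file with the same index.
void HashSplit(const std::string& bigFile, const std::string& prefix)
{
    const size_t kBuckets = 1000;
    std::ifstream in(bigFile);
    std::string query;
    while (std::getline(in, query))
    {
        size_t index = BKDRHash(query.c_str()) % kBuckets;   // sub-file index
        // Append to the chosen sub-file (opened per line for simplicity).
        std::ofstream out(prefix + std::to_string(index), std::ios::app);
        out << query << '\n';
    }
}
// After both big files are split with the same hash function, only the pairs
// of sub-files that share an index need to be compared, e.g. by loading the
// smaller one into a std::unordered_set and probing it with the other.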

Approximate algorithm: Bloom filter
First use the same hash functions (for example the five attached below) to map every query of the first file into a Bloom filter bitmap.
Then test every query of the second file against that bitmap; the queries reported as present form the (approximate) intersection.
The complexity of the algorithm is O(N).
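
A minimal sketch of the approximate version, reusing the BloomFilter class and the hypothetical hash functor wrappers from the earlier usage sketch, and again assuming one query per line; the bitmap size of one billion bits (about 125 MB) is an assumption chosen to fit the 1G memory budget.

#include <fstream>
#include <iostream>
#include <string>

// Approximate intersection of two query files using a Bloom filter:
// insert every query of fileA, then test every query of fileB against it.
void ApproxIntersection(const std::string& fileA, const std::string& fileB)
{
    // ~125 MB of bits; declared static so the large bitset lives in static
    // storage rather than on the stack.
    static bite::BloomFilter<1000000000, std::string, BKDR, SDBM, RS, AP, JS> bf;

    std::ifstream inA(fileA);
    std::string query;
    while (std::getline(inA, query))
        bf.Insert(query);                  // map every query of file A into the bitmap

    std::ifstream inB(fileB);
    while (std::getline(inB, query))
        if (bf.IsIn(query))                // "may exist": goes into the approximate intersection
            std::cout << query << '\n';    // may contain some false positives
}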

2. How to extend a Bloom filter so that it supports deleting elements?
Keep a reference count for each bit, as in the counting Bloom filter sketch shown earlier.

Hash functions:

// BKDR string hash: hash = hash * 131 + ch
template<class T>
size_t BKDRHash(const T* str)
{
	size_t hash = 0;
	while (size_t ch = (size_t)*str++)
	{
		hash = hash * 131 + ch;
	}
	return hash;
}

// SDBM string hash: hash = hash * 65599 + ch
template<class T>
size_t SDBMHash(const T* str)
{
	size_t hash = 0;
	while (size_t ch = (size_t)*str++)
	{
		hash = 65599 * hash + ch;
	}
	return hash;
}

// RS string hash: hash = hash * magic + ch, where magic itself keeps changing
template<class T>
size_t RSHash(const T* str)
{
	size_t hash = 0;
	size_t magic = 63689;
	while (size_t ch = (size_t)*str++)
	{
		hash = magic * hash + ch;
		magic *= 378551;
	}
	return hash;
}

// AP string hash: alternates two mixing formulas on even/odd positions
template<class T>
size_t APHash(const T* str)
{
	size_t hash = 0;
	size_t ch;
	for (long i = 0; (ch = (size_t)*str++) != 0; i++)
	{
		if ((i & 1) == 0)
		{
			hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
		}
		else
		{
			hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
		}
	}
	return hash;
}

// JS string hash
template<class T>
size_t JSHash(const T* str)
{
	if (!*str)
		return 0;
	size_t hash = 1315423911;
	while (size_t ch = (size_t)*str++)
	{
		hash ^= ((hash << 5) ^ ch ^ (hash >> 2));
	}
	return hash;
}

// DEK string hash: rotate the hash and XOR in the character
template<class T>
size_t DEKHash(const T* str)
{
	if (!*str)
		return 0;
	size_t hash = 1315423911;
	while (size_t ch = (size_t)*str++)
	{
		hash = ((hash << 5) ^ (hash >> 27)) ^ ch;
	}
	return hash;
}

// FNV-1 string hash: multiply by the FNV prime, then XOR in the character
template<class T>
size_t FNVHash(const T* str)
{
	if (!*str)
		return 0;
	size_t hash = 2166136261u;   // 32-bit FNV offset basis
	while (size_t ch = (size_t)*str++)
	{
		hash *= 16777619;
		hash ^= ch;
	}
	return hash;
}

Inverted index

Concept:

With an inverted index, instead of using a record to determine its attribute values, the attribute values are used to determine the positions of the records that contain them.
Composition:
a word dictionary and an inverted (posting) file.
An inverted index is usually expressed as: a keyword, its frequency (number of occurrences), and its positions (which article or web page it appears in, together with related information such as date and author). It is equivalent to building an index over the hundreds of billions of pages on the Internet, like the table of contents of a book or the labels on a shelf: a reader who wants chapters about a certain topic looks them up directly in the table of contents, instead of searching the book page by page from the first page to the last.

Simply put, the mapping [article -> keywords] is reversed into [keyword -> articles].

A detailed analysis is not given here; see the reference material for details.
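
As a rough sketch of the [keyword -> articles] direction, the following builds a tiny in-memory inverted index; the container choices and the integer document ids are assumptions made for the example, not part of the article.

#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Tiny in-memory inverted index: keyword -> set of document ids.
std::map<std::string, std::set<int>> BuildInvertedIndex(const std::vector<std::string>& docs)
{
    std::map<std::string, std::set<int>> index;
    for (int id = 0; id < (int)docs.size(); ++id)
    {
        std::istringstream words(docs[id]);
        std::string word;
        while (words >> word)
            index[word].insert(id);   // record that this word appears in document id
    }
    return index;
}

int main()
{
    std::vector<std::string> docs = { "bloom filter bitmap", "inverted index keyword", "bloom index" };
    auto index = BuildInvertedIndex(docs);
    for (int id : index["bloom"])             // look up articles by keyword
        std::cout << "bloom appears in document " << id << '\n';
    return 0;
}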

Question:

1. Given thousands of files, each between 1K and 100M in size, and n words, design an algorithm that finds, for each of the n words, all the files that contain it, using only 100K of memory.

Answer: Generate a Bloom filter for each of the thousands of files and store all of these Bloom filters in one Filter file. Divide the memory into two halves: one half for reading in a Bloom filter, the other half for reading a file, and continue until every Bloom filter has been read.

Use an info file to record the n words and the information about which files contain them.

If needed, first split the n words into x parts and generate a Bloom filter from each part (because generating a single Bloom filter from all n words at once might not fit in memory). All generated Bloom filters are stored in the Filter file in external memory.

The memory buffer is divided into two halves: one half reads in one Bloom filter at a time, the other half reads a file (reading the file through this buffer amounts to a producer-consumer model used for synchronization). Large files can be split into smaller files, but the mapping that records which large file each small file came from must be stored.

For each word that is read, use the Bloom filter currently in memory to decide whether it contains the value. If it does not, read the next Bloom filter from the Filter file into memory, until one contains it or all Bloom filters have been traversed. If it is contained, update the info file. Repeat until all the data has been processed, then delete the Filter file.
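
A rough in-memory sketch of the first idea in the answer (one Bloom filter per file, every word tested against every file's filter). It deliberately ignores the 100K memory limit and the on-disk Filter file to stay short, and reuses the BloomFilter class and the hypothetical hash functor wrappers from the earlier sketches; all names and sizes are assumptions.

#include <fstream>
#include <map>
#include <set>
#include <string>
#include <vector>

// One small Bloom filter per file, built from the words of that file.
using FileFilter = bite::BloomFilter<8000, std::string, BKDR, SDBM, RS, AP, JS>;

std::map<std::string, std::set<std::string>>
FindFilesContainingWords(const std::vector<std::string>& files,
                         const std::vector<std::string>& words)
{
    std::vector<FileFilter> filters(files.size());
    for (size_t i = 0; i < files.size(); ++i)
    {
        std::ifstream in(files[i]);
        std::string w;
        while (in >> w)
            filters[i].Insert(w);             // build this file's filter from its words
    }

    // word -> set of files that (probably) contain it; hits can be false
    // positives, so they could be re-checked against the real file contents.
    std::map<std::string, std::set<std::string>> result;
    for (const std::string& word : words)
        for (size_t i = 0; i < files.size(); ++i)
            if (filters[i].IsIn(word))
                result[word].insert(files[i]);
    return result;
}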
