[Data structure and algorithm] -> Data structure & algorithm -> Bitmap & Bloom filter -> How to achieve URL deduplication in web crawlers?

Ⅰ Preface

A web crawler is a core subsystem of a search engine, responsible for crawling billions or even tens of billions of web pages. A crawler works by parsing the links in the pages it has already downloaded, and then crawling the pages those links point to.

But the same link may appear in many different pages, which causes the crawler to fetch the same page over and over again during crawling. So how do we avoid this repeated crawling?

The easiest approach that comes to mind is to record the links (that is, the URLs) of the pages we have already crawled. Before crawling a new page, we take its URL and look it up in this list of crawled links. If it is there, the page has already been crawled; if not, it has not been crawled yet and we can go ahead and crawl it. After crawling the page, we add its URL to the list.
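Expressed as code, the record-and-check loop looks something like this minimal sketch. The HashSet is just a stand-in for whatever "list of crawled links" structure we end up choosing, and the class and method names are hypothetical:

import java.util.HashSet;
import java.util.Set;

// A minimal sketch of the record-and-check idea; the HashSet is a
// placeholder for the real deduplication structure discussed below.
public class NaiveDeduper {
    private final Set<String> crawledUrls = new HashSet<>();

    // query: has this URL been crawled already?
    public boolean shouldCrawl(String url) {
        return !crawledUrls.contains(url);
    }

    // add: record the URL after its page has been fetched
    public void markCrawled(String url) {
        crawledUrls.add(url);
    }
}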

The idea is very simple, but how exactly do we record the links that have been crawled? What kind of data structure should we use?

Ⅱ Algorithm analysis

The objects this problem deals with are web page links, that is, URLs. Two operations need to be supported: adding a URL and querying a URL. Beyond these functional requirements, on the non-functional side we want both operations to execute as efficiently as possible. In addition, since we are dealing with hundreds of millions or billions of links, memory consumption will be very large, so storage must also be as space-efficient as possible.

We can recall several basic dynamic data structures, such as hash tables, red-black trees, and skip lists, all of which support fast insertion and lookup. But in terms of memory consumption, can they meet our needs?

Let's take the hash table as an example. Suppose we want to crawl 1 billion web pages, and to detect duplicates we store all 1 billion links in a hash table. Assuming the average URL is 64 bytes long, simply storing these 1 billion URLs requires about 60 GB of memory (10^9 × 64 bytes ≈ 60 GB). But a hash table must keep its load factor small, otherwise too many hash collisions degrade its performance; and a hash table that resolves collisions with chaining also stores linked-list pointers. So if we build these 1 billion URLs into a hash table, the memory required will be much more than 60 GB and may exceed 100 GB.

Of course, for a large search engine even a 100 GB memory requirement is not excessive. Following the idea of divide and conquer, we can use multiple machines, for example 20 machines with 8 GB of memory each, to store these 1 billion links.

[Data structure and algorithm] -> algorithm -> divide and conquer algorithm -> the basic idea of MapReduce

For the problem of crawler URL deduplication, the hash table approach we just discussed does work. However, as engineers who strive for excellence, we should ask whether there is room for further optimization, both in the efficiency of adding and querying data and in memory consumption.

At this point you may be surprised: adding and looking up data in a hash table are already O(1) operations, so is there really room for improvement? In fact, time complexity does not fully represent the actual execution time of code. Big O notation ignores constants, coefficients, and lower-order terms; it counts how many times statements execute, while different statements take different amounts of time to run. Time complexity only describes the trend of execution time as the data size grows; it cannot measure the actual running time at a specific data size.

[Data structure and algorithm] -> Detailed explanation of time complexity and space complexity

[Data Structure and Algorithm] -> Data Structure -> Hash Table (Part 1) -> Idea of Hash Table & Resolution of Hash Conflicts

If the constant coefficient hidden in the time complexity was originally 10, and we can reduce it to 1 through optimization, then execution efficiency improves tenfold even though the time complexity is unchanged. For real software development, a 10x efficiency improvement is obviously well worth pursuing.

Suppose we store the URLs in a hash table that resolves collisions by chaining. When querying, after the hash function locates a linked list, we have to compare our URL against each URL in that list in turn. This operation is time-consuming, mainly for two reasons:

  • On the one hand, the nodes of a linked list are not stored contiguously in memory, so they cannot be loaded into the CPU cache in one batch; the CPU cache is poorly utilized and data access performance suffers.
  • On the other hand, each node in the list holds a URL, and a URL is not a simple number but a string averaging 64 bytes. In other words, we have to string-match the URL being checked against every URL in the list, and such string comparisons are much slower than simple numeric comparisons. For these two reasons, there must be room to optimize execution efficiency.

As for memory consumption, to improve significantly on the hash-table-based solution above we have to go deeper. If we want substantial memory savings, we need a different solution entirely: the storage structure this article is about, the Bloom filter.

Before explaining the Bloom filter, let's first look at another storage structure: the bitmap (BitMap).

Ⅲ Bitmap

Let me give an example similar to the one above, but a bit simpler. Suppose we have 10 million integers, and each integer lies in the range from 1 to 100 million. How do we quickly determine whether a given integer is among these 10 million?

This problem can of course still be solved with a hash table. But here we can use a special kind of hash table: the bitmap. We allocate an array of 100 million elements of Boolean type (true or false), and for each integer we use it as a subscript and set the corresponding array element to true. For example, for the integer 7 we set the element with subscript 7 to true, that is, array[7] = true.

When we query whether some integer K is among the 10 million, we simply read the corresponding element array[K] and check whether it is true. If it is, integer K is among them; otherwise it is not.
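To make this concrete, here is a minimal sketch of the boolean-array idea (the sizes and values are illustrative):

public class BooleanArrayDemo {
    public static void main(String[] args) {
        // one byte per boolean: this array alone takes roughly 100 MB,
        // which motivates the bit-packed version developed below
        boolean[] present = new boolean[100_000_000 + 1];
        present[7] = true;                 // record the integer 7
        System.out.println(present[7]);    // true: 7 is present
        System.out.println(present[9]);    // false: 9 was never recorded
    }
}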

However, the Boolean type in many languages occupies 1 byte, which does not really save much memory. In fact, to represent the two values true and false, we only need a single binary bit. So how do we represent a single bit in a programming language?

This is where bit operations come in. We can take the data types the language provides, such as int, long, or char, and use bit operations on them so that each individual bit represents one number.

If you are interested, you can jump to my article on bit operations linked below. It is written in C, but the underlying principles are the same and the syntax is almost identical to Java. One difference worth noting: unlike C, Java has two right-shift operators.

>>> denotes a logical right shift: it ignores the sign bit and fills the vacated high bits with zeros.
>> denotes an arithmetic right shift: it preserves the sign bit of signed data.
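A quick Java illustration of the difference:

public class ShiftDemo {
    public static void main(String[] args) {
        int x = -8;                    // bit pattern 0xFFFFFFF8
        System.out.println(x >> 1);    // -4: arithmetic shift copies the sign bit
        System.out.println(x >>> 1);   // 2147483644: logical shift fills with zeros
    }
}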

Everything else is covered in detail in the article below.

[C language basics] -> Detailed analysis of bit operations -> Use of bit operations

Let's go straight to the code and understand the bitmap through it.

package com.tyz.bloom_filter.core;

/**
 * Bitmap: each number in [0, nbits] is represented by a single binary bit.
 * @author Tong
 */
public class BitMap {

	private char[] bytes;
	private int nbits;

	public BitMap(int nbits) {
		this.nbits = nbits;
		this.bytes = new char[nbits / 16 + 1]; // a Java char occupies two bytes, i.e. 16 bits
	}

	/**
	 * Set the bit for number k to true.
	 * @param k the number to record
	 */
	public void set(int k) {
		if (k > this.nbits) {
			return; // out of range, ignore
		}
		int byteIndex = k / 16;  // which char holds the bit
		int bitIndex = k % 16;   // which bit within that char
		this.bytes[byteIndex] |= (1 << bitIndex);
	}

	/**
	 * Read the bit for number k.
	 * @param k the number to query
	 * @return true if k has been set, false otherwise
	 */
	public boolean get(int k) {
		if (k > this.nbits) {
			return false; // out of range, treated as absent
		}
		int byteIndex = k / 16;
		int bitIndex = k % 16;

		return (this.bytes[byteIndex] & (1 << bitIndex)) != 0;
	}
}

The char type here is really just a container for bits; other data types such as int can be used as well. Simply change the 16 in the code to 32.
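A quick usage sketch of the class above (the size is illustrative):

public class BitMapDemo {
    public static void main(String[] args) {
        BitMap bitmap = new BitMap(100_000_000); // 100 million bits ≈ 12.5 MB
        bitmap.set(7);
        System.out.println(bitmap.get(7)); // true
        System.out.println(bitmap.get(9)); // false
    }
}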

From the description above you can see that a bitmap locates data by array subscript, so access is very fast; and since each number is represented by a single binary bit, when the range of the numbers is small the memory required is very low.

Take the earlier example: storing those 10 million numbers in a hash table, with each number a 32-bit integer needing 4 bytes, requires at least 40 MB of storage. With a bitmap, since the numbers range from 1 to 100 million, we need only 100 million bits, about 12.5 MB of storage.

The bitmap, however, rests on an assumption: that the range of the numbers is not too large. If the range is very large — say, in the earlier question, not 1 to 100 million but 1 to 1 billion — then the bitmap needs 1 billion bits, about 120 MB, and compared with the 40 MB of storing the numbers directly, memory consumption goes up instead of down.

This is where the Bloom filter comes in. The Bloom filter is an improvement on the bitmap, designed precisely to solve this problem.

Ⅳ Bloom filter

Sticking with the previous example: 10 million numbers, now ranging from 1 to 1 billion. The Bloom filter approach still uses a bitmap of 100 million bits, but first runs each number through a hash function so that the result falls within the bitmap's range. For example, we can design the hash function as f(x) = x % n, where x is the number and n is the size of the bitmap (100 million); that is, we take the number modulo the bitmap size.

Is this enough, though? Take the two numbers 1 and 100,000,001: after this hash function, both have a hash value of 1, so we still cannot tell whether the bitmap stored 1 or 100,000,001.

To reduce the probability of such collisions, we could of course design a more complex, more random hash function. But are there other ways?

Let's look at the Bloom filter's approach. Since a single hash function can collide, what if multiple hash functions jointly locate a piece of data — can that reduce the probability of conflict? Let me explain in detail how the Bloom filter does this.

We use K hash functions to compute K different hash values for the same number, denoted X1, X2, X3, ..., XK. We use these K values as subscripts into the bitmap and set the corresponding BitMap[X1], BitMap[X2], BitMap[X3], ..., BitMap[XK] to true. In other words, we use K binary bits to represent the existence of one number.

When we want to query whether a number exists, we compute its hash values with the same K hash functions, obtaining Y1, Y2, Y3, ..., YK, and then check whether the bitmap values at all K positions are true. If they are all true, the number exists; if any one of them is not true, the number does not exist.
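Expressed in code, a minimal Bloom filter sketch might look like the following. It builds on the BitMap class above; the K = 3 seeded multiplicative hash functions are illustrative assumptions of mine, not a prescribed design:

// A minimal Bloom filter sketch built on the BitMap class above.
// The K = 3 seeded hash functions are illustrative choices only.
public class BloomFilter {
    private static final long[] SEEDS = {31, 131, 1313}; // K = 3 hash functions
    private final BitMap bitmap;
    private final int nbits;

    public BloomFilter(int nbits) {
        this.nbits = nbits;
        this.bitmap = new BitMap(nbits);
    }

    // map x into [0, nbits); long overflow merely scrambles bits, which is fine here
    private int hash(long x, long seed) {
        return (int) Math.floorMod(x * seed, (long) this.nbits);
    }

    public void add(long x) {
        for (long seed : SEEDS) {
            this.bitmap.set(hash(x, seed)); // set all K positions to true
        }
    }

    public boolean contains(long x) {
        for (long seed : SEEDS) {
            if (!this.bitmap.get(hash(x, seed))) {
                return false; // any unset bit: x was definitely never added
            }
        }
        return true; // all K bits set: probably present (may be a false positive)
    }
}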

For two different numbers, a single hash function may produce the same hash value; but the probability that all K hash values coincide after K hash functions is very low. Yet while using K hash functions lowers the collision probability, it introduces a new problem: false positives. Take the following example: on the left, 146 and 196 are stored into the bitmap; on the right, we query whether the number 177 exists.

(Figure: 146 and 196 are inserted into the bitmap; the query for 177 finds all of its bits already set by 146 and 196, so 177 is wrongly judged to be present.)
One important property of the Bloom filter's misjudgment is that it only errs in one direction: toward "exists". If the Bloom filter says a number does not exist, then it truly does not exist; no misjudgment occurs there. If the Bloom filter says a number exists, it may be wrong, and the number may actually not exist. However, as long as we tune the number of hash functions and the ratio of bitmap size to the number of stored values, this false-positive probability can be driven very low.

Although the Bloom filter can misjudge, this does not stop it from being very useful; many scenarios tolerate a certain rate of false positives. Take our crawler deduplication problem: even if a page that has never been crawled is occasionally misjudged as already crawled, it is not a big deal for a search engine and can be tolerated. After all, there are far too many web pages for any search engine to crawl 100% of them.

With the Bloom filter understood, our crawler deduplication problem becomes very simple.

We use a Bloom filter to record the links that have already been crawled. Suppose there are 1 billion pages to check; we can use a bitmap 10 times that size, i.e. 10 billion bits, which converted to bytes is about 1.2 GB. Earlier, deduplicating with a hash table required at least 100 GB. By comparison, the Bloom filter cuts the storage space dramatically.
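As a hedged sketch of the resulting deduplication loop (the filter size and the string-to-long hashing are illustrative, and fetchPage is a hypothetical placeholder; a real deployment would size the filter for billions of URLs):

public class CrawlerDedupDemo {
    public static void main(String[] args) {
        BloomFilter crawled = new BloomFilter(100_000_000); // 100M bits ≈ 12.5 MB

        String url = "https://example.com/page";
        long key = url.hashCode() & 0xFFFFFFFFL; // widen the 32-bit string hash to a long
        if (!crawled.contains(key)) {
            // fetchPage(url);   // hypothetical crawl step
            crawled.add(key);    // record the URL as crawled
        }
    }
}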

Now let's look again at execution efficiency: is the Bloom filter also faster than the hash table?

The Bloom filter processes each link with multiple hash functions: the CPU reads the link from memory once and performs several hash computations, so in theory the whole operation is CPU-intensive. With the hash table, we must read out multiple links that share the same hash value (hash collisions) and string-match each one against the link being checked; this involves many memory reads, so it is memory-intensive. Since CPU computation is generally much faster than memory access, the Bloom filter's way of checking duplicates is, in theory, faster.

Ⅴ Summary

From the discussion above, we know the Bloom filter is well suited to large-scale deduplication scenarios that do not demand 100% accuracy and can tolerate a small probability of false positives. Besides crawler page deduplication, another example is counting a large website's daily UV, i.e., how many distinct users visit the site each day: we can use a Bloom filter to filter out users who visit repeatedly.

We also discussed the Bloom filter's false-positive rate, which mainly depends on the number of hash functions and the size of the bitmap. As we keep adding data, fewer and fewer bitmap positions remain unset, and the false-positive rate climbs. So when we cannot know the data volume in advance, we need to support automatic expansion.

When the ratio of stored items to bitmap size in the Bloom filter exceeds some threshold, we allocate a new bitmap, and subsequent data goes into the new one. But then, to determine whether a piece of data is already in the Bloom filter, we have to check multiple bitmaps, so query efficiency drops correspondingly.
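As a rough sketch of this expansion idea (the fixed per-filter capacity is an assumption for illustration; real scalable Bloom filter designs also tune the size and error rate of each new level):

import java.util.ArrayList;
import java.util.List;

// Expansion sketch: once the current filter holds capacityPerFilter items,
// open a fresh one; queries must consult every filter, hence slower lookups.
public class ExpandableBloomFilter {
    private final List<BloomFilter> filters = new ArrayList<>();
    private final int nbitsPerFilter;
    private final int capacityPerFilter; // threshold before opening a new bitmap
    private int countInCurrent = 0;

    public ExpandableBloomFilter(int nbitsPerFilter, int capacityPerFilter) {
        this.nbitsPerFilter = nbitsPerFilter;
        this.capacityPerFilter = capacityPerFilter;
        this.filters.add(new BloomFilter(nbitsPerFilter));
    }

    public void add(long x) {
        if (countInCurrent >= capacityPerFilter) { // threshold exceeded: new bitmap
            filters.add(new BloomFilter(nbitsPerFilter));
            countInCurrent = 0;
        }
        filters.get(filters.size() - 1).add(x);
        countInCurrent++;
    }

    public boolean contains(long x) {
        for (BloomFilter f : filters) { // every bitmap must be checked
            if (f.contains(x)) {
                return true;
            }
        }
        return false;
    }
}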

Bitmaps and Bloom filters are widely used, and many programming languages ship implementations. For example, Java's BitSet class is a bitmap, Redis provides bitmap commands, and Google's Guava library provides a BloomFilter implementation.
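For instance, assuming Guava is on the classpath, the two can be used roughly like this:

import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class LibraryDemo {
    public static void main(String[] args) {
        // java.util.BitSet: the JDK's built-in bitmap
        BitSet bits = new BitSet();
        bits.set(7);
        System.out.println(bits.get(7)); // true

        // Guava's BloomFilter, sized by expected insertions and target
        // false-positive probability
        BloomFilter<String> urls = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000, // expected insertions
                0.01);     // acceptable false-positive rate
        urls.put("https://example.com");
        System.out.println(urls.mightContain("https://example.com")); // true
    }
}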
