Teach You to Write a Web Crawler (7): URL Deduplication

 


Author: Takumi

Summary: Writing a Crawler from Scratch, A Crash Guide for Beginners!


In this issue, let's talk about URL deduplication. Previously we used a Python dictionary to record crawled URLs, so that the same page is never fetched more than once. The crawler keeps the URLs to be crawled in a todo queue and extracts new URLs from the pages it downloads. Before a new URL is put into the queue, we must first check whether it has already been crawled; if it has, it is not enqueued again.

Unlike a stand-alone crawler, a distributed crawler should keep these URLs in a shared cache so that multiple crawler instances can use them. We will keep using Redis for this. The URLs can be stored in a Redis Set, or each URL can be stored as a Redis String key. The pros and cons of these two schemes are left for the reader to think about.
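As an illustration of the Set scheme, a minimal sketch using the Jedis client might look like this (the key name, host, and port are assumptions for illustration, not from the article):

import redis.clients.jedis.Jedis;

public class RedisUrlDeduper {
    // Hypothetical key holding all URLs that have already been crawled
    private static final String SEEN_KEY = "crawler:seen_urls";
    private final Jedis jedis = new Jedis("localhost", 6379);

    /** Returns true if the URL is new and was recorded; false if it was seen before. */
    public boolean markIfNew(String url) {
        // SADD returns the number of members actually added (0 means already present)
        return jedis.sadd(SEEN_KEY, url) == 1;
    }

    /** Pure membership check, without modifying the set. */
    public boolean hasSeen(String url) {
        return jedis.sismember(SEEN_KEY, url);
    }
}

Because the check-and-insert is a single SADD call, multiple crawler instances can share the same Redis instance without coordinating with each other.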

 

Store URL directly

Store each URL in memory directly as a string. Estimating conservatively that an average URL is 100 bytes long, 100 million URLs occupy about 100,000,000 × 0.0001 MB = 10,000 MB, roughly 10 GB. This is still workable: however much space it takes, you can always add capacity.

The question is: what if one server cannot hold all these URLs? That is also simple to solve. Give each server a clear division of labor, so that when you get a URL you know which server should store it, and each server stores only its own slice of URLs. The simplest implementation is to hash the URL and take it modulo the number of servers, as in the sketch below. This works, but it still leaves a lot of room for optimization.
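A minimal sketch of this hash-then-modulo routing (the server list and class name are hypothetical):

import java.util.List;

public class UrlSharder {
    private final List<String> servers; // e.g. ["redis-0:6379", "redis-1:6379", ...]

    public UrlSharder(List<String> servers) {
        this.servers = servers;
    }

    /** Every crawler instance maps the same URL to the same dedup server. */
    public String serverFor(String url) {
        // floorMod keeps the shard index non-negative even if hashCode() is negative
        int shard = Math.floorMod(url.hashCode(), servers.size());
        return servers.get(shard);
    }
}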

 

Store message digests

MD5 is a message digest algorithm with a wide range of uses; here we use it to compress URLs.

The characteristics of the message digest algorithm:

  1. No matter how long the input message is, the length of the calculated message digest is always fixed.
  2. Different input messages produce different digests (with overwhelming probability), and the same input always produces the same digest.
  3. Message digests are one-way and irreversible: the digest can be computed from the message, but the original message cannot be recovered from the digest.

These properties mean we can deduplicate by storing the MD5 of each URL instead of the URL itself: different URLs have different MD5s, and identical URLs have identical MD5s.

The URL we wanted to save before looks like this: http://news.baidu.com/ns?ct=1&rn=20&ie=utf-8&bs=%E4%BA%AC%E4%B8%9C%E9%87%91%E8%9E%8D&rsv_bp=1&sr=0&cl=2&f=8&prevct=no&tn=news&word=%E4%BA%AC%E4%B8%9C%E9%87%91%E8%9E%8D&rsv_sug3=1&rsv_sug4=6&rsv_sug1=1&rsv_sug=1

The corresponding MD5 value is this: d552b0b40e21d06d73a1a0938635eb1a

How about that? It saves a lot of space, right?
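For reference, a minimal sketch of computing such a digest with the JDK's MessageDigest (the class and method names are my own, not from the article):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UrlDigest {
    /** Returns the 32-character hex MD5 of a URL. */
    public static String md5Hex(String url) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(url.getBytes(StandardCharsets.UTF_8));
        // %032x pads with leading zeros so the result is always 32 hex characters
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex("http://news.baidu.com/"));
    }
}

Storing these fixed-length 32-character digests in the Redis Set instead of the raw URLs is the compression described above.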

Some readers will say: Takumi, don't fool me. The input of this algorithm is an infinite set and the output is a finite set, so collisions are inevitable; that is, different URLs can produce the same MD5. That would cause false positives during deduplication, and some pages would never be crawled!

Well, in theory this is bound to happen. But how likely is it? Let's look at the chance that two different URLs produce the same message digest.

Below are three common message digest algorithms. Their digests are 32, 64, and 128 hexadecimal characters long respectively (each character encodes 4 bits), so the number of possible digest values is:

md5:    16^32  = 2^128 ≈ 3.4 × 10^38

sha256: 16^64  = 2^256 ≈ 1.2 × 10^77

sha512: 16^128 = 2^512 ≈ 1.3 × 10^154

You might say: these numbers are too big to picture. Fine, here are two intuitive references:

 

Number of IPv6 addresses: 2^128 (about 3.4 × 10^38)

IPv6 is the next-generation IP protocol designed by the IETF to replace the current IPv4. It is often said that IPv6 could assign an address to every grain of sand on Earth.

 

Total number of atoms in the observable universe: 10^80

The Hubble telescope once pointed at one small patch of the celestial sphere (about 1/12,700,000 of the whole sphere) for a long exposure and found roughly 10,000 galaxies in that patch. Extrapolating, the observable universe contains on the order of 1.27 × 10^11 galaxies.

The generally accepted order of magnitude for the number of stars in a galaxy (ignoring planets) is 4 × 10^11.

The mass of a Sun-like star is about 1.96 × 10^30 kg.

Multiplying these together puts the total stellar mass of the observable universe at about 9.96 × 10^52 kg, i.e. 9.96 × 10^55 g.

The mass of a hydrogen atom is 1.66 × 10^-24 g.

Dividing the mass of the universe by the mass of a hydrogen atom gives roughly 10^80 atoms in the observable universe.
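Putting these numbers together (with the mass expressed in grams so the units cancel):

\[ 1.27\times10^{11} \times 4\times10^{11} \times 1.96\times10^{30}\,\mathrm{kg} \approx 9.96\times10^{52}\,\mathrm{kg} = 9.96\times10^{55}\,\mathrm{g} \]

\[ \frac{9.96\times10^{55}\,\mathrm{g}}{1.66\times10^{-24}\,\mathrm{g/atom}} \approx 6\times10^{79} \approx 10^{80}\ \text{atoms} \]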

So the chance that two different URLs produce the same message digest is vanishingly small, like finding a needle in a haystack... no, more like fishing a particular atom out of the universe. Use it with confidence.

The message digest compresses the URL, but the result is still on the same order of magnitude as the original, so space efficiency does not improve qualitatively. Is there a way to identify a URL with only a few bits? There is: the Bloom filter was designed for exactly this kind of problem.

 

Bloom filter

A Bloom Filter is a space-efficient randomized data structure. It uses a bit array to represent a set compactly and can test whether an element belongs to the set. This efficiency comes at a cost: when testing membership, it may report that an element belongs to the set when it does not (a false positive). Bloom Filters are therefore unsuitable for "zero error" applications; where a low error rate can be tolerated, they trade a few errors for large savings in storage space.

Its principle is simple. You need a bit array (all bits initialized to 0) and k independent hash functions. To insert an element, compute its k hash values and set the corresponding bits to 1. To look up an element, compute the same k hash values: if all the corresponding bits are 1, the element is (probably) present; otherwise it is definitely absent. Clearly this lookup is not guaranteed to be 100% correct.

How should the size of the bit array m and the number of hash functions k be chosen for n input elements? The false positive rate is minimized when k = (ln 2) · (m/n). To keep the error rate at most E, m must be at least n · log2(1/E); and since at least half of the bit array should remain 0, m should really be at least n · log2(1/E) · log2(e), which is about 1.44 times n · log2(1/E). For an error rate of 0.01 this works out to m ≈ 9.6 n, and then k ≈ 7.
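As a quick check on these formulas, here is a small sizing helper (a sketch; the numbers in main are just an example with 100 million URLs and a 1% error rate):

public class BloomSizing {
    /** Bits needed for n elements at false-positive rate e: m = -n * ln(e) / (ln 2)^2 */
    public static long optimalBits(long n, double e) {
        return (long) Math.ceil(-n * Math.log(e) / (Math.log(2) * Math.log(2)));
    }

    /** Optimal number of hash functions: k = (m / n) * ln 2 */
    public static int optimalHashes(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 100_000_000L;          // 100 million URLs
        double e = 0.01;                // 1% false-positive rate
        long m = optimalBits(n, e);     // about 9.6 bits per URL, roughly 114 MB in total
        int k = optimalHashes(n, m);    // about 7 hash functions
        System.out.printf("m = %d bits (%.1f MB), k = %d%n", m, m / 8.0 / 1024 / 1024, k);
    }
}

Compare this with the 10 GB estimate for storing raw URL strings: the Bloom filter brings 100 million URLs down to roughly a hundred megabytes.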

Google's Guava library contains a Bloom filter implementation that is concise and worth studying. Let's walk through the Java code together.

 1 public <T> boolean put(T object, Funnel<? super T> funnel, int numHashFunctions, BitArray bits) {
 2     long bitSize = bits.bitSize();
 3     long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
 4     int hash1 = (int) hash64;
 5     int hash2 = (int) (hash64 >>> 32);
 6 
 7     boolean bitsChanged = false;
 8     for (int i = 1; i <= numHashFunctions; i++) {
 9         int combinedHash = hash1 + (i * hash2);
10         // Flip all the bits if it's negative (guaranteed positive number)
11         if (combinedHash < 0) {
12             combinedHash = ~combinedHash;
13         }
14         bitsChanged |= bits.set(combinedHash % bitSize);
15     }
16     return bitsChanged;
17 }

 

01  This method hashes a piece of data and stores it into the BitArray; it returns true if the BitArray changed, false otherwise. The parameters are the data, its funnel, the number of hash functions, and the BitArray.

03  Use murmur3 to hash the object into a single long value. Why only one hash, shouldn't there be numHashFunctions of them? Read on.

04 05  Split hash64 into two halves: hash1 (the low 32 bits) and hash2 (the high 32 bits).

08 09  Here is the key point: the numHashFunctions hash functions are derived simply as hash1 + (i * hash2). Excuse me? Isn't that a bit casual? Don't worry, see "Less Hashing, Same Performance: Building a Better Bloom Filter", which shows this trick does not hurt the Bloom filter: "A standard technique from the hashing literature is to use two hash functions h1(x) and h2(x) to simulate additional hash functions of the form gi(x) = h1(x) + i*h2(x). We demonstrate that this technique can be usefully applied to Bloom filters and related data structures. Specifically, only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive probability." This optimization is very useful, since hashing is still expensive.

11 12  If combinedHash is negative, flip all its bits to make it non-negative (a blunt but effective trick).

14  Set the corresponding bit in the BitArray; step into set() to see how.

 1 boolean set(long index) {
 2     if (!get(index)) {
 3         data[(int) (index >>> 6)] |= (1L << index);
 4         bitCount++;
 5         return true;
 6     }
 7     return false;
 8 }
 9   
10 boolean get(long index) {
11     return (data[(int) (index >>> 6)] & (1L << index)) != 0;
12 }

 

02  First call get() to check whether the bit is already set to 1.

03  data is an array of longs. Shifting index right by 6 bits divides it by 64, which locates the array element that contains the bit; (1L << index) then selects the bit's position within that 64-bit long (Java's long shift uses only the low 6 bits of index).
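For completeness, a caller normally does not invoke put() on the strategy directly; a typical use of Guava's public BloomFilter API looks roughly like this (the expected insertion count and error rate are illustrative):

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class UrlBloomDedup {
    public static void main(String[] args) {
        // 100 million expected URLs, 1% false-positive rate
        BloomFilter<CharSequence> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000_000, 0.01);

        String url = "http://news.baidu.com/";
        if (!seen.mightContain(url)) {  // definitely not seen before
            seen.put(url);              // mark as seen, then enqueue for crawling
        }
    }
}

Remember the trade-off: mightContain() can occasionally return true for a URL that was never added, which in a crawler means a small fraction of pages may be skipped.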

 

Next step

Those are a few ideas about URL deduplication; I hope they help. In the next issue I plan to cover character encoding and decoding, and a clean solution to garbled text. Goodbye!

 
