Distributed crawler principles

A distributed crawler links multiple hosts together to accomplish one crawling task, which greatly improves crawling efficiency.

In fact, search engines are crawlers: they are responsible for crawling website content from all over the world and, when you search for a keyword, they show you the related content. But they are enormous crawlers, and the amount of content they crawl is beyond imagination, far more than a single-machine crawler could handle. Instead they work in a distributed way: if one server is not enough, use 1,000. With so many servers distributed around the world, the crawlers cooperate with each other to complete the work together, and that is how distributed crawlers came about.

I. Distributed crawler architecture

Before looking at the distributed crawler architecture, let's first review the architecture of Scrapy, shown below.

A single Scrapy crawler has a local crawl queue (Queue), implemented with the deque module. Newly generated Requests are placed into the queue, and the Scheduler then schedules them. Next, each Request is handed to the Downloader to perform the crawl. This simple scheduling architecture is shown in the figure.
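As a rough illustration (simplified, not Scrapy's actual code), the local queue behaves like a deque that newly generated requests are appended to and that the Scheduler pops requests from:

from collections import deque

queue = deque()                              # single-process, in-memory crawl queue

# The engine appends newly generated requests...
queue.append('http://example.com/page/1')
queue.append('http://example.com/page/2')

# ...and the Scheduler pops them for the Downloader to fetch.
next_request = queue.popleft()               # FIFO order; queue.pop() would give LIFO order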

What happens to crawling efficiency if two Schedulers take Requests from the queue at the same time, each with its own corresponding Downloader? With sufficient bandwidth and normal crawling, and ignoring the access pressure on the queue, crawling efficiency is doubled.

In this way, the Scheduler can be scaled out, and so can the Downloader. The crawl queue (Queue), however, must always remain a single one, also known as the shared crawl queue. Only then can we guarantee that after one Scheduler schedules a Request from the queue, no other Scheduler will schedule the same Request again, allowing multiple Schedulers to crawl in sync. This is the basic form of a distributed crawler; the simple scheduling architecture is shown below.

What we need to do is run crawler tasks cooperatively on multiple hosts at the same time, with a shared crawl queue as the premise of the collaboration. Each host then no longer needs to maintain its own crawl queue; instead it fetches Requests from the shared queue. Each host still has its own Scheduler and Downloader, which carry out the scheduling and downloading. If the access performance of the queue is not a concern, crawling efficiency is still multiplied.

Conditions for multiple crawlers to cooperate:

  • A shared crawl queue is required
  • Deduplication, so that one crawler does not crawl what another crawler has already crawled

Understanding distributed crawlers:

  • Suppose there are thousands of URLs to crawl and more than 100 crawlers located in different cities across the country. Different URLs are distributed to different crawlers, but the crawlers' efficiency varies, so a shared queue and shared data let the more efficient crawlers take on more of the tasks instead of waiting for the less efficient ones (see the sketch below)
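A minimal sketch of this idea, assuming a shared Redis instance and the redis-py client (the key name and the fetch_and_parse() helper are hypothetical): each crawler simply keeps pulling the next URL from the shared list, so faster crawlers naturally take on more of the work.

import redis

r = redis.Redis(host='shared-redis-host', port=6379)   # assumed shared Redis server

def crawl_worker(fetch_and_parse):
    # One crawler process: pull URLs from the shared queue until it is drained.
    while True:
        url = r.lpop('crawler:start_urls')   # hypothetical shared queue key
        if url is None:
            break
        fetch_and_parse(url.decode())        # placeholder for the real download/parse step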

Redis

  • Redis is completely free and open source, complies with the BSD license, and is a high-performance key-value database
  • It is an in-memory database: data is stored in memory
  • It can also persist data to disk
  • It supports deduplication
  • Redis can be understood as a big collection of dicts, sets, and lists
  • Redis can set a lifetime (expiration) on stored content (a few illustrative commands follow this list)
  • Redis tutorial: Redis Tutorial - Runoob (菜鸟教程)
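A few illustrative redis-py calls matching the points above; the client setup and key names are assumptions for the example, not part of any crawler framework:

import redis

r = redis.Redis()                             # assumes Redis running locally on the default port

r.hset('page:1', 'title', 'Example')          # hash: behaves like a dict
r.sadd('seen_urls', 'http://example.com')     # set: duplicates are ignored, useful for deduplication
r.rpush('todo_urls', 'http://example.com/a')  # list: usable as a queue
r.expire('page:1', 3600)                      # give the stored content a lifetime of one hour
print(r.ttl('page:1'))                        # remaining lifetime in seconds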

Databases for storing the crawled content

  • MongoDB: runs in memory, with data persisted to disk
  • MySQL

 

Distributed crawler structure

Master-slave distributed crawler

  • In so-called master-slave mode, one server acts as the master and a number of servers act as slaves. The master is responsible for managing all connected slaves, including managing slave connections, scheduling and distributing tasks, and collecting and summarizing results. Each slave only needs to receive tasks from the master, complete them on its own, and upload the final results; it does not need to communicate with other slaves during this period. This approach is simple and easy to manage, but the master obviously has to communicate with every slave, so the master's performance becomes the bottleneck of the whole system; when a large number of slaves are connected, the performance of the entire crawler system can easily degrade
  • Structure diagram of a master-slave distributed crawler:

[Figure: master-slave distributed crawler structure]

This is the classic structure diagram of a master-slave distributed crawler: the control node (ControlNode) in the figure is the master described above, and the crawler node (SpiderNode) is the slave. The figure below is a schematic of how the crawler (slave) nodes carry out their tasks.

  • Execution flow chart of the control node:

[Figure: control node execution flow]

The two figures above describe the entire crawler framework very clearly; let's walk through it:

  • 1. The entire distributed crawler system consists of two parts: the master control node and the slave crawler nodes
  • 2. The master control node is responsible for: task scheduling for the slave nodes, URL management, and result processing
  • 3. The slave crawler nodes are responsible for: crawler scheduling on the node, HTML download management, and HTML content parsing management
  • 4. The system workflow: the master distributes the tasks (URLs not yet crawled); each slave receives tasks (URLs) from the master's URL manager and completes them on its own (downloading the HTML for the URL and parsing the content), the parsed content containing the target data and new URLs; the slave submits its results (target data + new URLs) to the master's data extraction process (result processing belongs to the master); that process then completes two tasks: handing the extracted new URLs to the URL manager, and handing the extracted target data to the data storage process; the master's URL manager verifies the received URLs (have they been crawled?) and processes them (URLs not yet crawled are added to the to-crawl set, URLs already crawled are added to the crawled set); the slaves then fetch URLs from the URL manager, perform the tasks, submit results, and so on in a loop (a minimal sketch of the URL manager follows this list)
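As a rough, in-memory sketch of the URL manager described in step 4 (the class and method names are made up for illustration; a real deployment would back these sets with Redis or a database):

class UrlManager:
    # Keeps the 'to crawl' and 'already crawled' URL sets described above.

    def __init__(self):
        self.to_crawl = set()
        self.crawled = set()

    def add_new_urls(self, urls):
        # Only accept URLs that appear in neither set (the verification step).
        for url in urls:
            if url not in self.to_crawl and url not in self.crawled:
                self.to_crawl.add(url)

    def has_pending(self):
        return bool(self.to_crawl)

    def get_url(self):
        # Hand a task to a slave and mark the URL as crawled.
        url = self.to_crawl.pop()
        self.crawled.add(url)
        return url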

 

II. Maintaining the crawl queue

What should we use to maintain the queue? The first thing to consider is performance. We naturally think of Redis, a memory-based store that supports a variety of data structures, such as lists (List), sets (Set), and sorted sets (Sorted Set), with very simple access operations.

Each of the data structure types Redis supports has its own advantages for the crawl queue.

  • Lists have lpush(), lpop(), rpush(), and rpop() methods, so we can use a list to implement a first-in-first-out crawl queue or a last-in-first-out, stack-like crawl queue.

  • The elements of a set are unordered and non-repeating, so we can very conveniently implement a randomly ordered, non-repeating crawl queue.

  • Sorted sets attach a score to each element, and Scrapy's Requests carry priorities, so we can use a sorted set to implement a queue with priority scheduling.

We can flexibly choose among these queues according to the specific needs of the crawler, as sketched below.
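A sketch of the three queue types using the redis-py client (the key names are arbitrary):

import redis

r = redis.Redis()

# FIFO queue: push on one end, pop from the other.
r.lpush('queue:fifo', 'request1')
first_in = r.rpop('queue:fifo')

# LIFO (stack-like) queue: push and pop from the same end.
r.lpush('queue:lifo', 'request1')
last_in = r.lpop('queue:lifo')

# Priority queue: a sorted set scored by priority, always popping the highest score.
r.zadd('queue:priority', {'request1': 10, 'request2': 50})
highest = r.zrange('queue:priority', -1, -1)[0]   # member with the largest score
r.zrem('queue:priority', highest)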

III. How to deduplicate

Scrapy deduplicates automatically, and the deduplication uses a Python set. This set records the fingerprint of every Request that Scrapy has seen, where the fingerprint is essentially a hash of the Request.

We can take a look at Scrapy's source code, as follows:

import hashlib
import weakref

from scrapy.utils.python import to_bytes
from w3lib.url import canonicalize_url

# Cache of already computed fingerprints, keyed by Request object.
_fingerprint_cache = weakref.WeakKeyDictionary()


def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple(to_bytes(h.lower())
                                for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        # The fingerprint is a SHA1 hash of the request's method, canonical URL,
        # body and (optionally) selected headers.
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]

The request_fingerprint() method computes the fingerprint of a Request, using hashlib's sha1() method. The fields involved in the computation are the Request's Method, URL, Body, and Headers; if any of these differs even slightly, the computed result differs. The result is a hashed string, i.e. the fingerprint. Every Request has a unique fingerprint, and since a fingerprint is a string, deciding whether two Requests are duplicates by comparing strings is much easier than comparing Request objects, so the fingerprint can serve as the basis for judging whether a Request is a duplicate.
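For example, running the function above (with Scrapy installed), two Requests whose URLs differ only in query-parameter order canonicalize to the same URL and therefore produce the same fingerprint:

from scrapy import Request

fp1 = request_fingerprint(Request('http://example.com/?a=1&b=2'))
fp2 = request_fingerprint(Request('http://example.com/?b=2&a=1'))
print(fp1 == fp2)   # True: canonicalize_url() sorts the query parameters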

So how do we judge duplicates? Scrapy implements it as follows:

class RFPDupeFilter:
    # Simplified excerpt of Scrapy's duplicates filter (scrapy/dupefilters.py).

    def __init__(self):
        self.fingerprints = set()    # fingerprints of every Request seen so far

    def request_seen(self, request):
        fp = self.request_fingerprint(request)   # wraps the request_fingerprint() function shown above
        if fp in self.fingerprints:
            return True                          # duplicate Request
        self.fingerprints.add(fp)

The deduplication class RFPDupeFilter has a request_seen() method that takes a request parameter; its job is to detect whether that Request object is a duplicate. The method calls request_fingerprint() to get the Request's fingerprint and checks whether the fingerprint exists in the fingerprints variable, which is a set whose elements never repeat. If the fingerprint exists, it returns True, indicating the Request is a duplicate; otherwise the fingerprint is added to the set. The next time the same Request comes along, its fingerprint is identical and already in the set, so the Request object is immediately judged a duplicate. This achieves the goal of deduplication.

Scrapy's deduplication process thus exploits the property that the elements of a set never repeat in order to deduplicate Requests.

For a distributed crawler, we certainly cannot let each crawler deduplicate against its own separate set, because then each host would maintain its own set and nothing would be shared. If multiple hosts generated the same Request, they could only deduplicate it locally; deduplication across hosts would be impossible.

To deduplicate across hosts, the fingerprint set must also be shared. Redis happens to provide a set data structure, so we can use a Redis set as the shared fingerprint set; deduplication is then done against this shared Redis set. After a host generates a new Request, it compares the Request's fingerprint against the set: if the fingerprint already exists, the Request is a duplicate; otherwise the Request's fingerprint is added to the set. Using the same principle with a different storage structure, we achieve distributed deduplication of Requests.
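A minimal sketch of this shared fingerprint set (not Scrapy-Redis's actual code): Redis's sadd() returns 1 only when the member is newly added, so it checks for and records the fingerprint in a single atomic step.

import redis

r = redis.Redis()   # assumed shared Redis instance reachable by every host

def request_seen(fingerprint, key='shared:dupefilter'):
    # Returns True if any host has already recorded this fingerprint.
    added = r.sadd(key, fingerprint)   # 1 if newly added, 0 if it already existed
    return added == 0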

IV. Preventing interruptions

In Scrapy, the Request queue of a running crawler is kept in memory. When the crawler stops, the queue's memory is released and the queue is destroyed, so once a crawler goes down, running it again amounts to a brand-new crawl.

To resume crawling after an interruption, we can save the Requests in the queue and, on the next crawl, read the saved data to recover the queue of the previous run. In Scrapy we only need to specify a storage path for the crawl queue, identified by the JOB_DIR variable, with a command like this:

scrapy crawl spider -s JOB_DIR=crawls/spider

More detailed usage can be found in the official documentation: https://doc.scrapy.org/en/latest/topics/jobs.html.

In Scrapy this essentially saves the crawl queue to local disk, so a second crawl can read and restore the queue directly. Do we have to worry about this in a distributed architecture? No, because the crawl queue itself is stored in the database: if the crawler is interrupted, the Requests are still there, and the next run will continue from where it left off.

So when the Redis queue is empty, the crawler starts crawling from scratch; when the Redis queue is not empty, the crawler resumes from where it left off.

V. Architecture implementation

Next we need to implement this architecture in code. First, we need to implement the shared crawl queue and the deduplication functionality. In addition, we need to rewrite an implementation of the Scheduler so that it can fetch Requests from the shared crawl queue.
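Roughly speaking, the rewritten Scheduler has to back Scrapy's scheduler interface (enqueue_request(), next_request(), has_pending_requests()) with the shared queue. The skeleton below is only a simplified sketch of the idea, not Scrapy-Redis's implementation; the serialization and key name are made up:

import pickle
import redis

class RedisBackedScheduler:
    # Simplified sketch: keep serialized requests in a shared Redis list.

    def __init__(self, queue_key='shared:requests'):
        self.server = redis.Redis()
        self.queue_key = queue_key

    def enqueue_request(self, request):
        # A real implementation would serialize the full Request; this keeps only the basics.
        data = pickle.dumps({'url': request.url, 'method': request.method})
        self.server.lpush(self.queue_key, data)

    def next_request(self):
        data = self.server.rpop(self.queue_key)
        if data is None:
            return None
        return pickle.loads(data)   # a real Scheduler would rebuild a Request object here

    def has_pending_requests(self):
        return self.server.llen(self.queue_key) > 0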

Fortunately, someone has already implemented this logic and architecture and released it as a Python package called Scrapy-Redis. Next we will look at the Scrapy-Redis source code and examine how it works in detail.

 




Source: blog.csdn.net/Aibiabcheng/article/details/105170221