Lecture 49: Hands-On: Implementing a Distributed Crawler with Scrapy-Redis

In the previous lesson, we learned the basic principles of Scrapy-Redis. In this lesson, we will combine it with an earlier case to implement a Scrapy-Redis-based distributed crawler.

1. Environment Preparation

In this lesson, we build on the dynamic rendering page crawling case from Lecture 46 (Scrapy and Pyppeteer) and rewrite it as a Redis-based distributed crawler.

First of all, we need to download the code; its GitHub address is https://github.com/Python3WebSpider/ScrapyPyppeteer . Enter the project and try running the code to make sure it executes smoothly. The running effect is shown in the figure below:
(Figure: output of the Scrapy-Pyppeteer crawler running successfully)
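For reference, getting the project locally could look like the following (the repository URL is the one given above; whether the project ships a requirements.txt is an assumption, and the spider name book is the one used later in this lesson):

git clone https://github.com/Python3WebSpider/ScrapyPyppeteer.git
cd ScrapyPyppeteer
pip3 install -r requirements.txt
scrapy crawl book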

Secondly, we need a Redis database. You can install it directly from the official installation package, or start it with Docker; just make sure you can connect to and use it normally. For example, here I started a Redis database locally on localhost, running on port 6379 with an empty password.
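Before touching any Scrapy settings, you can quickly verify that this Redis instance is reachable. Below is a minimal sketch using the redis-py package (assuming it is installed, and that Redis is indeed listening locally on port 6379 with no password):

import redis

# Connect to the local Redis instance described above
client = redis.StrictRedis(host='localhost', port=6379, password=None)
# ping() returns True if the connection works
print(client.ping())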

In addition, we also need to install the Scrapy-Redis package; the installation command is as follows:

pip3 install scrapy-redis

After installation, make sure it can be imported and used normally.
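A quick way to check is to import it from the command line:

python3 -c "import scrapy_redis"

If this exits without an ImportError, the package is ready to be used by Scrapy.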

2. Implementation

Next, we only need a few simple steps to configure the distributed crawler.

2.1 Modify Scheduler

In the previous lesson, we explained the concept of the Scheduler, which handles the scheduling logic for Request objects. By default, the Request queue lives in memory. To make the crawler distributed, we need to move this queue into Redis, which means replacing the Scheduler. The modification is very simple: just add the following line to settings.py:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"

Here we replace the default Scheduler with the Scheduler class provided by Scrapy-Redis, so that when we run the crawler, the Request queue is stored in Redis.
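Once a crawler is running with this Scheduler, you can peek into Redis to see the queue. A small inspection sketch with redis-py (assuming the default Scrapy-Redis key naming, under which the pending Requests of a spider named book are stored under the key book:requests):

import redis

client = redis.StrictRedis(host='localhost', port=6379)
# 1 if the request queue key exists, 0 otherwise
print(client.exists('book:requests'))
# Shows the underlying Redis data type of the queue
print(client.type('book:requests'))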

2.2 Modify Redis connection information

In addition, we need to modify the Redis connection information so that Scrapy can successfully connect to the Redis database. The modification format is as follows:

REDIS_URL = 'redis://[user:pass]@hostname:9001'

Modify it according to the format above. Since my Redis is running locally without a username or password, I simply set it as follows:

REDIS_URL = 'redis://localhost:6379'

2.3 Modify the deduplication class

Since the Request queue has been migrated to Redis, the corresponding deduplication also needs to move to Redis. In the previous lesson we explained how the DupeFilter works; here we change the deduplication class to achieve Redis-based deduplication:

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
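The fingerprints of Requests that have already been seen are likewise stored in Redis. Assuming the default key naming, they live in a set named after the spider, e.g. book:dupefilter for a spider named book, and you can check how many fingerprints have accumulated:

import redis

client = redis.StrictRedis(host='localhost', port=6379)
# Each member of this set is the fingerprint of one deduplicated Request
print(client.scard('book:dupefilter'))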

2.4 Configure Persistence

Generally speaking, once the Redis-based distributed queue is enabled, we don't want the crawler to delete the entire queue and the deduplication information when it shuts down, because we may well stop the crawler manually, or it may terminate unexpectedly. To handle this, we can enable persistence of the Redis queue by adding the following setting:

SCHEDULER_PERSIST = True

Well, so far we have completed the configuration of the distributed crawler.
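To recap, the full set of additions made to settings.py in this section is shown below (the Redis URL is the local example used above; adjust it to your own instance):

# Use the Scrapy-Redis scheduler so the Request queue lives in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Connection information for the Redis database (local instance, no password)
REDIS_URL = 'redis://localhost:6379'

# Redis-based deduplication of Requests
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and the deduplication set in Redis when the crawler stops
SCHEDULER_PERSIST = True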

3. Run

What we have built above is not yet a truly distributed crawler, because the Redis queue we are using runs locally, so multiple crawlers all have to run on the same machine. To make it truly distributed, you can use a remote Redis instance instead; crawlers running on multiple hosts can then all connect to that Redis and share the same queue.
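For example, if a remote Redis instance were running on a (purely hypothetical) host 10.0.0.2 on port 6379 with the password foobared, every host running the crawler would use the same setting:

REDIS_URL = 'redis://:foobared@10.0.0.2:6379'

The host, port and password here are placeholders; substitute the details of your own Redis server.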

That said, we can still start multiple crawlers locally to verify the crawling effect. Run the following command in several command-line windows:

scrapy crawl book

The first crawler runs as follows:
(Figure: output of the first crawler)
Do not close this window yet; open another window and run the same crawl command:

scrapy crawl book

The running effect is as follows:
(Figure: output of the second crawler)
This time we can see that it starts crawling from page 24. This is because the shared crawl queue already contains the Requests generated by the first crawler; when the second crawler starts, it detects these existing Requests, reads them directly from the queue, and continues crawling from there.

Similarly, we can start a third and a fourth crawler that perform the same crawl. In this way, we have successfully implemented basic distributed crawling based on Scrapy-Redis.
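Because SCHEDULER_PERSIST is enabled, the crawl queue and the deduplication set stay in Redis after the crawlers stop, so you can inspect them afterwards. A small sketch, again assuming the default key names for a spider called book and that the default priority queue (a Redis sorted set) is in use:

import redis

client = redis.StrictRedis(host='localhost', port=6379)

# Number of Request fingerprints recorded by the deduplication filter
print('dupefilter size:', client.scard('book:dupefilter'))
# Number of pending Requests still waiting in the shared queue
print('pending requests:', client.zcard('book:requests'))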

Alright, that's all for this lesson. See you in the next one.
