Build a simple distributed crawler with scrapy-redis


Scrapy is a well-known crawler framework in the Python world: an application framework for crawling website data and extracting structured data, usable in a range of programs including data mining, information processing, and archiving historical data.
Although scrapy can do a lot, it is not enough on its own for large-scale distributed crawling. scrapy-redis changes scrapy's queue scheduling: the starting URLs are separated from start_urls and read from redis instead, so multiple clients can read from the same redis at the same time, which makes a distributed crawler possible. Even on a single machine the crawler can be run in multiple processes, which is very effective for large-scale crawling.

Preparation

Since distributed crawling sounds so good, what do we need to prepare?
Quite a few things, actually:
scrapy
scrapy-redis
redis
mysql
python's mysqldb module
python's redis module
Why mysql? Because we intend to store the collected data in mysql.

1. scrapy installation

pip install scrapy

You can also clone it from the corresponding GitHub repository: https://github.com/scrapy/scrapy/tree/1.1

2. scrapy-redis installation

pip install scrapy-redis

You can also clone it from the corresponding GitHub repository: https://github.com/rolando/scrapy-redis
What is the difference between the two? See this detailed answer on Zhihu: https://www.zhihu.com/question/32302268/answer/55724369

3. redis

Redis itself is only officially distributed for Linux-like environments and does not support Windows. The official website is http://redis.io/. If you need to practice on Windows, you can refer to my article: http://blog.csdn.net/howtogetout/article/details/51520254

4.mysql

Because we plan to use mysql to store the data, mysql configuration is indispensable. Download address: http://dev.mysql.com/downloads/

5.mysqldb module and redis module

These two are needed because python cannot operate the databases directly; it needs library support, and these two are the client libraries for the corresponding databases.
mysqldb: https://sourceforge.net/projects/mysql-python/files/mysql-python/1.2.3/ ; on Windows you can download the .exe directly for a quick install.
redis:

pip install redis

This is the easiest.
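As a quick sanity check that both client libraries are installed and can reach their services, you can open a connection to each. This is only a sketch; the host, user, password, and database name below are placeholders to replace with your own:

# Sanity check: connect to mysql and redis (credentials below are placeholders).
import MySQLdb
import redis

# mysql connection via the mysqldb module
db = MySQLdb.connect(host='localhost', user='root', passwd='your_password',
                     db='qcl', charset='utf8')
print(db.get_server_info())   # prints the mysql server version
db.close()

# redis connection via the redis module (defaults to localhost:6379)
r = redis.Redis(host='localhost', port=6379)
print(r.ping())               # True if redis is reachable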

Start building

Let's take a look at what is different in scrapy-redis.
(screenshot: a scrapy-redis spider)
The first difference is the spider's parent class: it now inherits from RedisSpider, a new spider type defined by scrapy-redis. The second is that there is no longer a start_urls list; there is a redis_key instead, and scrapy-redis pops URLs off that redis list to use as the request addresses.
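A minimal sketch of what such a spider looks like is shown below; the spider name, redis_key value, and parse logic are illustrative only, not the exact code of this project:

# Minimal scrapy-redis spider: no start_urls, URLs are popped from a redis list.
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = 'myspider_demo'              # illustrative name
    redis_key = 'myspider:start_urls'   # redis list this spider reads request URLs from

    def parse(self, response):
        # Parsing works exactly as in plain scrapy; yield items or further requests here.
        yield {'url': response.url}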

The target we picked this time is the tablet listings on 58.com.

Let's first look at the project structure.
(screenshot: project layout)
We can ignore the scrapy.cfg file and the readme.rst file (the latter is for GitHub; it is not something scrapy generates when creating the project).
The structure inside the pbdnof58 folder:
(screenshot: pbdnof58 folder contents)
the items definition file, the settings file, the pipeline processing file, and the spiders folder.
The spiders folder contains the concrete crawlers we wrote:
(screenshot: spiders folder contents)
You can see there are 2 crawlers in it: one crawls all the listing URL addresses and passes them to redis, and the other processes the specific product information from the addresses that were crawled.

Now for the specifics. First, the settings.py file.
(screenshot: spider module settings)
Like plain scrapy, specify where the spiders live.
(screenshot: pipeline settings)
Two classes in the pipeline process the data; the smaller the number, the earlier it executes.
(screenshot: mysql settings)
Because the data needs to be stored in mysql, you need to configure the mysql connection information. Redis is assumed to be local by default, so there is no redis configuration here; if you are connecting to another host, you need to configure the redis connection address.
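Roughly, the relevant parts of settings.py look like the sketch below. The SCHEDULER, DUPEFILTER_CLASS, SCHEDULER_PERSIST, and REDIS_* keys are standard scrapy-redis settings; the MYSQL_* keys and the pipeline class name are assumptions about how this particular project names things:

# settings.py sketch -- values are examples, not this project's exact configuration.
SPIDER_MODULES = ['pbdnof58.spiders']
NEWSPIDER_MODULE = 'pbdnof58.spiders'

# scrapy-redis: use the redis-backed scheduler and duplicate filter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True            # keep the redis queues between runs

# pipelines: the lower the number, the earlier it runs
ITEM_PIPELINES = {
    'pbdnof58.pipelines.MySQLPipeline': 300,      # stores items into mysql (name assumed)
    'scrapy_redis.pipelines.RedisPipeline': 400,  # optionally also dump items into redis
}

# mysql connection info read by our own pipeline (key names are assumptions)
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'qcl'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'your_password'

# redis defaults to localhost; uncomment to connect to another host
# REDIS_HOST = '192.168.1.100'
# REDIS_PORT = 6379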
Compared with plain scrapy, the items.py file
(screenshot: items.py)
additionally defines an ItemLoader class on top of the usual item definition; otherwise write it just as you normally would. The ItemLoader class will be used later.
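A sketch of what items.py might contain: the item fields mirror the mysql table defined below, and the ItemLoader is the extra class mentioned above (the class names and the output processor are assumptions):

# items.py sketch: fields mirror the mysql table, plus an ItemLoader used by the spider.
from scrapy import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class Pbdnof58Item(Item):
    title = Field()
    price = Field()
    quality = Field()
    area = Field()
    time = Field()


class Pbdnof58Loader(ItemLoader):
    # each add_xpath/add_value call collects a list; keep only the first value
    default_output_processor = TakeFirst()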
The most important file is pipeline.py:
(screenshot: mysql pipeline)
this is what stores the results into mysql.
Create a table named 58pbdndb in a database named qcl; qcl corresponds to the database configured in settings.

create table 58pbdndb(
   id INT NOT NULL AUTO_INCREMENT,
   title VARCHAR(100) NOT NULL,
   price VARCHAR(40) NOT NULL,
   quality VARCHAR(40),
   area VARCHAR(40),
   time VARCHAR(40) NOT NULL,
   PRIMARY KEY ( id )
)DEFAULT CHARSET=utf8;

Note: the pipeline does not check whether a record already exists before inserting, so if you are debugging repeatedly you may need to clear the table between runs.
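For reference, a minimal MySQLdb-based pipeline could look like the sketch below. It reads the MYSQL_* settings shown earlier and inserts one row per item; the class name and setting keys are assumptions, not the project's exact code:

# pipelines.py sketch: store each scraped item as one row in the 58pbdndb table.
import MySQLdb


class MySQLPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        pipe = cls()
        pipe.host = s.get('MYSQL_HOST', 'localhost')
        pipe.db = s.get('MYSQL_DBNAME', 'qcl')
        pipe.user = s.get('MYSQL_USER', 'root')
        pipe.passwd = s.get('MYSQL_PASSWD', '')
        return pipe

    def open_spider(self, spider):
        self.conn = MySQLdb.connect(host=self.host, user=self.user,
                                    passwd=self.passwd, db=self.db, charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # no existence check here, so repeated runs will insert duplicate rows
        self.cursor.execute(
            "INSERT INTO 58pbdndb (title, price, quality, area, time) "
            "VALUES (%s, %s, %s, %s, %s)",
            (item['title'], item['price'], item.get('quality'),
             item.get('area'), item['time']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()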
The 58Urlspider.py file
(screenshot: URL spider)
This crawler does two things. First, if a next page exists, it pushes the next page's address into the redis list myspider:58_urls so that it can keep crawling the listings itself. Second, it extracts the URLs of the specific products to crawl and pushes them into the redis list myspider:start_urls for the other crawler to consume.
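Sketched out, that URL spider could look roughly like the code below; the XPath expressions for 58.com are placeholders, and the two lpush calls illustrate the behaviour described above:

# 58Urlspider.py sketch: feed itself next-page URLs and feed the other spider product URLs.
from scrapy_redis.spiders import RedisSpider


class UrlSpider(RedisSpider):
    name = 'myspider_58page'
    redis_key = 'myspider:58_urls'    # this spider pops listing-page URLs from here

    def parse(self, response):
        # 1. push every product detail URL into the list the product spider reads from
        #    (the XPath is a placeholder; adjust it to the real 58.com page structure)
        for href in response.xpath('//td[@class="t"]/a/@href').extract():
            self.server.lpush('myspider:start_urls', response.urljoin(href))

        # 2. if a next page exists, push it back onto our own list to keep paging
        next_page = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_page:
            self.server.lpush('myspider:58_urls', response.urljoin(next_page))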
The 58spider-redis.py file
(screenshot: product spider)
This crawler grabs the specific product information. You can see the use of the ItemLoader class's add_xpath and add_value methods.
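And a sketch of the product spider, showing how the ItemLoader's add_xpath and add_value methods might be used (the XPaths and the imported class names are illustrative assumptions):

# 58spider-redis.py sketch: pop product URLs from redis and fill an item via ItemLoader.
from scrapy_redis.spiders import RedisSpider
from pbdnof58.items import Pbdnof58Item, Pbdnof58Loader


class ProductSpider(RedisSpider):
    name = 'myspider_58'
    redis_key = 'myspider:start_urls'   # filled by the URL spider above

    def parse(self, response):
        loader = Pbdnof58Loader(item=Pbdnof58Item(), response=response)
        # XPaths below are placeholders for the real 58.com selectors
        loader.add_xpath('title', '//h1/text()')
        loader.add_xpath('price', '//span[contains(@class, "price")]/text()')
        loader.add_xpath('quality', '//div[@class="su_con"]/text()')
        loader.add_xpath('time', '//li[@class="time"]/text()')
        # add_value injects a literal value instead of extracting it from the page
        loader.add_value('area', u'hangzhou')
        yield loader.load_item()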

Finally

Running it is the same as with plain scrapy: cd into the pbdnof58 folder (note: the one that directly contains the spiders folder) and enter

scrapy crawl myspider_58page
scrapy crawl myspider_58

You can start more than one of each to observe the multi-process effect. Once the crawlers are started you will find them sitting idle, waiting for URLs, because both redis lists are empty at this point. So we need

lpush myspider:58_urls http://hz.58.com/pbdn/0/

to push an initial address, and then you can happily watch all the crawlers get to work.
Finally, a screenshot of the resulting database:
(screenshot: database contents)

PS: the GitHub repository for this article: https://github.com/qcl643062/spider/tree/master/pbdnof58
