Python crawler framework scrapy (1): Introduction to the Scrapy crawler framework

Introduction: (There are many crawler frameworks for Python; this article and the next few will cover only the Scrapy framework.)
1: An overview of the Scrapy crawler framework's components and the principles of its crawling mechanism
1. Scrapy architecture:
(Figure: the Scrapy architecture diagram)
The components:
Engine (ENGINE)
Responsible for controlling the data flow between all components of the system and triggering events when certain actions occur. For details, see the data flow section below.
Scheduler (SCHEDULER)
Accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks for them again. You can think of it as a priority queue of URLs: it decides which URL to crawl next and also filters out duplicate URLs.
Downloader (DOWNLOADER)
Downloads web content and returns it to the engine. The downloader is built on Twisted's efficient asynchronous model.
Spiders (SPIDERS)
Developer-defined classes used to parse responses, extract items, or send new requests.
Item pipeline (ITEM PIPELINES)
Responsible for processing items after they are extracted, including cleaning, validation, and persistence (for example, saving them to a database).
Downloader middlewares
Located between the engine and the downloader; they mainly process requests passed from the engine to the downloader and responses passed from the downloader back to the engine.
Spider middlewares
Located between the engine and the spiders; their main job is to process the input (responses) and output (requests and items) of the spiders.

The data flow in Scrapy is controlled by the execution engine and proceeds as follows (see the spider sketch after this list):
1. The engine obtains the initial requests from the spider.
2. The engine schedules the requests in the scheduler and asks for the next request to crawl.
3. The scheduler returns the next request to the engine.
4. The engine sends the request to the downloader through the downloader middleware (see process_request()).
5. After the download is complete, the downloader will generate a response (together with the page) and send it to the engine through the downloader middleware.
6. The engine receives the response from the downloader and sends it to the spider for processing, through the spider middleware (process_spider_input()).
7. The spider processes the response, and returns the scraped items and new requests (follow) to the engine through the spider middleware (process_spider_output()).
8. The engine sends the processed items to the item pipelines, then sends the processed requests to the scheduler and asks for possible next requests to crawl.
9. The process repeats (from step 1) until there is no request from the scheduler.
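To make the flow concrete, here is a minimal, hypothetical spider sketch (the site, selectors, and field names are placeholders, not from a real project): start_urls feeds step 1, parse() receives the downloaded response at step 6, and the items and follow-up requests it yields are step 7.

```python
# a minimal, illustrative spider showing where it sits in the data flow above
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']  # step 1: initial requests handed to the engine

    def parse(self, response):
        # step 6: the engine delivers the downloaded response here
        for quote in response.css('div.quote'):
            # step 7: scraped items go back to the engine, then on to the item pipelines
            yield {'text': quote.css('span.text::text').get()}
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            # step 7: follow-up requests go back to the engine, then on to the scheduler
            yield response.follow(next_page, callback=self.parse)
```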

Simulation process:
1. Engine: Hi! Spider, which website are you going to deal with?
2. Spider: The boss wants me to deal with xxxx.com.
3. Engine: Give me the first URL that needs to be processed.
4. Spider: Here you are, the first URL is xxxxxxx.com.
5. Engine: Hi! Scheduler, I have a request for you to sort me into the queue.
6. Scheduler: Okay, I'm processing it; please wait a moment.
7. Engine: Hi! Scheduler, give me the request you have processed.
8. Scheduler: Here you are, this is the request I processed
9. Engine: Hi! Downloader, you can download this request for me according to the boss's download middleware setting.
10. Downloader: OK! Here you are, this is the downloaded content. (If it fails: Sorry, this request failed to download. Then the engine tells the scheduler: this request failed to download, please record it and we will download it again later.)
11. Engine: Hi! Spider, here is the downloaded content; it has already been processed according to the boss's downloader middleware settings. You can handle it yourself. (Note: responses here are handled by the def parse() function by default.)
12. Spider: (after processing the data and finding URLs that need to be followed up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the item data I scraped.
13. Engine: Hi! Pipeline, I have an item here; please handle it for me! Scheduler, here is a URL that needs to be followed up; please handle it for me. Then the loop starts again from step four, until the boss has all the information he needs.
14. Pipeline and Scheduler: OK, doing it now!

2. Install the Scrapy framework (there is plenty of information online about installing Scrapy, so I won't repeat it here)
Under Windows:
pip install scrapy
After the installation succeeds, enter scrapy in the command terminal; if the output looks similar to the figure, the installation was successful.
(Figure: output of the scrapy command after a successful installation)
3. Basic operations of the Scrapy framework

Create a new scrapy project
scrapy startproject + project name
After creating a new project, open the following files:
(Figure: directory structure of the newly created project)
scrapy.cfg: configuration file, used to store the project's configuration information.
minyan: the project's Python module; your code goes here.
items.py: entity file, used to define the target entities (items) of the project (see the sketch after this list).
middlewares.py: middleware file, used to define spider and downloader middlewares.
pipelines.py: pipeline file, used to define the pipelines used by the project.
spiders: the directory where the crawler code is stored.
settings.py: the crawler's settings, mainly priority-related settings (the smaller the value, the higher the priority).
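As a sketch of what items.py can look like, assuming a project that scrapes quotes (the class and field names below are illustrative, not taken from the minyan project itself):

```python
# items.py - an illustrative item definition; the field names are assumptions
import scrapy


class QuoteItem(scrapy.Item):
    # each Field() declares one piece of data the spider will extract
    text = scrapy.Field()
    author = scrapy.Field()
```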

Create a crawler:

scrapy genspider + crawler name + crawler domain name
For example:
scrapy genspider itcast 'itcast.com'
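The genspider command generates a spider skeleton roughly like the one below (an approximation; the exact boilerplate may differ slightly between Scrapy versions):

```python
# spiders/itcast.py - approximate skeleton created by `scrapy genspider itcast itcast.com`
import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'                      # the name used with `scrapy crawl itcast`
    allowed_domains = ['itcast.com']     # requests outside these domains are filtered out
    start_urls = ['http://itcast.com/']  # initial request(s) handed to the engine

    def parse(self, response):
        # responses for start_urls arrive here by default
        pass
```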

Run crawler:
scrapy crawl + crawler name

Extracting data:

Using XPath to extract data can greatly improve efficiency, and it is less error-prone.
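A minimal sketch of XPath extraction inside a spider's parse() method (the XPath expressions and keys below are assumptions for illustration, not tied to a real page):

```python
# illustrative XPath extraction inside a spider's parse() method
def parse(self, response):
    # select every block that holds one record (this XPath is an assumed example)
    for node in response.xpath('//div[@class="li_txt"]'):
        yield {
            'name': node.xpath('./h3/text()').get(),   # text of the <h3> child
            'title': node.xpath('./h4/text()').get(),  # text of the <h4> child
        }
```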

Saving to a file:
The crawled data is passed to the item pipeline, and the pipeline saves it to a local file. This kind of storage is called persistent storage.
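As a sketch, a pipeline that persists items to a local JSON-lines file might look like the following (the class name and output file name are assumptions):

```python
# pipelines.py - an illustrative pipeline that writes every item to a local file
import json


class SaveToFilePipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('items.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields: write one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.file.close()
```

To activate it, register it under ITEM_PIPELINES in settings.py, for example {'minyan.pipelines.SaveToFilePipeline': 300} (the module path assumes the project created above).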

Origin blog.csdn.net/qq_45976312/article/details/113101525