From Getting Started with Scrapy to Giving Up 02: The Architecture Diagram and Developing a Program

Foreword

The opening article of this Scrapy series was pure theory; this second one gets straight to the point. Let's first look at Scrapy's architecture, then develop a Scrapy crawler program from scratch.

This article explains the Scrapy architecture, clarifies the development process, and covers the basic operations.

Overall structure

Here is an architecture diagram I drew myself:

[Figure: Scrapy architecture diagram]

This is the overall structure of Scrapy. The flow looks complicated, but in fact only a few parts require the developer's involvement. Here is a brief introduction to each component.

  1. Spider : the crawler program you develop; it defines the site entry points, implements the parsing logic, and initiates requests.
  2. Pipeline : the data pipeline, where you customize how data is persisted.
  3. Middleware : middleware, in two flavors. Downloader middleware mainly processes requests and is used to add request headers, proxies, and so on; spider middleware processes responses and is rarely used.
  4. Scheduler : the scheduler, which stores the requests produced by crawlers.
  5. Downloader : the downloader, which sends requests to the target website and fetches the response content.

For a complete crawler, the developer is involved in parts 1, 2, and 3; even the simplest crawler only needs the Spider part.
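As a first taste of the developer-facing code, here is a minimal pipeline sketch; DemoPipeline is a made-up name, and pipelines are covered properly later in this series:

class DemoPipeline:
    def process_item(self, item, spider):
        # persist or transform each item yielded by the spider, then pass it on
        return item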

Preparation

Install Scrapy

Scrapy is installed like any ordinary module:

pip3 install scrapy

After installation there is a scrapy command, which we can use to create a new project, generate a new crawler program, enter the shell interactive environment, and so on.

Running scrapy by itself prints the available commands:

[Figure: output of the scrapy command]
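A few of the most commonly used subcommands:

scrapy startproject <name>         # create a new project
scrapy genspider <name> <domain>   # generate a spider from a template
scrapy crawl <spider>              # run a spider (inside a project)
scrapy shell <url>                 # open the interactive shell
scrapy version                     # print the Scrapy version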

New Project

Unlike an ordinary Python project, a Scrapy project is created from the command line and then imported into the IDE for development.

scrapy startproject [ProjectName]

Execute the above command to create a new Scrapy project.

[Figure: output of scrapy startproject]

As the project structure shows, a Scrapy project is divided into four major modules, each corresponding to a part of the architecture.

[Figure: the four project modules]
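For example, scrapy startproject DouLuoDaLu generates roughly the following layout (a few files omitted):

DouLuoDaLu/
├── scrapy.cfg               # deployment configuration
└── DouLuoDaLu/
    ├── items.py             # item definitions
    ├── middlewares.py       # downloader and spider middleware
    ├── pipelines.py         # data pipelines
    ├── settings.py          # project settings
    └── spiders/             # crawler programs live here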

Create a new crawler

Import the project into the IDE. The spiders package holds the crawler programs you develop, and new crawler programs are also created via the command line:

# domain is the site's domain name, e.g. Baidu's domain is www.baidu.com
scrapy genspider [SpiderName] [domain]

Executing this command from any directory inside the Scrapy project creates a new crawler program under spiders.
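For example, for the Douluo Continent crawler developed later in this article:

scrapy genspider DouLuoDaLu v.qq.com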

[Figure: output of scrapy genspider]

Crawler development

As shown in the figure, the Scrapy crawler program has been generated; we implement the parsing code inside it to complete the development.

As before, we take Douluo Continent as our example; the program code is as follows.

[Figure: the DouLuoDaLu spider code]
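The generated code looks roughly like this; this is a sketch based on the standard genspider template, with the start URL swapped for the Douluo Continent page requested later in this article:

import scrapy

class DouLuoDaLuSpider(scrapy.Spider):
    name = 'DouLuoDaLu'                  # unique identifier used to start the crawler
    allowed_domains = ['v.qq.com']       # domains the crawler may visit
    start_urls = ['https://v.qq.com/detail/m/m441e3rjq9kwpsc.html']

    def parse(self, response):
        # parsing logic goes here
        pass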

Program structure

Every Scrapy spider defines these four members:

  1. name : the crawler's name within the project, the unique identifier used to start it
  2. allowed_domains : restricts the domains the crawler may visit
  3. start_urls : the website entry points, i.e. the start URLs
  4. parse : the default first parsing function

As mentioned above, start_urls is the crawler's entry point. So how does it actually initiate requests and pass the response to parse() for analysis? And since it is a list, can there be multiple entry URLs?

start_requests()

Every crawler class inherits from Spider, whose start_requests method initiates the requests and automatically passes each response to parse().
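Simplified, the base implementation in the Spider class looks something like this (a paraphrase of the Scrapy source, not a verbatim copy):

def start_requests(self):
    # iterate over every entry URL, so there can indeed be several
    for url in self.start_urls:
        yield Request(url, dont_filter=True)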

As you can see, the method simply iterates over start_urls and issues a request for each one, so yes, there can be multiple entry URLs. But what if I don't want the response handed to parse() and would rather use a custom method?

No need to panic: just override start_requests.

Here we define a custom parsing function, parse_first, and use the callback parameter to point the request at it. Remember: never put parentheses after the function name here. Parentheses would execute the function immediately; without them you are passing a reference.
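A sketch of the override; parse_first is our own method name, and Request must be imported from scrapy:

# at the top of the spider module:
from scrapy import Request

# inside the spider class:
def start_requests(self):
    for url in self.start_urls:
        # self.parse_first without parentheses: we pass a reference, not a call
        yield Request(url, callback=self.parse_first)

def parse_first(self, response):
    print(response.url)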

The output of the modified program is the same as before.

Request

We use yield Request to initiate the request. Why not return? Because yield does not return immediately and does not terminate the method. This touches on Python generators; interested readers can dig deeper.
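A tiny illustration of the difference:

def gen():
    yield 1    # pauses here and resumes on the next iteration, instead of terminating
    yield 2

print(list(gen()))    # [1, 2]: both values come out, unlike a single return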

The commonly used parameters of Request are:

  1. url : the URL to request
  2. callback : the callback function that processes the response
  3. meta : a dictionary of key-value data passed to the callback via the response
  4. dont_filter : defaults to False, meaning URL deduplication is enabled. If we put two identical URLs in start_urls, the result is output only once; change it to True and it is output twice.
  5. method : the request method, GET by default
  6. priority : the request priority, 0 by default; the larger the value, the higher the priority

As for the cookies and headers parameters, they can also be set on the Request, but most of the time they are configured in the downloader middleware instead.
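Putting these parameters together, a request might look like this (the meta key 'page' is purely illustrative):

yield Request(
    url=url,
    callback=self.parse_first,
    meta={'page': 1},       # read back in the callback as response.meta['page']
    dont_filter=False,      # keep URL deduplication enabled
    method='GET',
    priority=0,
)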

Starting the crawler

There are two main ways to start the Scrapy crawler program.

Command-line start

The first is to run it from the command line inside the Scrapy project directory:

scrapy crawl [SpiderName]

The drawback of this method is obvious: the IDE's Debug feature is unavailable, so it is usually reserved for production.

IDE start

During development we usually use the second method, which also fits the usual way we launch programs: create a new Python script and use the command-line tool module to execute the crawl command.

from scrapy.cmdline import execute

if __name__ == "__main__":
    # equivalent to running "scrapy crawl DouLuoDaLu" on the command line
    execute("scrapy crawl DouLuoDaLu".split(" "))

In this way, the program can be started in the IDE and the Debug function can be used.

[Figure: debugging the crawler in the IDE]

scrapy shell interactive environment

We can debug the parsing code in the shell interactive environment.

scrapy shell https://v.qq.com/detail/m/m441e3rjq9kwpsc.html

Enter the command and press Enter; Scrapy requests the Douluo Continent page and drops you into the shell environment.

[Figure: entering the scrapy shell]

As shown in the figure, after entering the shell environment several variables are automatically populated. Here we only care about the response.

[Figure: parsing the response in the shell]

As shown in the figure, we parse the webpage inside the shell interactive environment. We can then copy the tested parsing code into the program, which improves development efficiency.

Enter view(response), press Enter, and the page will be automatically opened in the browser.
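A few things worth trying in the shell; the XPath here is illustrative, not the page's actual structure:

response.status                            # HTTP status code of the fetched page
response.xpath('//title/text()').get()     # extract the page title
view(response)                             # open the downloaded page in the browser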

Epilogue

In the sample program, requests and responses only flow through the right half of the architecture diagram. To persist data we would need to define pipelines and so on, and the program has only one layer of parsing, parse().

If deeper crawling is needed, we must initiate further requests inside parse and define new callback functions to parse them, until we reach the pages holding the data we want. Of course, all of that comes later.
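As a preview, deeper crawling looks roughly like this; parse_detail is a hypothetical second-level callback, and response.follow requires a reasonably recent Scrapy:

def parse(self, response):
    # follow each link on the listing page to a detail page
    for href in response.css('a::attr(href)').getall():
        yield response.follow(href, callback=self.parse_detail)

def parse_detail(self, response):
    # parse the page that actually holds the data we want
    pass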

The Scrapy series sat on hold for a long time after the opening article: partly because I have been genuinely busy lately, and partly because there is so much Scrapy material that I didn't know where to begin. But I will keep writing. Updates may be a little slow, so friends are welcome to nudge me, and I hope you will share your valuable opinions.



A post-95 young programmer, writing about everyday practice from a beginner's perspective, from 0 to 1, in detail and in earnest.

Articles are published on the public account [ Getting Started to Give Up Road ]; I look forward to your follow.

Thanks for every follow.
