Sesame HTTP: An Overview of Advanced Python Crawler Frameworks

Overview

After getting started with crawlers, there are two paths we can take.

One is to keep learning in depth: study design patterns, strengthen your Python fundamentals, build your own wheels, and keep adding distributed crawling, multi-threading, and other features to your crawler. The other is to learn some excellent existing frameworks: become familiar with them first so you can handle basic crawler tasks (solving the "food and clothing" problem, so to speak), and then dig into their source code and internals to strengthen your skills further.

In my view, the former path means building the wheels yourself. Our predecessors have already written some good frameworks that could be used directly, but doing it yourself gives you a deeper and more comprehensive understanding of crawlers. The latter path means taking those well-written frameworks and using them well: first make sure you can complete the tasks at hand, then study the frameworks in depth. With the first path, the more you explore on your own, the more thorough your knowledge of crawlers becomes. With the second, you rely on other people's work, which is convenient, but you may lose the motivation to study the framework deeply, and your thinking may become constrained by it.

Personally, however, I prefer the latter. Building wheels is fine, but even then, aren't you still building on top of the standard library? Use what is already there. The point of learning a framework is to make sure you can meet basic crawling needs, the "food and clothing" problem mentioned above. If you keep building wheels but never finish anything, and when someone asks you to write a crawler you still can't deliver after all that research, wouldn't that be a loss? So for advanced crawling I still recommend learning a framework or two as weapons in your arsenal. At the very least you can get the job done: like carrying a gun onto the battlefield, you can at least hit the enemy, which beats endlessly sharpening your knife.

Framework overview

I have worked with several crawler frameworks, of which Scrapy and PySpider are the most useful. Personally, PySpider is easier to get started with and easier to operate, because it provides a web interface for writing crawlers quickly and integrates PhantomJS, so it can fetch pages rendered by JavaScript. Scrapy offers a higher degree of customization and is lower-level than PySpider; it is well suited for study and research, there is a lot of related knowledge to learn, and it is a good base for implementing distributed crawling and multi-threading yourself.

Here I will write down my own learning experience and share it with you. I hope you like it, and I hope it gives you some help.

PySpider

PySpider is an open-source implementation of a crawler architecture, written by binux. Its main functional requirements are:

  • Crawl, update, and schedule specific pages from multiple sites
  • Extract structured information from those pages
  • Stay flexible, scalable, stable, and monitorable

These are also the requirements of most Python crawlers: directional crawling and structured parsing. However, when facing websites with very different structures, a single crawling pattern is rarely enough, and flexible crawl control is needed. Simple configuration files are usually not flexible enough for this, so controlling the crawl through scripts is the final choice.
Functions such as de-duplication scheduling, queuing, fetching, exception handling, and monitoring are then provided by the framework as services to the crawl scripts while preserving that flexibility. Finally, a web-based editing and debugging environment, together with web-based task monitoring, completes the framework.

The design basis of pyspider is a fetch-loop-model crawler driven by Python scripts:

  • Structured information extraction, follow-link scheduling, and fetch control are all expressed in Python scripts, for maximum flexibility
  • A web-based script editing and debugging environment, with web-based display of scheduling status
  • The fetch-loop model is mature and stable; the modules are independent of each other, connected through message queues, and can scale flexibly from a single process to a multi-machine distributed deployment

(Figure: pyspider architecture)

The architecture of pyspider is mainly divided into the scheduler, the fetcher, and the processor (which executes the scripts):

  • The components are connected by message queues. Except for the scheduler, which is a single point, both the fetcher and the processor can be deployed as multiple distributed instances. The scheduler is responsible for overall scheduling control.
  • A task is scheduled by the scheduler, the fetcher downloads the web page content, and the processor executes the pre-written Python script, outputting results or generating new follow-up tasks (sent back to the scheduler), thus forming a closed loop.
  • Each script can use any Python library to parse the page, call the framework API to control the next fetch, and set callbacks to control parsing, as the sketch below illustrates.
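
As a minimal sketch of what such a script looks like (the seed URL and result fields below are illustrative assumptions, not taken from the original article), a handler subclasses BaseHandler, issues crawl requests, and parses each response in a callback:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}  # options shared by every crawl request

    @every(minutes=24 * 60)          # re-run on_start once a day
    def on_start(self):
        # Seed URL is illustrative; for JavaScript-rendered pages,
        # fetch_type='js' would hand the request to PhantomJS.
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)   # treat pages as fresh for 10 days
    def index_page(self, response):
        # Follow every outgoing link and hand it to detail_page.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The returned dict becomes the structured result of the task.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

Each self.crawl() call produces a new task for the scheduler, which is exactly the closed loop described above.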

Scrapy

Scrapy is an application framework written to crawl websites and extract structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (e.g. Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is versatile and can be used for data mining, monitoring, and automated testing.

Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is roughly as follows:

(Figure: Scrapy architecture)

Scrapy mainly includes the following components:

  • Engine (Scrapy Engine): handles the data flow of the whole system and triggers events (the core of the framework)
  • Scheduler: accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs (the pages or links to be crawled): it decides which URL to crawl next and removes duplicate URLs.
  • Downloader: downloads web content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model)
  • Spiders: extract the information they need, the so-called items (Item), from specific web pages. They can also extract links and let Scrapy continue crawling the next page. A minimal spider sketch follows this list.
  • Item Pipeline: processes the items extracted by the spiders. Its main jobs are to persist items, validate them, and strip unneeded information. When a page is parsed by a spider, its items are sent to the pipeline and processed by several components in a specific order.
  • Downloader Middlewares: a hook framework between the Scrapy engine and the downloader that mainly processes the requests and responses passing between them.
  • Spider Middlewares: a hook framework between the Scrapy engine and the spiders that mainly processes the spiders' response input and request output.
  • Scheduler Middlewares: middleware between the Scrapy engine and the scheduler that processes the requests and responses sent between them.
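
As a minimal sketch of a spider (the spider name, seed URL, and CSS selectors below are illustrative assumptions), a spider subclasses scrapy.Spider and yields items and follow-up requests from its parse() callback:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"                        # unique spider name (illustrative)
    start_urls = ["http://example.com/"]    # seed URL (illustrative)

    def parse(self, response):
        # Yield a structured item extracted from the page.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Yield follow-up requests so the engine keeps the loop going.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Run inside a Scrapy project with `scrapy crawl example`; the engine, scheduler, and downloader described above take care of the rest.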

The Scrapy running process is roughly as follows:

  • First, the engine takes a link (URL) from the scheduler for the next crawl
  • The engine wraps the URL in a request (Request) and passes it to the downloader, which downloads the resource and wraps it in a response (Response)
  • The spider then parses the Response
  • If items (Item) are parsed out, they are handed to the item pipeline for further processing (a pipeline sketch follows this list)
  • If links (URL) are parsed out, they are handed back to the scheduler to wait to be crawled
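
To illustrate that pipeline step, here is a minimal sketch of a pipeline component (the class name, output file, and validity check are assumptions for illustration):

```python
import json

from scrapy.exceptions import DropItem


class JsonWriterPipeline:
    """Persist each item the spiders yield as one JSON object per line."""

    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Simple validity check: drop items without a title.
        if not item.get("title"):
            raise DropItem("missing title")
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

It would be enabled through the ITEM_PIPELINES setting in the project's settings.py, e.g. ITEM_PIPELINES = {"myproject.pipelines.JsonWriterPipeline": 300}, where the module path is hypothetical and the number sets the pipeline order.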

Epilogue

 

Now that both frameworks have been introduced briefly, the next articles will cover how to install and use each of them. I hope this helps.
