Python crawlers: learning Scrapy

On the road to learning web crawlers, Scrapy is an essential stop. Many of you may be getting started with Scrapy right now, so great, let's learn it together. Newcomers to Scrapy often have doubts; after all, it is a full framework, and it can be hard to know where to begin. Starting from this post, I am opening a series on Scrapy to share how to get started quickly and become familiar with it.

As the first post in the series, this article introduces Scrapy and gives an overview of it. At the end I will recommend a book on Scrapy, as well as several other ways to learn it.

Why use a crawler framework?

If you already have a certain understanding of crawler basics, then it is time to look at crawler frameworks. So why use a crawler framework at all?

  • The fundamental reason to learn a framework is to learn a programming mindset, not just how to operate a tool. Mastering a framework is really the process of understanding a way of thinking.
  • A framework also brings great convenience to development. Much of the routine machinery is already written, so we do not need to reinvent the wheel; we only implement the custom functionality our task needs, which greatly reduces the workload.
  • Reading and learning from well-written framework code also improves our own programming ability.

I approached crawler frameworks with exactly these points in mind. But remember: the goal is to master the core ideas of a framework; only by grasping those ideas can you use it well, and even extend it.

An introduction to the Scrapy framework

The most popular crawler frameworks are Scrapy and pyspider, but the community favorite is, I think, none other than Scrapy. Scrapy is an open-source, high-level crawler framework; we could almost call it the "Scrapy language." It is written in Python and designed for crawling web pages and extracting structured data, and the structured data it produces lends itself well to data analysis and data mining. Scrapy has the following characteristics:

  • Scrapy is event-driven and uses Twisted to implement non-blocking, asynchronous network operations. Compared with traditional blocking requests, this greatly improves CPU utilization and crawling efficiency.
  • Configuration is simple: complex functionality can often be enabled with a single line of code.
  • It is extensible, with a rich plug-in ecosystem, for example distributed crawling with scrapy + redis and crawler visualization plug-ins.
  • Parsing is easy: Scrapy wraps parsers such as XPath and provides a more convenient, higher-level Selector interface that copes well with broken HTML and messy encodings.

Which is better: Scrapy, or requests + BeautifulSoup?

Some friends ask: why use Scrapy at all? Can't a requests + BeautifulSoup combination do the job?

Don't agonize over it; choose whatever is convenient for you. requests + BeautifulSoup certainly works, and in fact requests plus any parser is a fine combination. The advantage is flexibility: we write our own code and are not tied to a fixed pattern. A fixed framework is not always convenient, either; for example, Scrapy's handling of anti-crawling countermeasures is not perfect, and you often have to work around those problems yourself.
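As a small illustration of "requests plus any parser will do", the same idea works even with nothing but the standard library. The HTML snippet and tag choices below are made up for the example; in practice you would first fetch the page with requests or urllib:

```python
# Parsing HTML with only the standard library (html.parser).
# The hard-coded snippet stands in for a page you downloaded yourself.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text of every <h2> element it sees."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

html = "<html><body><h2>First post</h2><p>text</p><h2>Second post</h2></body></html>"
parser = TitleParser()
parser.feed(html)
print(parser.titles)  # ['First post', 'Second post']
```

With BeautifulSoup the same extraction is a one-liner, which is exactly the convenience the combination offers.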

But for small and medium-sized crawling tasks, Scrapy is indeed an excellent choice: it saves us from writing repetitive code and has excellent performance. If we write everything ourselves, for instance adding multithreading or asynchronous code each time to improve crawling efficiency, we waste a great deal of development time. In those cases there is no better option than an already-written framework: we only need to write the parsing rules and the pipeline. So what exactly does Scrapy handle for us? The following diagram helps to understand.

[Figure: diagram of what Scrapy handles for us; original image not preserved]

So which one to use is determined by individual needs and preferences. As for learning order, I recommend learning requests + BeautifulSoup first and then moving on to Scrapy; that order may work better, but it is for reference only.

Scrapy architecture

Before studying Scrapy, we need to understand its architecture; understanding the architecture is essential to learning Scrapy.

[Figure: Scrapy architecture diagram, from the official Scrapy documentation]

The following descriptions are taken from the official documentation. They are quite clear on their own; read them alongside the diagram above and you will understand.

Components

Scrapy Engine
The engine is responsible for controlling the flow of data between all components of the system, and for triggering events when certain actions occur. See the Data Flow section below for details.

Scheduler
The scheduler receives requests from the engine and enqueues them, so that it can feed them back to the engine later when the engine asks for them.

Downloader
The downloader is responsible for fetching page data and providing it to the engine, which then passes it on to the spiders.

Spiders
Spiders are classes written by Scrapy users to parse responses and extract items (i.e., the scraped data) or additional URLs to follow. Each spider is responsible for handling one specific site (or a group of sites).

Item Pipeline
The Item Pipeline is responsible for processing the items extracted by the spiders. Typical tasks include cleaning, validation, and persistence (for example, storing items in a database).
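A pipeline is essentially a class with a process_item method. The sketch below (the field names and cleaning rules are my own assumptions, not from this article) shows the typical clean/validate/persist pattern; it is plain Python, so it runs even without Scrapy installed:

```python
# Sketch of an Item Pipeline: clean, validate, then "persist" an item.
# In a real project this class lives in pipelines.py and is enabled via
# the ITEM_PIPELINES setting; real Scrapy raises scrapy.exceptions.DropItem.
class CleanAndStorePipeline:
    def __init__(self):
        self.stored = []  # stand-in for a database

    def process_item(self, item, spider):
        # Cleaning: strip whitespace from string fields.
        item = {k: v.strip() if isinstance(v, str) else v
                for k, v in item.items()}
        # Validation: reject items missing a required field.
        if not item.get("title"):
            raise ValueError("missing title")  # real Scrapy: raise DropItem
        # Persistence: append to our fake database.
        self.stored.append(item)
        return item

pipeline = CleanAndStorePipeline()
pipeline.process_item({"title": "  Scrapy intro  ", "price": 10}, spider=None)
print(pipeline.stored)  # [{'title': 'Scrapy intro', 'price': 10}]
```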

Downloader middlewares
Downloader middlewares are specific hooks between the engine and the downloader that process the responses the downloader passes to the engine. They provide a simple mechanism for extending Scrapy by plugging in custom code.

Spider middlewares
Spider middlewares are specific hooks between the engine and the spiders that process spider input (responses) and output (items and requests). They provide a simple mechanism for extending Scrapy by plugging in custom code.
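Conceptually, a downloader middleware exposes hooks such as process_request and process_response. This plain-Python sketch mimics the idea without Scrapy, using dicts to stand in for Request and Response objects; the class name and header value are invented for illustration:

```python
# Conceptual sketch of a downloader middleware, with dicts standing in
# for Scrapy's Request/Response objects. In real Scrapy the hooks are
# process_request(request, spider) and process_response(request, response, spider).
class FakeUserAgentMiddleware:
    def process_request(self, request, spider):
        # Inject a User-Agent header before the request reaches the downloader.
        request.setdefault("headers", {})["User-Agent"] = "my-crawler/0.1"
        return None  # None means: continue processing this request normally

    def process_response(self, request, response, spider):
        # Pass responses through unchanged (a real one might retry or redirect).
        return response

mw = FakeUserAgentMiddleware()
req = {"url": "http://example.com"}
mw.process_request(req, spider=None)
print(req["headers"]["User-Agent"])  # my-crawler/0.1
```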

Data flow

  1. The engine opens a website (opens a domain), finds the spider that handles that site, and asks the spider for the first URL(s) to crawl.
  2. The engine gets the first URL to crawl from the spider and schedules it as a Request with the scheduler (Scheduler).
  3. The engine asks the scheduler for the next URL to crawl.
  4. The scheduler returns the next URL to crawl to the engine, and the engine forwards it through the downloader middlewares (request direction) to the downloader (Downloader).
  5. Once the page is downloaded, the downloader generates a Response for the page and sends it back through the downloader middlewares (response direction) to the engine.
  6. The engine receives the Response from the downloader and sends it through the spider middlewares (input direction) to the spider for processing.
  7. The spider processes the Response and returns scraped Items and (follow-up) new Requests to the engine.
  8. The engine sends the Items returned by the spider to the Item Pipeline, and the Requests returned by the spider to the scheduler.
  9. The process repeats from step 2 until there are no more requests in the scheduler, and the engine closes the site.
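The nine steps above can be sketched as a toy loop in plain Python: the scheduler is a queue, the "downloader" returns canned pages, and the "spider" extracts items plus follow-up requests. All URLs and page contents are invented for the illustration:

```python
# Toy model of the Scrapy data flow: engine loop, scheduler queue,
# fake downloader, and a spider callback. All pages here are invented.
from collections import deque

PAGES = {
    "http://example.com/": {"items": ["item-1"], "links": ["http://example.com/next"]},
    "http://example.com/next": {"items": ["item-2"], "links": []},
}

def download(url):                    # steps 4-5: downloader fetches the page
    return PAGES.get(url, {"items": [], "links": []})

def parse(response):                  # steps 6-7: spider extracts items and new URLs
    return response["items"], response["links"]

def crawl(start_url):
    scheduler = deque([start_url])    # steps 1-2: the first request is scheduled
    collected, seen = [], {start_url}
    while scheduler:                  # steps 3 and 9: loop until the queue is empty
        url = scheduler.popleft()
        items, links = parse(download(url))
        collected.extend(items)       # step 8: items go to the pipeline
        for link in links:            # step 8: new requests go back to the scheduler
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return collected

print(crawl("http://example.com/"))  # ['item-1', 'item-2']
```

The real engine does all of this asynchronously via Twisted, but the shape of the loop, and who hands what to whom, is the same.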

 


Source: www.cnblogs.com/mxk123/p/12007019.html