The Spider Component of the Scrapy Crawling Framework

I. Definition of the Spider Component

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

In short, a spider is a class that crawls a specific set of web pages.

For spiders, the scraping cycle goes through something like this:

You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.

The first requests to perform are obtained by calling the start_requests() method, which (by default) generates a Request for each URL specified in start_urls, with the parse method as the callback function for those Requests. In short: the spider issues requests for the initial URLs and registers a callback.
In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same one); Scrapy downloads them and passes each response to the specified callback. In short: the callback is where the spider produces Item data.
In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data. In short: the spider uses selectors, both CSS and XPath, to read data out of the fetched pages.
Finally, the items returned from the spider will typically be persisted to a database (in some Item Pipeline) or written to a file using Feed exports. In short: the scraped data ends up stored in a file or in a database.

II. Basic Spider Class Definition

III. Basic Attributes and Lifecycle Methods of a Spider

Basic attributes

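1. name: the spider's name, which uniquely identifies the spider within the project

2. allowed_domains: an optional list of domains the spider is allowed to crawl

3. start_urls: the list of URLs the crawl starts from

4. custom_settings: a dict of settings that override the project-wide configuration for this spider; it must be defined as a class attribute

5. crawler: the Crawler object the spider is bound to, set by from_crawler()

6. settings: the configuration this spider runs with

7. logger: a Python logger created with the spider's name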

Basic lifecycle methods

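1. from_crawler(crawler, *args, **kwargs): the class method Scrapy uses to create the spider; it binds the crawler and settings attributes

2. start_requests(): returns an iterable with the first Requests to crawl

3. parse(response): the default callback for responses whose Request does not specify one

4. closed(reason): called when the spider closes

5. log(message[, level, component]): a wrapper that sends a log message through the spider's logger, kept for backward compatibility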