Scrapy framework
Architecture
- Engine: processes data streams and triggers transactions.
- Item: defines the data structure of the results (a class).
- Scheduler: handles the request queue.
- Downloader: downloads pages for requests.
- Spiders: crawl logic and page-parsing rules.
- Item Pipeline: processes the result data (cleaning, storage, etc.).
- Downloader Middlewares
- Spider Middlewares
data flow
- The command line starts a spider (subproject).
- The Engine finds the corresponding Spider and gets the first URL.
- The Spider sends the first URL, as a request, to the Scheduler via the Engine.
- The Scheduler returns queued requests to the Engine, and the Engine sends them through the middleware to the Downloader to request the pages.
- After the Downloader completes a request, it sends the response to the Engine, which forwards it to the Spider.
- The Spider processes the response, extracting items or new requests, and returns them to the Engine.
- The Engine sends items to the Item Pipeline and new requests to the Scheduler.
Project structure
scrapy startproject tutorial
scrapy genspider quotes quotes.toscrape.com
project
item
Used to store data objects; usage is similar to a dictionary (item['title'] = 'title'). Define the required fields.
spider
- name: the name of the Spider subproject (used to start it: scrapy crawl name; must be unique).
- allowed_domains: Domain names that this subproject is allowed to crawl.
- start_urls: A list of Spider's initial startup urls.
- parse(): after the URLs in start_urls are requested, the response is passed to this method for processing; it then returns items or new requests.
```python
def parse(self, response):
    article_list = response.css(".common-list")
    article_list = article_list.xpath(".//ul/li")
    for article in article_list:
        item = ArticleItem()
        item['link'] = article.xpath(".//a/@href").extract_first()
        item['title'] = article.xpath(".//a/@title").extract_first()
        item['content'] = article.xpath(".//a/text()").re_first('(.*)')
        # yield item  # return the item object to the pipelines for processing
        # yield scrapy.Request(url=item['link'], meta={'item': item}, callback=self.parse_detail)
        # return a new request; its response is handled by the custom parse_detail
        # method; pass and retrieve any extra data via meta
```
pipeline
After the data extraction is complete and the item is constructed, it can be saved to a file by adding command-line parameters when starting the subproject, for example:
- scrapy crawl quotes -o quotes.json # save as a JSON file
- scrapy crawl quotes -o quotes.csv # also supports csv|xml|pickle|marshal
- scrapy crawl quotes -o ftp://user:[email protected]/path/to/quotes.csv # can also save to a remote server
For more complex requirements, you can write a custom pipeline:
pipeline usage
- A pipeline is defined as a Pipeline class with a process_item() method. Once the class is enabled, this method is called for every item. It must return an item object or raise a DropItem exception (from scrapy.exceptions import DropItem).
- In the following example, two pipeline classes are defined:
    - The first class, ArticlePipeline, cleans the data in the item object, e.g. keeps only the first 50 characters of the title.
    - The second class, MongoPipeline, stores item objects in a MongoDB database. In the from_crawler class method, the global configuration in settings.py is read through the crawler parameter. open_spider and close_spider are called when the spider opens and closes, respectively.
pipeline call method
- After the spider returns an item, every enabled Pipeline class is called in priority order.
- After the Pipeline class is defined, if you need to call it, you need to add the class to ITEM_PIPELINES in settings.py.
usage
Usage of spider
Defines the actions for crawling the website and the rules for parsing the crawled pages.
- name: a string that defines the spider's name. It determines how Scrapy locates and initializes the spider, and it must be unique. A common convention is to name it after the domain being crawled.
- allowed_domains: domain names allowed to be crawled. This is optional; links outside this range will not be followed.
- start_urls: the initial URLs; effectively the initial contents of the scheduler's queue.
- custom_settings: a class variable holding spider-specific configuration that overrides the global settings.
- crawler: set by from_crawler(); used to read configuration information.
- settings: a Settings object for reading global settings.
- start_requests(): generates the initial requests. Must return an iterable object.
- parse(): called for responses whose request has no dedicated callback.
- closed(): called when the spider closes. Used to release resources, etc.
Usage of Downloader Middleware
Requests on their way from the scheduler to the downloader, and responses on their way from the downloader to the spider, both pass through the downloader middleware.
Description
Download middleware can be used to modify User-Agent, handle redirection, set proxy, retry on failure, set cookies and other functions.
- Each middleware contains one or more methods; there are three core methods:
    - process_request(request, spider)  # called before the engine sends the request to the downloader; returns None / Response / Request, or raises an exception
    - process_response(request, response, spider)  # called after the downloader executes the request, before the response reaches the spider; returns Response / Request, or raises an exception
    - process_exception(request, exception, spider)  # called when the downloader or process_request() raises an exception; returns None / Response / Request
- Scrapy provides built-in middleware for retry-on-failure, automatic redirection, and so on. Middlewares execute in priority order: for process_request(), the smaller the priority number, the earlier it is called; for process_response(), the larger the priority number, the earlier it is called.
return value
- The return value of process_request() is None, a Response object, or a Request object, or it raises an IgnoreRequest exception:
    - None: lower-priority middlewares continue to execute process_request() in turn (this can be understood as further modifying the request).
    - Response: the remaining process_request() calls are skipped and the process_response() methods are executed instead (equivalent to already having a response to process).
    - Request: lower-priority middlewares stop executing, and the Request is added to the scheduler (equivalent to rebuilding a new request).
    - IgnoreRequest raised: the process_exception() methods are executed in turn; if the exception is still unhandled, the request's errback() is called; if it is still unhandled, the request is ignored.
- The return value of process_response() is a Response object or a Request object, or it raises an IgnoreRequest exception:
    - Response: lower-priority middlewares continue to execute process_response().
    - Request: process_response() in lower-priority middlewares is not called, and the new request is sent to the scheduler.
    - IgnoreRequest raised: the request's errback() is called.
- The return value of process_exception() is None, a Response object, or a Request object:
    - None: lower-priority middlewares continue to call process_exception().
    - Response: the lower-priority process_response() methods start being called.
    - Request: lower-priority middlewares are no longer called, and the new request is sent to the scheduler.
use
In the example above, a middleware is defined:
- In process_request(), a random User-Agent request header is set.
- In process_response(), the response's 302 status code is replaced with 200.
- Finally, add the middleware to DOWNLOADER_MIDDLEWARES in settings.py.
Scrapy native downloader middleware
```python
DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}
```
Usage of Spider Middleware
Responses on their way from the downloader to the spider, and the items or requests generated by the spider, are processed by the spider middleware.
Description
core method
Each middleware defines one or more of the following four methods:
- process_spider_input(response, spider)
    - Parameters: response - the response to be processed; spider - the corresponding spider
    - Return value: None, or it raises an exception
        - When it returns None, lower-priority middlewares continue to be called.
        - When an exception is raised, lower-priority middlewares are not executed; the request's errback() is called, and the errback's output is fed back into the middleware chain through process_spider_output().
- process_spider_output(response, result, spider)
    - Parameters: response - the response being processed; result - an iterable of Request or item objects (the Spider's return value)
    - Return value: an iterable of Request or item objects
- process_spider_exception(response, exception, spider)
    - Parameters: response - the response being processed; exception - the exception raised; spider - the spider that raised it
    - Return value: None, or an iterable of Request or item objects
        - When it returns None, lower-priority middlewares continue to be called to handle the exception.
        - When an iterable is returned, the lower-priority process_spider_output() methods are called.
- process_start_requests(start_requests, spider)
    - Parameters: start_requests - an iterable of Request objects; spider - the spider object
    - Return value: an iterable of Request objects
use
```python
class OffsiteMiddleware:

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.stats)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def process_spider_output(self, response, result, spider):
        for x in result:
            if isinstance(x, Request):
                if x.dont_filter or self.should_follow(x, spider):
                    yield x
                else:
                    domain = urlparse_cached(x).hostname
                    if domain and domain not in self.domains_seen:
                        self.domains_seen.add(domain)
                        logger.debug(
                            "Filtered offsite request to %(domain)r: %(request)s",
                            {'domain': domain, 'request': x},
                            extra={'spider': spider})
                        self.stats.inc_value('offsite/domains', spider=spider)
                        self.stats.inc_value('offsite/filtered', spider=spider)
            else:
                yield x
```
Scrapy native Spider Middleware
```python
SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}
```
Usage of Item Pipeline
Main functions of the item pipeline:
- Clean HTML data
- Validate crawled data, check crawled fields
- Check for duplicates and discard duplicates
- Save the crawling results to the database
core method
In addition to the required process_item(), there are several other practical methods:
- open_spider(spider)
- close_spider(spider)
- from_crawler(cls,crawler)
Scrapy provides Pipelines that handle downloads, including file downloads and image downloads. Downloading files and images works the same way as fetching pages, so downloads run asynchronously and are very efficient.
Built-in image downloader Pipeline
```python
class ImagesPipeline(FilesPipeline):
    """Abstract pipeline that implement the image thumbnail generation logic"""

    MEDIA_NAME = 'image'

    # Uppercase attributes kept for backward compatibility with code that subclasses
    # ImagesPipeline. They may be overridden by settings.
    MIN_WIDTH = 0
    MIN_HEIGHT = 0
    EXPIRES = 90
    THUMBS = {}
    DEFAULT_IMAGES_URLS_FIELD = 'image_urls'
    DEFAULT_IMAGES_RESULT_FIELD = 'images'

    def __init__(self, store_uri, download_func=None, settings=None):
        # ...
        if not hasattr(self, "IMAGES_RESULT_FIELD"):
            self.IMAGES_RESULT_FIELD = self.DEFAULT_IMAGES_RESULT_FIELD
        if not hasattr(self, "IMAGES_URLS_FIELD"):
            self.IMAGES_URLS_FIELD = self.DEFAULT_IMAGES_URLS_FIELD
        self.images_urls_field = settings.get(
            resolve('IMAGES_URLS_FIELD'),
            self.IMAGES_URLS_FIELD
        )

    def get_media_requests(self, item, info):
        urls = ItemAdapter(item).get(self.images_urls_field, [])
        return [Request(u) for u in urls]
```
As you can see, this Pipeline reads the item's image_urls field by default and treats it as a list. In get_media_requests(), it iterates over the field and generates an iterable of Request objects, which are handed to the scheduler.
The ImagesPipeline class can be overridden as needed:
```python
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class ImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        yield Request(item['image_url'])
```
Here, get_media_requests is overridden to take the item's image_url field directly and initiate a request for it.
Similarly
```python
class ImagesPipeline(ImagesPipeline):
    # ...

    def file_path(self, request, response=None, info=None, *, item=None):
        # use the last segment of the image URL as the file name
        url = request.url
        file_name = url.split("/")[-1]
        return file_name

    def item_completed(self, results, item, info):
        images_paths = [x['path'] for ok, x in results if ok]
        if not images_paths:
            item['image_url'] = None
        return item
```
file_path() is overridden to use part of the image's URL as the file name. item_completed() is overridden so that, after the image requests complete, the item's image_url field is set to None if the download failed.
Finally add to settings.py
```python
ITEM_PIPELINES = {
    'tutorial.pipelines.ImagesPipeline': 301,
    'tutorial.pipelines.ArticlePipeline': 301,
    'tutorial.pipelines.MongoPipeline': 400,
}
```
Integrating a cookies pool
Integrating a proxy pool
crawl template
View available templates: scrapy genspider -l
Create a spider from a template:
scrapy genspider -t crawl china tech.china.com
rules: a CrawlSpider attribute containing a list of one or more Rule objects; each Rule defines an action for crawling the website.
Define Rules
Different parse callbacks should be implemented for different crawl targets.
```python
class ChinaSpider(CrawlSpider):
    name = 'china'
    allowed_domains = ['tech.china.com']
    start_urls = ['http://tech.china.com/articles']
    rules = (
        # Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'article\/.*\.html',
                           restrict_xpaths='//div[@id="left_side"]//div[@class="com_item"]'),
             callback='parse_item'),
        # follow the "next page" link (link text "下一页")
        Rule(LinkExtractor(restrict_xpaths='//div[@id="pageStyle"]//a[contains(text(),"下一页")]')),
    )

    def parse_item(self, response):
        loader = ChinaLoader(item=NewsItem(), response=response)
        loader.add_xpath('title', '//h1[@id="chan_newsTtile"]/text()')
        loader.add_value('content', '新闻内容')  # placeholder value: "news content"
        yield loader.load_item()
```
gerapy framework
Common commands
(You need to start the scrapyd service first, which requires installing the scrapyd library: pip install scrapyd.)
- pip install gerapy
- gerapy init
- gerapy migrate
- gerapy createsuperuser # create an admin user
- gerapy runserver