Scrapy advanced usage

  1. Passing parameters between requests (depth crawling)

    • Used when the data to be crawled is not all on the same page

    • How to pass parameters along with a request

      • scrapy.Request(url, callback, meta={}): the meta dictionary is passed on to the callback, which is how one item can be filled in across several requests

      • The callback receives the item: read the dictionary back with response.meta (a minimal sketch follows this list)
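    • A minimal sketch of passing an item through meta (the spider name, URLs, and field names here are placeholders, not taken from a real project):
       import scrapy


       class DemoSpider(scrapy.Spider):
           name = 'demo'
           start_urls = ['https://example.com/list']  # placeholder URL

           def parse(self, response):
               for a in response.xpath('//a'):
                   item = {'title': a.xpath('./text()').extract_first()}  # item partially filled on the list page
                   detail_url = response.urljoin(a.xpath('./@href').extract_first())
                   # hand the partially filled item to the next callback through meta
                   yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

           def parse_detail(self, response):
               # read the dictionary back from response.meta and finish filling the item
               item = response.meta['item']
               item['content'] = ''.join(response.xpath('//p//text()').extract())
               yield item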

  2. Scrapy's five core components

    • Engine: inspects the type of data flowing through it to decide what to do next and triggers the corresponding transaction (the core of the framework)

    • Spider: extracts URLs and wraps them into request objects, which it hands to the engine (the engine forwards them to the scheduler); after a response comes back, it parses the data, stores it in an item, and sends the item through the engine to the pipeline for persistence

    • Scheduler: receives request objects, filters out duplicates, and pushes the rest into a queue; when the engine asks for the next request, it pops one from the queue and returns it to the engine, which passes it to the downloader; the downloader fetches the data from the Internet, wraps it in a response, and returns it to the engine, which hands the response to the spider for parsing

      • Filter: deduplicates request objects, dropping repeated requests

      • Queue: stores the request objects and pops them out on demand

    • Pipeline: accepts items from the engine and persists them (a minimal sketch follows this list)

    • Downloader: accepts request objects that the engine relays from the scheduler, downloads the data, and returns the response (the Scrapy downloader is built on the efficient asynchronous Twisted model)
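    • A minimal pipeline sketch that persists items to a JSON-lines file (the class name and file name are placeholders, shown only to illustrate the pipeline's role):
       import json


       class DemoPipeline(object):
           def open_spider(self, spider):
               # open the output file once when the spider starts
               self.fp = open('items.jsonl', 'w', encoding='utf-8')

           def process_item(self, item, spider):
               # called once for every item handed over by the engine
               self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
               return item  # pass the item on to the next pipeline, if any

           def close_spider(self, spider):
               self.fp.close()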

  3. Middleware

    • Downloader middleware

      • Intercepts all requests and responses in batch (a request-interception sketch follows this section)

        • Why intercept requests

          • To tamper with request header information (UA spoofing)

          • To apply proxy operations

        • Why intercept responses

          • To tamper with response data

          • To tamper with response objects: discard response objects that do not meet the requirements, or replace them with ones that do (this is what the NetEase example below does)

    • Spider middleware
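    • A minimal sketch of request interception in a downloader middleware, covering UA tampering and proxy operations (the UA strings, proxy address, and class name are placeholders); response interception is shown in the NetEase example below:
       import random


       class DemoDownloaderMiddleware(object):
           # placeholder UA strings and proxy addresses; substitute real ones
           user_agents = ['Mozilla/5.0 (placeholder UA 1)', 'Mozilla/5.0 (placeholder UA 2)']
           proxies = ['http://127.0.0.1:8888']

           def process_request(self, request, spider):
               request.headers['User-Agent'] = random.choice(self.user_agents)  # tamper with the UA header
               request.meta['proxy'] = random.choice(self.proxies)              # route the request through a proxy
               return None  # let the (modified) request continue through the engine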

  4. Crawling NetEase news data

    •  Spider file
       import scrapy
       import sys
       from selenium import webdriver
       from newsPro.items import NewsproItem


       class NewsSpider(scrapy.Spider):
           sys.path.append(r'E:\ES\IP\chromedriver.exe')  # add a search path
           name = 'news'
           # allowed_domains = ['www.xxx.com']
           start_urls = ['https://news.163.com/']
           # selenium browser used to obtain the dynamically loaded data; the response is replaced in the downloader middleware
           bro = webdriver.Chrome(executable_path=r'E:\ES\IP\chromedriver.exe')
           son_urls = []

           # parse the url of each news section
           def parse(self, response):
               li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
               # indexs = [3, 4, 6, 7]
               indexs = [3]
               for index in indexs:
                   son_url = li_list[index].xpath('./a/@href').extract_first()
                   self.son_urls.append(son_url)
                   # send a request for each section url
                   yield scrapy.Request(son_url, callback=self.parse_son_page)

           # parse the data of each section page
           def parse_son_page(self, response):
               # the dynamically loaded news data cannot be obtained from the current response
               item = NewsproItem()
               div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
               for div in div_list:
                   detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
                   title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
                   if title:
                       item['title'] = title
                   if detail_url:
                       # pass the item to the next callback via meta
                       yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

           # parse the news content on each detail page
           def parse_detail(self, response):
               item = response.meta['item']
               # several text nodes have to be matched, so extract() is used rather than extract_first()
               content = response.xpath('//*[@id="endText"]//text()').extract()
               print(content)
               content = ''.join(content)
               item['content'] = content
               print(item['title'], item['content'])
               yield item
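    •  A sketch of the items.py the spider above assumes; the original file is not shown in the post, so the field set is inferred from the keys used ('title' and 'content'):
       import scrapy


       class NewsproItem(scrapy.Item):
           title = scrapy.Field()
           content = scrapy.Field()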
           
       
       
       
       
       
       
       
       
       
    •  Middleware file
       # -*- coding: utf-8 -*-
       
       # Define here the models for your spider middleware
       #
       # See documentation in:
       # https://docs.scrapy.org/en/latest/topics/spider-middleware.html
       import time
       from scrapy import signals
       from scrapy.http import HtmlResponse
       from time import sleep
       class NewsproDownloaderMiddleware(object):
       
          def process_request(self, request, spider):
       
              return None
           # intercept response objects and fix the ones that do not meet the requirements
          def process_response(self, request, response, spider):
              bro = spider.bro
              son_urls = spider.son_urls
              if request.url in son_urls:
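                   # this request targets one of the section pages, whose news list is loaded dynamically by JS,
                   # so render the page with the spider's shared selenium browser instead of using the original response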
                  bro.get(request.url)
                  sleep(2)
                  page_text = bro.page_source
                  new_response = HtmlResponse(url=request.url,body=page_text,encoding='utf-8',request=request)
                  return new_response
              else:
                  return response
       
          def process_exception(self, request, exception, spider):
       
              pass
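    •  For the middleware above to take effect it has to be enabled in settings.py; a minimal snippet, assuming the class lives in the project's default middlewares.py (the priority 543 is just the Scrapy template default):
       DOWNLOADER_MIDDLEWARES = {
           'newsPro.middlewares.NewsproDownloaderMiddleware': 543,
       }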
