Scrapy advanced usage

  1. Passing parameters between requests (depth crawling)

    • Used when the data to be crawled is not all on the same page

    • How to pass parameters along with a request

      • scrapy.Request(url, callback, meta={}): the meta dictionary is passed on to the callback, which is how one item can be filled in across several requests

      • The callback receives the item: read the dictionary back with response.meta (a minimal sketch follows this list)
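    • A minimal sketch of passing an item through meta (the spider name, URLs, and field names here are placeholders, not taken from a real project):
       import scrapy


       class DemoSpider(scrapy.Spider):
           name = 'demo'
           start_urls = ['https://example.com/list']  # placeholder URL

           def parse(self, response):
               for a in response.xpath('//a'):
                   item = {'title': a.xpath('./text()').extract_first()}  # item partially filled on the list page
                   detail_url = response.urljoin(a.xpath('./@href').extract_first())
                   # hand the partially filled item to the next callback through meta
                   yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

           def parse_detail(self, response):
               # read the dictionary back from response.meta and finish filling the item
               item = response.meta['item']
               item['content'] = ''.join(response.xpath('//p//text()').extract())
               yield item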

  2. Scrapy's five core components

    • Engine: inspects the type of data flowing through it to decide what to do next and triggers the corresponding transaction (the core of the framework)

    • Spider: extracts URLs and wraps them into request objects, which it hands to the engine (the engine forwards them to the scheduler); after a response comes back, it parses the data, stores it in an item, and sends the item through the engine to the pipeline for persistence

    • Scheduler: receives request objects, filters out duplicates, and pushes the rest into a queue; when the engine asks for the next request, it pops one from the queue and returns it to the engine, which passes it to the downloader; the downloader fetches the data from the Internet, wraps it in a response, and returns it to the engine, which hands the response to the spider for parsing

      • Filter: deduplicates request objects, dropping repeated requests

      • Queue: stores the request objects and pops them out on demand

    • Pipeline: accepts items from the engine and persists them (a minimal sketch follows this list)

    • Downloader: accepts request objects that the engine relays from the scheduler, downloads the data, and returns the response (the Scrapy downloader is built on the efficient asynchronous Twisted model)
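    • A minimal pipeline sketch that persists items to a JSON-lines file (the class name and file name are placeholders, shown only to illustrate the pipeline's role):
       import json


       class DemoPipeline(object):
           def open_spider(self, spider):
               # open the output file once when the spider starts
               self.fp = open('items.jsonl', 'w', encoding='utf-8')

           def process_item(self, item, spider):
               # called once for every item handed over by the engine
               self.fp.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
               return item  # pass the item on to the next pipeline, if any

           def close_spider(self, spider):
               self.fp.close()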

  3. Middleware

    • Downloader middleware

      • Intercepts all requests and responses in batch (a request-interception sketch follows this section)

        • Why intercept requests

          • To tamper with request header information (UA spoofing)

          • To apply proxy operations

        • Why intercept responses

          • To tamper with response data

          • To tamper with response objects: discard response objects that do not meet the requirements, or replace them with ones that do (this is what the NetEase example below does)

    • Spider middleware
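    • A minimal sketch of request interception in a downloader middleware, covering UA tampering and proxy operations (the UA strings, proxy address, and class name are placeholders); response interception is shown in the NetEase example below:
       import random


       class DemoDownloaderMiddleware(object):
           # placeholder UA strings and proxy addresses; substitute real ones
           user_agents = ['Mozilla/5.0 (placeholder UA 1)', 'Mozilla/5.0 (placeholder UA 2)']
           proxies = ['http://127.0.0.1:8888']

           def process_request(self, request, spider):
               request.headers['User-Agent'] = random.choice(self.user_agents)  # tamper with the UA header
               request.meta['proxy'] = random.choice(self.proxies)              # route the request through a proxy
               return None  # let the (modified) request continue through the engine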

  4. Crawling NetEase news data

    •  Spider file
       import scrapy
       import sys
       from selenium import webdriver
       from newsPro.items import NewsproItem


       class NewsSpider(scrapy.Spider):
           sys.path.append(r'E:\ES\IP\chromedriver.exe')  # add a search path
           name = 'news'
           # allowed_domains = ['www.xxx.com']
           start_urls = ['https://news.163.com/']
           # selenium browser used to obtain the dynamically loaded data; the response is replaced in the downloader middleware
           bro = webdriver.Chrome(executable_path=r'E:\ES\IP\chromedriver.exe')
           son_urls = []

           # parse the url of each news section
           def parse(self, response):
               li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
               # indexs = [3, 4, 6, 7]
               indexs = [3]
               for index in indexs:
                   son_url = li_list[index].xpath('./a/@href').extract_first()
                   self.son_urls.append(son_url)
                   # send a request for each section url
                   yield scrapy.Request(son_url, callback=self.parse_son_page)

           # parse the data of each section page
           def parse_son_page(self, response):
               # the dynamically loaded news data cannot be obtained from the current response
               item = NewsproItem()
               div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
               for div in div_list:
                   detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()
                   title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
                   if title:
                       item['title'] = title
                   if detail_url:
                       # pass the item to the next callback via meta
                       yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

           # parse the news content on each detail page
           def parse_detail(self, response):
               item = response.meta['item']
               # several text nodes have to be matched, so extract() is used rather than extract_first()
               content = response.xpath('//*[@id="endText"]//text()').extract()
               print(content)
               content = ''.join(content)
               item['content'] = content
               print(item['title'], item['content'])
               yield item
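    •  A sketch of the items.py the spider above assumes; the original file is not shown in the post, so the field set is inferred from the keys used ('title' and 'content'):
       import scrapy


       class NewsproItem(scrapy.Item):
           title = scrapy.Field()
           content = scrapy.Field()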
           
       
       
       
       
       
       
       
       
       
    •  Middleware file
       # -*- coding: utf-8 -*-
       
       # Define here the models for your spider middleware
       #
       # See documentation in:
       # https://docs.scrapy.org/en/latest/topics/spider-middleware.html
       import time
       from scrapy import signals
       from scrapy.http import HtmlResponse
       from time import sleep
       class NewsproDownloaderMiddleware(object):
       
          def process_request(self, request, spider):
       
              return None
           # intercept response objects and fix the ones that do not meet the requirements
          def process_response(self, request, response, spider):
              bro = spider.bro
              son_urls = spider.son_urls
              if request.url in son_urls:
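                   # this request targets one of the section pages, whose news list is loaded dynamically by JS,
                   # so render the page with the spider's shared selenium browser instead of using the original response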
                  bro.get(request.url)
                  sleep(2)
                  page_text = bro.page_source
                  new_response = HtmlResponse(url=request.url,body=page_text,encoding='utf-8',request=request)
                  return new_response
              else:
                  return response
       
          def process_exception(self, request, exception, spider):
       
              pass
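    •  For the middleware above to take effect it has to be enabled in settings.py; a minimal snippet, assuming the class lives in the project's default middlewares.py (the priority 543 is just the Scrapy template default):
       DOWNLOADER_MIDDLEWARES = {
           'newsPro.middlewares.NewsproDownloaderMiddleware': 543,
       }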
