Scrapy framework - crawling paginated data with the CrawlSpider link extractor

Full-site data crawling

Most sites paginate the data they display, so crawling the page data behind every page number is what full-site data crawling means. How do you crawl full-site data with Scrapy?

Put the URL of every page number into the spider file's starting URL list (start_urls). (Not recommended; see the sketch after this list)

Issue the requests manually with the Request method. (Workable, but not ideal)

Use CrawlSpider's link extractor. (Recommended)
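
For comparison, here is a minimal sketch of the first (not recommended) option, which simply hard-codes one start URL per page number. The spider name and the 13-page count are assumptions for illustration, borrowed from the manual-pagination spider shown later:

# -*- coding: utf-8 -*-
import scrapy


class QiushiAllPagesSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'qiushi_all_pages'
    # every page URL is listed up front, which does not scale and must be edited whenever the page count changes
    start_urls = [
        'https://www.qiushibaike.com/text/page/%s/' % page
        for page in range(1, 14)  # assumes 13 pages, as in the example below
    ]

    def parse(self, response):
        print(response)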

Requirement: crawl the author and post content data from every page of Qiushibaike and store it persistently.

1. CrawlSpider link extractor approach (recommended)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.qiubai.cn']
    start_urls = ['https://www.qiushibaike.com/8hr/page/1/']
    # link extractor: matches the URL of every page number
    link = LinkExtractor(allow=r'/8hr/page/\d+/')
    rules = (
        # follow=True keeps extracting matching links from every crawled page
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # start parsing the needed data here
        print(response)
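
The parse_item above only prints the response object. A minimal sketch of filling it in to extract the author and content fields, assuming the same page structure (XPaths) and QiushibaikeItem as the manual-pagination spider below, with from qiushibaike.items import QiushibaikeItem added at the top of the file:

    def parse_item(self, response):
        # XPaths assumed from the manual-pagination spider below
        for div in response.xpath('//*[@id="content-left"]/div'):
            item = QiushibaikeItem()
            item['author'] = div.xpath('.//div[@class="author clearfix"]//h2/text()').extract_first()
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            yield item  # hand the item to the pipeline for persistent storage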

2. Alternative approach (build the URL format of each page yourself)

# -*- coding: utf-8 -*-
import scrapy
from qiushibaike.items import QiushibaikeItem
# from scrapy.http import Request


class QiushiSpider(scrapy.Spider):
    name = 'qiushi'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']
    # crawl multiple pages
    pageNum = 1  # starting page number
    url = 'https://www.qiushibaike.com/text/page/%s/'  # URL template for each page

    def parse(self, response):
        div_list = response.xpath('//*[@id="content-left"]/div')
        for div in div_list:
            # //*[@id="qiushi_tag_120996995"]/div[1]/a[2]/h2
            author = div.xpath('.//div[@class="author clearfix"]//h2/text()').extract_first()
            author = author.strip('\n')
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            content = content.strip('\n')
            item = QiushibaikeItem()
            item['author'] = author
            item['content'] = content
            yield item  # hand the item to the pipeline for persistent storage

        # crawl the data of all remaining pages
        if self.pageNum <= 13:  # 13 pages in total
            self.pageNum += 1
            url = self.url % self.pageNum
            # recursive crawling: callback is the callback function (request the url and parse the
            # response with parse again), i.e. parse is called recursively
            yield scrapy.Request(url=url, callback=self.parse)
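
Both spiders hand items to a pipeline for the persistent-storage part of the requirement, but the item and pipeline definitions are not shown in the post. A minimal sketch of what they might look like (the file contents below are an assumption, not the original author's code), with QiushibaikePipeline registered in ITEM_PIPELINES in settings.py:

# items.py (assumed)
import scrapy


class QiushibaikeItem(scrapy.Item):
    author = scrapy.Field()
    content = scrapy.Field()


# pipelines.py (assumed): write every item to a local text file
class QiushibaikePipeline(object):
    def open_spider(self, spider):
        self.fp = open('qiushi.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write('%s: %s\n' % (item['author'], item['content']))
        return item

    def close_spider(self, spider):
        self.fp.close()

Run either spider from the project directory with scrapy crawl qiubai or scrapy crawl qiushi.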

 

 
