Full-site data crawling
Most sites paginate the data they display, so crawling the full site means crawling the data behind every page number. How, then, do we crawl full-site data with Scrapy? There are three common approaches:
1. Store the URL of every page number in the spider file's start URL list (start_urls). (Not recommended; a sketch follows this list)
2. Manually initiate each request with the Request method. (Workable, but clumsy)
3. Use CrawlSpider with a link extractor. (Recommended)
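To make approach 1 concrete, here is a minimal sketch; the spider name is hypothetical and the page-URL pattern and 13-page count are borrowed from the example later in this post:

import scrapy


class AllPagesSpider(scrapy.Spider):
    name = 'all_pages'  # hypothetical spider name
    # hard-code one start URL per page number (assumes the site has 13 pages)
    start_urls = [
        'https://www.qiushibaike.com/text/page/%s/' % page
        for page in range(1, 14)
    ]

    def parse(self, response):
        # each page's response arrives here independently
        print(response)

This works, but the page count is frozen into the spider, which is why it is not recommended.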
Requirement: crawl the author and joke content from every page of Qiushibaike (糗事百科) and persist the data.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['www.qiubai.cn']
    start_urls = ['https://www.qiushibaike.com/8hr/page/1/']

    # link extractor rule that matches the URL of every page
    link = LinkExtractor(allow=r'/8hr/page/\d+/')
    rules = (
        # follow=True: keep applying the extractor to every followed page
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # start parsing the data we need here
        print(response)
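With follow=True, the link extractor is applied to every response the rule fetches, so page links that only become visible deeper in the pagination are still discovered; Scrapy's scheduler filters out duplicate requests automatically. With follow=False, only the page links present on the start page would be crawled.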
2. The manual alternative (construct each page's URL yourself)
# -*- coding: utf-8 -*-
import scrapy
from qiushibaike.items import QiushibaikeItem
# from scrapy.http import Request


class QiushiSpider(scrapy.Spider):
    name = 'qiushi'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    # state for crawling multiple pages
    pageNum = 1  # starting page number
    url = 'https://www.qiushibaike.com/text/page/%s/'  # URL template for each page

    def parse(self, response):
        div_list = response.xpath('//*[@id="content-left"]/div')
        for div in div_list:
            # e.g. //*[@id="qiushi_tag_120996995"]/div[1]/a[2]/h2
            author = div.xpath('.//div[@class="author clearfix"]//h2/text()').extract_first()
            author = author.strip('\n')
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            content = content.strip('\n')

            item = QiushibaikeItem()
            item['author'] = author
            item['content'] = content
            yield item  # hand the item to the pipeline for persistent storage

        # crawl the remaining pages recursively
        if self.pageNum < 13:  # crawl 13 pages in total (the site has 13 pages)
            self.pageNum += 1
            url = self.url % self.pageNum
            # the callback parses the response of the new request; reusing
            # self.parse makes the crawl recurse until the last page
            yield scrapy.Request(url=url, callback=self.parse)
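The spider imports QiushibaikeItem, but the post does not show its definition. Under the standard layout generated by scrapy startproject qiushibaike, a minimal items.py with just the two fields used above would look like this (a sketch, not shown in the original):

# qiushibaike/items.py (minimal sketch)
import scrapy


class QiushibaikeItem(scrapy.Item):
    author = scrapy.Field()   # the joke's author
    content = scrapy.Field()  # the joke's text content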