Full-site data crawling with CrawlSpider
1. CrawlSpider is a subclass of Spider.
2. Usage:
Create a CrawlSpider-based spider file: scrapy genspider -t crawl spidername
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class myspider(CrawlSpider):
    name = 'pra_crawlspider'
    start_urls = ['http://pic.netbian.com/']

    rules = [
        # Instantiate a Rule (rule parser) object: extract pagination links
        # and keep following them.
        Rule(LinkExtractor(restrict_xpaths='//div[@class="page"]'),
             callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        imgs = response.xpath('//div[@class="slist"]//img')
        for img in imgs:
            print(img.xpath('./@src').extract_first())
            print(img.xpath('./@alt').extract_first())
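The print calls above are only for demonstration. As a sketch (not part of the original example), parse_item would normally yield items so the data can be exported, e.g. with scrapy crawl pra_crawlspider -o imgs.json, or handled by an item pipeline:

    def parse_item(self, response):
        # Yield one dict per image instead of printing; Scrapy's feed export
        # or an item pipeline can then persist the data.
        for img in response.xpath('//div[@class="slist"]//img'):
            yield {
                'src': img.xpath('./@src').extract_first(),
                'alt': img.xpath('./@alt').extract_first(),
            }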
After the spider starts, a request is sent for each URL in start_urls and the response is handed to CrawlSpider's built-in parse method. That method applies the LinkExtractor defined in each Rule to the response, extracts the matching URLs, and keeps sending follow-up requests for them; each resulting response is passed to the callback named in the Rule (here parse_item) for processing, and because follow=True it is also searched again for more links.
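To make this concrete, here is a small standalone sketch (the HTML snippet is made up) of what the LinkExtractor in the Rule above does with a response: it pulls the pagination links that CrawlSpider then requests automatically.

    from scrapy.http import HtmlResponse
    from scrapy.linkextractors import LinkExtractor

    # Fake page markup containing a pagination block like the target site's.
    html = b'''
    <div class="page">
      <a href="/index_2.html">2</a>
      <a href="/index_3.html">3</a>
    </div>
    '''
    response = HtmlResponse(url='http://pic.netbian.com/', body=html, encoding='utf-8')

    # Same extractor as in the Rule: only links inside div.page are considered.
    le = LinkExtractor(restrict_xpaths='//div[@class="page"]')
    for link in le.extract_links(response):
        print(link.url)  # e.g. http://pic.netbian.com/index_2.html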