Requirement: crawl the high-resolution images from the sc.chinaz.com material site.
I. Data parsing (image URL)
Parse out the src attribute value of each img tag with XPath.
Only the parsed src attribute value needs to be submitted to the pipeline; the pipeline will then issue a request for that src URL and fetch the image itself.
spider file:
import scrapy
from imgsPro.items import ImgsproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        # print(div_list)
        for div in div_list:
            # The images are lazy-loaded: the real URL sits in the pseudo-attribute
            # src2 and is moved into src only when the browser scrolls the image
            # into view. Scrapy does not scroll, so parse src2 instead of src.
            src_content = div.xpath('./div/a/img/@src2').extract_first()
            print(src_content)
            item = ImgsproItem()
            item['src'] = src_content
            yield item
II. Customize a pipeline class derived from ImagesPipeline in the pipelines file
Override three methods of the parent class:
-get_media_requests
-file_path
-item_completed
pipeline file:
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class ImgsproPipeline(object):
    def process_item(self, item, spider):
        return item


# ImagesPipeline is a pipeline class dedicated to file downloads;
# its download process is asynchronous and multi-threaded.
class ImgPipeLine(ImagesPipeline):

    # Issue a request for the image URL carried by the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Customize the stored file name
    def file_path(self, request, response=None, info=None):
        img_name = request.url.split('/')[-1]
        return img_name

    # Hand the item on to the next pipeline class to be executed
    def item_completed(self, results, item, info):
        return item
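The file_path override above names each stored file after the last path segment of its URL. That naming rule can be checked in isolation (the URL below is made up purely for illustration):

```python
# Stand-alone check of the naming rule used in file_path above:
# keep only the last segment of the image URL.
def image_name(url):
    return url.split('/')[-1]


# Hypothetical URL, for illustration only.
print(image_name('http://sc.chinaz.com/some/path/picture_01.jpg'))
# -> picture_01.jpg
```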
III. Specify the image storage directory in the settings file
IMAGES_STORE = './imgs'
settings file:
# USER_AGENT = 'firstBlood (+http://www.yourdomain.com)'
# UA spoofing: disguise the identity of the request carrier
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False  # ignore (do not comply with) the robots protocol

# Show only the specified type of log message
LOG_LEVEL = 'ERROR'

# Directory where the downloaded images are finally stored
IMAGES_STORE = './imgs'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'imgsPro.pipelines.ImgsproPipeline': 300,
    'imgsPro.pipelines.ImgPipeLine': 200,
}