Scrapy image data crawling

Requirement: master crawling the high-quality images from the material site.


I. Data parsing (image address)


Parse out the src attribute value of each image with XPath.
We only need to parse the img src attribute value and submit it to the pipeline; the pipeline will then issue a GET request for that src and fetch the image.

Spider file:

import scrapy
from imgsPro.items import ImgsproItem


class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    def parse(self, response):
        src_list = response.xpath('//div[@id="container"]/div')
        # print(src_list)
        for src_item in src_list:
            # Images are lazy-loaded: the src2 attribute only becomes src when
            # the browser scrolls the image into view.
            # Scrapy cannot scroll, so use the src2 (pseudo) attribute instead.
            src_content = src_item.xpath('./div/a/img/@src2').extract_first()
            print(src_content)
            item = ImgsproItem()
            item['src'] = src_content

            yield item


II. In the pipeline file, write a custom pipeline class based on ImagesPipeline

Three parent-class methods to override:
-get_media_requests
-file_path
-item_completed

Pipeline file:

import scrapy
from scrapy.pipelines.images import ImagesPipeline


class ImgsproPipeline(object):
    def process_item(self, item, spider):
        return item


# ImagesPipeline is a pipeline class dedicated to file downloads; its download
# process is asynchronous and multi-threaded.
class ImgPipeLine(ImagesPipeline):
    # Issue a request for the image referenced in the item
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Customize the image file name
    def file_path(self, request, response=None, info=None):
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        return item  # pass the item on to the next pipeline class
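For reference, the results argument that ImagesPipeline passes to item_completed is a list of (success, info) two-tuples; on success, info is a dict whose 'path' key holds the file path relative to IMAGES_STORE. A small sketch of inspecting it (the sample data below is illustrative, not from a real crawl):

```python
# Shape of the `results` argument ImagesPipeline hands to item_completed:
# a list of (ok, info) tuples; on success, info carries url/path/checksum.
results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'a.jpg', 'checksum': 'abc123'}),
    (False, Exception('download failed')),  # a failed download
]

# Collect the stored paths of the images that downloaded successfully
image_paths = [info['path'] for ok, info in results if ok]
print(image_paths)  # ['a.jpg']
```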

III. Specify the image storage directory in the configuration file

IMAGES_STORE = './imgs'

settings.py:
# USER_AGENT = 'firstBlood (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'  # disguise the identity of the request carrier

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False  # ignore / do not comply with the robots protocol

# Show only the specified type of log messages
LOG_LEVEL = 'ERROR'

# The directory where downloaded images are finally stored
IMAGES_STORE = './imgs'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32


# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'imgsPro.pipelines.ImgsproPipeline': 300,
   'imgsPro.pipelines.ImgPipeLine': 200,
}
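A note on the priorities above: lower values run earlier, so ImgPipeLine (200) downloads the image before ImgsproPipeline (300) sees the item. If only the download pipeline is needed, the setting can be trimmed to this sketch (pipeline path assumes the imgsPro project used throughout this post):

```python
# settings.py -- keep only the ImagesPipeline subclass; 200 places it early in the chain
ITEM_PIPELINES = {
   'imgsPro.pipelines.ImgPipeLine': 200,
}
```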

 


Origin www.cnblogs.com/xiao-apple36/p/12623211.html