[Scrapy] Downloading images with Scrapy and customizing the image pipeline, feature by feature

1. Disguise the crawler with browser request headers and set up a proxy IP

Set the request headers in a custom ImagesPipeline, or set USER_AGENT in settings.py. In principle the image pipeline behaves like a middleware here: a middleware intercepts the requests and responses being sent, which can then be modified further.
For example:

# For example, add this to the custom pipeline
    def get_media_requests(self, item, info):

        image_url = item["pic_url"]
        # Request headers, mainly to get past anti-crawler checks
        header = {
            "referer": item["referer"],
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
        }
        yield scrapy.Request(image_url, headers=header)

The proxy IP is set by the following code, which uses a downloader middleware.

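import random

# Assumed definition: the original snippet uses user_agent_list without defining it.
# Extend this list with more User-Agent strings as needed.
user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
]
PROXY_http = [
    '153.180.102.104:80',
    '195.208.131.189:56055',
]
PROXY_https = [
    '120.83.49.90:9000',
    '95.189.112.214:35508',
]
class MovieproDownloaderMiddleware(object):
    # Intercepts normal requests; the request argument is the intercepted request object
    def process_request(self, request, spider):
        # Give the intercepted requests as many different User-Agent identities as possible
        request.headers['User-Agent'] = random.choice(user_agent_list)
        # Set the proxy
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)  # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        return None
    # Intercepts responses; the response argument is the intercepted response
    def process_response(self, request, response, spider):
        return response
    # Intercepts requests that raised an exception
    def process_exception(self, request, exception, spider):
        # Fix up the failed request by switching to another proxy, then send it again
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)  # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port

        # Note: unless dont_filter is set on the request, the scheduler's
        # duplicate filter may drop the re-sent request
        return request  # Re-send the corrected request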
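
The scheme check used by the middleware above can be pulled out into a small function and tried outside Scrapy. The following is only a minimal sketch: `pick_proxy` is a hypothetical name, and it uses `urllib.parse.urlsplit` rather than `split(':')` to read the scheme. The proxy addresses are the ones listed above.

```python
import random
from urllib.parse import urlsplit

PROXY_http = ['153.180.102.104:80', '195.208.131.189:56055']
PROXY_https = ['120.83.49.90:9000', '95.189.112.214:35508']


def pick_proxy(url):
    """Return a random proxy URL matching the scheme of `url` (hypothetical helper)."""
    scheme = urlsplit(url).scheme.lower()
    if scheme == 'http':
        return 'http://' + random.choice(PROXY_http)
    # Everything else is treated as https, as in the middleware above
    return 'https://' + random.choice(PROXY_https)


# The returned value has the format written into request.meta['proxy']
print(pick_proxy('http://example.com/a.jpg'))
print(pick_proxy('https://example.com/a.jpg'))
```

Keeping the choice in one function means process_request and process_exception can share it instead of repeating the same if/else.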

2. Set a download delay for the crawler by adding the following to settings.py:

DOWNLOAD_DELAY = 3
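
Note that Scrapy does not wait exactly DOWNLOAD_DELAY seconds by default: with RANDOMIZE_DOWNLOAD_DELAY enabled (its default value), the wait is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. The sketch below only works out that window; `delay_window` is a hypothetical helper for illustration, not part of Scrapy.

```python
DOWNLOAD_DELAY = 3


def delay_window(delay, randomize=True):
    """Return the (min, max) wait in seconds between requests (hypothetical helper)."""
    if not randomize:
        # With RANDOMIZE_DOWNLOAD_DELAY = False the delay is fixed
        return (delay, delay)
    return (0.5 * delay, 1.5 * delay)


low, high = delay_window(DOWNLOAD_DELAY)
print(low, high)  # prints: 1.5 4.5
```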

3. How to create a sub-directory for the downloads, in the custom image download pipeline:

    def item_completed(self, results, item, info):
        # image_path is the list of paths of the images saved under the "full" directory, named by hash
        # image_path = ['full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg']
        image_path = [x["path"] for ok, x in results if ok]

        # Define the path used to save images by category
        # new_path is the path defined in settings plus the album title
        new_path = os.path.join(self.IMAGES_STORE, item["pic_title"])

        # Create the directory if it does not exist
        os.makedirs(new_path, exist_ok=True)
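
The directory step can be checked on its own, without running a crawl. This is a minimal sketch under a few assumptions: `ensure_album_dir` is a hypothetical helper, and a temporary directory stands in for IMAGES_STORE. It builds the path with `os.path.join` so that it works on both Windows and Linux.

```python
import os
import tempfile


def ensure_album_dir(images_store, title):
    """Create images_store/title if needed and return its path (hypothetical helper)."""
    new_path = os.path.join(images_store, title)
    # exist_ok=True makes repeated calls harmless; missing parents are created too
    os.makedirs(new_path, exist_ok=True)
    return new_path


# A throwaway directory stands in for IMAGES_STORE
with tempfile.TemporaryDirectory() as store:
    first = ensure_album_dir(store, "album_a")
    second = ensure_album_dir(store, "album_a")  # no error the second time
    print(os.path.isdir(first), first == second)  # prints: True True
```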

4. To use your own item fields for the image URLs and the download results, instead of Scrapy's default fields (image_urls and images), add the following to settings.py:

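IMAGES_URLS_FIELD = 'your_custom_image_urls_field'  # item field holding the image URLs to download
IMAGES_RESULT_FIELD = 'your_custom_image_result_field'  # item field that receives the download results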

5. How to rename the downloaded file and move it to the new path, in the custom image download pipeline:

Implementation (1)

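from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings
import scrapy
import os
import shutil

# The class name is a placeholder; IMAGES_STORE is assumed to be read from settings.py
class CustomImagesPipeline(ImagesPipeline):
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def item_completed(self, results, item, info):
        # image_path and new_path are built as in step 3
        image_path = [x["path"] for ok, x in results if ok]
        new_path = os.path.join(self.IMAGES_STORE, item["pic_title"])
        os.makedirs(new_path, exist_ok=True)
        # Move the file from the default download path to the target path
        # old_path is the original path, e.g. G:\Fa24\full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg
        old_path = os.path.join(self.IMAGES_STORE, image_path[0])
        # Drop "full/" from 'full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg' to get the hash-named file name
        pic_name = os.path.basename(image_path[0])
        # Move the file from the default path to the new path
        shutil.move(old_path, os.path.join(new_path, pic_name))
        # The hash name is far too long, so rename the file
        os.rename(os.path.join(new_path, pic_name), os.path.join(new_path, item["pic_name"]))
        # Pass the image path back to the item
        item["pic_url"] = os.path.join(new_path, item["pic_name"])
        # item["pic_url"] = os.path.join(new_path, pic_name)
        return item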

Note that at the end, item["pic_url"] is overwritten with the storage location of the file.
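
The move and the rename can also be done in a single call, since `shutil.move` accepts a full target path. The sketch below tries this outside Scrapy with a fake downloaded file; `move_and_rename` and the file names are hypothetical and only serve the illustration.

```python
import os
import shutil
import tempfile


def move_and_rename(images_store, rel_path, new_dir, new_name):
    """Move images_store/rel_path to new_dir/new_name and return the new path.

    Hypothetical helper; rel_path is a value such as 'full/<hash>.jpg'.
    """
    old_path = os.path.join(images_store, rel_path)
    os.makedirs(new_dir, exist_ok=True)
    target = os.path.join(new_dir, new_name)
    # Moving to a full file path moves and renames in one step
    shutil.move(old_path, target)
    return target


with tempfile.TemporaryDirectory() as store:
    # Fake a downloaded file under the default "full" directory
    os.makedirs(os.path.join(store, "full"))
    with open(os.path.join(store, "full", "5db315.jpg"), "wb") as f:
        f.write(b"fake image bytes")

    moved = move_and_rename(store, "full/5db315.jpg",
                            os.path.join(store, "album_a"), "cover.jpg")
    print(os.path.basename(moved))  # prints: cover.jpg
```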

Implementation (2): define your own file_path method in the pipeline. It also cleans illegal characters out of the image name, and shows how to give the image a new name.

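# Requires "import re" at the top of the pipeline file
    def get_media_requests(self, item, info):
        # Submit the image_urls field here, passing the image name along via meta
        image_url = item['image_urls']
        yield scrapy.Request(image_url, meta={'name': item['image_name']})

    def file_path(self, request, response=None, info=None):
        name = request.meta['name']  # Receive the image name passed via meta above
        # Strip characters that are illegal in Windows file names; without this step
        # you will get garbled names or failed downloads
        name = re.sub(r'[?\\*|"<>:/]', '', name)
        filename = name + '.jpg'  # Add the image file extension
        return filename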
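
The cleaning step is easy to test in isolation. The following is a minimal sketch, assuming the goal is to remove the characters Windows forbids in file names (\ / : * ? " < > |); `clean_filename` and the "unnamed" fallback are hypothetical additions, not part of the original pipeline.

```python
import re

# Characters Windows does not allow in file names: \ / : * ? " < > |
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')


def clean_filename(name, ext=".jpg"):
    """Strip illegal characters and append an extension (hypothetical helper)."""
    cleaned = _ILLEGAL.sub("", name).strip()
    # Fall back to a fixed name if nothing is left after cleaning
    return (cleaned or "unnamed") + ext


print(clean_filename('sun/set: "beach"?'))  # prints: sunset beach.jpg
```

Inside file_path, the same function could be applied to request.meta['name'] before returning the file name.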

Implementation (3): add a file_path method in the pipeline's .py file that sorts the files into folders automatically. For this you must also write get_media_requests.

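# Requires "from scrapy import Request" at the top of the pipeline file
    def file_path(self, request, response=None, info=None):
        # The item was attached to the request in get_media_requests below
        item = request.meta['item']
        title = item['name']
        # Use the last segment of the URL as the file name
        image_guid = request.url.split('/')[-1]
        # Save under full/<title>/ so each title gets its own folder
        filename = 'full/{0}/{1}'.format(title, image_guid)
        return filename

    def get_media_requests(self, item, info):
        """
        :param item: the item returned by spider.py
        :param info:
        :return:
        """
        for img_url in item['imgs_url']:
            referer = item['url']
            yield Request(img_url, meta={'item': item,
                                         'referer': referer})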
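
One thing to watch in file_path above: `request.url.split('/')[-1]` keeps any query string, so a URL ending in `001.jpg?size=large` would produce a file name containing `?`. A minimal sketch of a variant that drops the query string first is shown below; `build_image_path` is a hypothetical helper and the URL is made up.

```python
import posixpath
from urllib.parse import urlsplit


def build_image_path(title, url):
    """Return 'full/<title>/<file name>' for an image URL (hypothetical helper)."""
    # urlsplit().path excludes the query string and the fragment
    image_guid = posixpath.basename(urlsplit(url).path)
    return 'full/{0}/{1}'.format(title, image_guid)


print(build_image_path('album_a', 'https://example.com/pics/001.jpg?size=large'))
# prints: full/album_a/001.jpg
```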

6. Image hotlink protection

Specifically, create a dedicated field in items.py to store the referer:

# Referer address used to get past the anti-crawler (hotlink) check
referer = scrapy.Field()

Fill in this field in the spider file, then pass the referer along in the request headers in the pipeline's get_media_requests method.

    def get_media_requests(self, item, info):

        image_url = item["pic_url"]
        # Request headers, mainly to get past anti-crawler checks
        header = {
            "referer": item["referer"],
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
        }
        yield scrapy.Request(image_url, headers=header)
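
If the spider sometimes leaves the referer field empty, `item["referer"]` raises a KeyError in the method above. One possible way around this, sketched here under that assumption, is to build the headers in a helper that only adds the referer when it is present. `build_image_headers` is a hypothetical name, and a plain dict stands in for the item.

```python
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36")


def build_image_headers(item, user_agent=USER_AGENT):
    """Build request headers for an image download (hypothetical helper)."""
    headers = {"user-agent": user_agent}
    # Only send a referer when the spider actually filled the field
    referer = item.get("referer")
    if referer:
        headers["referer"] = referer
    return headers


print(build_image_headers({"referer": "https://example.com/gallery/1"})["referer"])
print("referer" in build_image_headers({}))  # prints: False
```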

7. How to set the root folder for downloaded images, so that the pipeline's storage directory is an absolute path to a sub-directory of the project. This is done by defining IMAGES_STORE in settings.py. Add the following code to settings.py:

import os
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
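
The point of going through `os.path.abspath` is that the result no longer depends on the directory the crawl is started from. The sketch below wraps the same three calls in a function so the behaviour can be checked without a project; `images_store_for` and the `myproject` path are hypothetical.

```python
import os


def images_store_for(settings_file, folder='images'):
    """Return the absolute path of `folder` next to `settings_file` (hypothetical helper)."""
    project_dir = os.path.abspath(os.path.dirname(settings_file))
    return os.path.join(project_dir, folder)


# In settings.py itself, settings_file would be __file__
store = images_store_for(os.path.join('myproject', 'settings.py'))
print(os.path.isabs(store), os.path.basename(store))  # prints: True images
```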

8. To output only error messages while testing, and write them to a specific file, add the following settings to settings.py:

LOG_LEVEL = 'ERROR'
# Store the log messages in the given file instead of printing them to the terminal
LOG_FILE = 'log.txt'
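
Scrapy's logging is built on Python's standard logging module, so the effect of these two settings can be illustrated with the standard library alone. The following is an analogy rather than Scrapy's actual setup code: `make_error_logger` and the logger name are hypothetical, and a temporary directory is used for the log file.

```python
import logging
import os
import tempfile


def make_error_logger(log_file):
    """Return a logger that writes only ERROR and above to log_file (hypothetical helper)."""
    logger = logging.getLogger("demo_crawler")  # hypothetical logger name
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep messages off the terminal
    handler = logging.FileHandler(log_file, encoding="utf-8")
    logger.addHandler(handler)
    return logger, handler


log_path = os.path.join(tempfile.mkdtemp(), "log.txt")
logger, handler = make_error_logger(log_path)
logger.info("page fetched")            # below ERROR, so it is dropped
logger.error("image download failed")  # written to the file
handler.close()
logger.removeHandler(handler)

with open(log_path, encoding="utf-8") as f:
    print(f.read().strip())  # prints: image download failed
```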

Origin blog.csdn.net/fan13938409755/article/details/104819176