1. Disguise the browser headers against anti-crawler checks, and set up proxy IPs
You can either set USER_AGENT in settings.py or set the headers in a custom ImagesPipeline. The image pipeline behaves much like a middleware: it intercepts the requests it sends and the responses it receives, so both can be modified further.
For example, add this to the custom pipeline:
def get_media_requests(self, item, info):
    image_url = item["pic_url"]
    # Request headers, mainly to get past anti-crawler checks
    header = {
        "referer": item["referer"],
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    }
    yield scrapy.Request(image_url, headers=header)
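As for the settings.py alternative mentioned above, a one-line sketch (the user-agent string is only an example; any real browser UA works):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'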
For IP proxies the following code is provided; a downloader middleware is used here.
import random

PROXY_http = [
    '153.180.102.104:80',
    '195.208.131.189:56055',
]
PROXY_https = [
    '120.83.49.90:9000',
    '95.189.112.214:35508',
]

class MovieproDownloaderMiddleware(object):
    # Intercept normal requests; the request parameter is the intercepted request object
    def process_request(self, request, spider):
        # Goal: give the intercepted requests as many different User-Agent identities as possible
        # (user_agent_list is a list of User-Agent strings defined elsewhere in this file)
        request.headers['User-Agent'] = random.choice(user_agent_list)
        # Proxy handling
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)    # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        return None

    # Intercept responses; the response parameter is the intercepted response
    def process_response(self, request, response, spider):
        return response

    # Intercept requests that raised an exception
    def process_exception(self, request, exception, spider):
        # Fix the failed request, then send it again
        # Proxy handling
        if request.url.split(':')[0] == 'http':
            request.meta['proxy'] = 'http://' + random.choice(PROXY_http)    # http://ip:port
        else:
            request.meta['proxy'] = 'https://' + random.choice(PROXY_https)  # https://ip:port
        return request  # resend the corrected request
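The middleware only takes effect once it is registered. A minimal settings.py sketch, assuming the project is named moviepro and the class lives in moviepro/middlewares.py (adjust the dotted path and priority to your own project):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'moviepro.middlewares.MovieproDownloaderMiddleware': 543,
}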
2. Set a crawl delay for the spider; add the following to settings.py:
DOWNLOAD_DELAY = 3
3. How to create sub-directories for the downloads: in the custom image download pipeline file
def item_completed(self, results, item, info):
    # image_path is the list of paths of images saved under the full/ directory, named by hash
    # image_path = ['full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg']
    image_path = [x["path"] for ok, x in results if ok]
    # Define the path for classified storage:
    # new_path is the directory defined in settings plus the picture-set title
    new_path = '%s\\%s' % (self.IMAGES_STORE, item["pic_title"])
    # Create the directory if it does not exist
    if not os.path.exists(new_path):
        os.mkdir(new_path)
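The snippet above refers to self.IMAGES_STORE, which is not a built-in attribute of ImagesPipeline. A minimal sketch of how it could be pulled from the project settings (the class name is an assumption):

from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings

class MyImagesPipeline(ImagesPipeline):
    # read the base storage directory defined as IMAGES_STORE in settings.py
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')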
4. Because new item fields were defined to store the download URL and result, instead of Scrapy's default image_urls / images, add the following to settings.py:
IMAGES_URLS_FIELD = 'your_custom_image_url_field'
IMAGES_RESULT_FIELD = 'your_custom_image_result_field'
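For instance, with the item fields used throughout this post (pic_url as the download URL), the items.py definition and the matching settings might look like this sketch (the item class name PicItem and the field names are assumptions, not requirements; note that when get_media_requests and item_completed are overridden as in the examples here, the pipeline uses those overrides and these two settings mainly document which fields you chose):

# items.py
import scrapy

class PicItem(scrapy.Item):
    pic_title = scrapy.Field()  # picture-set title, used as the sub-directory name
    pic_name = scrapy.Field()   # final file name
    pic_url = scrapy.Field()    # image download URL
    referer = scrapy.Field()    # anti-hotlinking referer (see step 6)

# settings.py
IMAGES_URLS_FIELD = 'pic_url'
IMAGES_RESULT_FIELD = 'pic_name'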
5. How to give the downloaded file a new name and move it to the new path: in the custom image download pipeline file.
Implementation (1) — this continues the item_completed method from step 3, so image_path and new_path are the variables defined there:
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings
import scrapy
import os
import shutil

def item_completed(self, results, item, info):
    # Move the file from the default download path to the target path
    # self.IMAGES_STORE + "\\" + image_path[0] is the original path, e.g. G:\Fa24\full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg
    # image_path[0][image_path[0].find("full\\")+6:] strips the "full/" prefix from
    # 'full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg', leaving the hash-named file name
    pic_name = image_path[0][image_path[0].find("full\\") + 6:]
    old_path = self.IMAGES_STORE + "\\" + image_path[0]
    # Move the file from the default path to the new path
    shutil.move(old_path, new_path + "\\" + pic_name)
    # The hash name is far too long, so rename the file
    os.rename(new_path + "\\" + pic_name, new_path + "\\" + item["pic_name"])
    # Pass the image path back to the item
    item["pic_url"] = new_path + "\\" + item["pic_name"]
    # item["pic_url"] = new_path + "\\" + image_path[0][image_path[0].find("full\\")+6:]
Note that the image_url finally written back to the item is the storage location of the file.
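Since the fragments in steps 3 and 5 belong to the same method, here is how they might fit together as one item_completed, purely as an illustrative sketch (imports as in Implementation (1); field names pic_title / pic_name / pic_url assumed as above; os.path.join replaces the hard-coded backslashes and the move and rename are combined into one shutil.move):

def item_completed(self, results, item, info):
    # paths of the successfully downloaded images, e.g. ['full/5db3...ef.jpg']
    image_path = [x["path"] for ok, x in results if ok]
    # target sub-directory: base store dir + picture-set title
    new_path = os.path.join(self.IMAGES_STORE, item["pic_title"])
    if not os.path.exists(new_path):
        os.mkdir(new_path)
    # move the hash-named file out of full/ and give it its final name in one step
    old_path = os.path.join(self.IMAGES_STORE, image_path[0])
    target = os.path.join(new_path, item["pic_name"])
    shutil.move(old_path, target)
    # write the final storage location back into the item
    item["pic_url"] = target
    return item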
Implementation (2): define your own file_path method in the pipeline. It includes a step that strips illegal characters from the image name (without it you get garbled names or failed downloads), and this is also where the image gets its new name:
import re  # needed for the name sanitising below

def get_media_requests(self, item, info):
    # The image name has to be handed on from here; pass it via meta
    image_url = item['image_urls']
    yield scrapy.Request(image_url, meta={'name': item['image_name']})

def file_path(self, request, response=None, info=None):
    name = request.meta['name']  # receive the image name passed via meta above
    # Strip characters that are illegal in Windows file names; skipping this step
    # leads to garbled names or downloads that fail outright
    name = re.sub(r'[?\\*|"<>:/]', '', name)
    filename = name + '.jpg'  # add the image file extension
    return filename
Implementation (3): a file_path method that automatically sorts files into sub-directories; with this approach you must also write get_media_requests:
from scrapy import Request

def file_path(self, request, response=None, info=None):
    item = request.meta['item']
    title = item['name']
    image_guid = request.url.split('/')[-1]
    filename = 'full/{0}/{1}'.format(title, image_guid)
    return filename

def get_media_requests(self, item, info):
    """
    :param item: the item returned from spider.py
    :param info:
    :return:
    """
    for img_url in item['imgs_url']:
        referer = item['url']
        yield Request(img_url, meta={'item': item,
                                     'referer': referer})
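Whichever implementation you choose, the custom pipeline also has to be enabled in settings.py. A minimal sketch, assuming the class is called MyImagesPipeline and lives in your project's pipelines.py (adjust the dotted path and priority to your project):

# settings.py
ITEM_PIPELINES = {
    'yourproject.pipelines.MyImagesPipeline': 300,
}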
6. Image anti-hotlinking (Referer)
Specifically, create a dedicated referer field in items.py to store it:
# Referer used to defeat the anti-hotlinking redirect
referer = scrapy.Field()
Then fill this field in the spider (see the sketch after the pipeline code below), and finally submit the referer field in the request headers in the pipeline's get_media_requests method:
def get_media_requests(self, item, info):
    image_url = item["pic_url"]
    # Request headers, mainly to get past anti-crawler checks
    header = {
        "referer": item["referer"],
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
    }
    yield scrapy.Request(image_url, headers=header)
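A minimal sketch of the spider side, showing how the referer field might be filled (it uses the PicItem sketched in step 4; the CSS selector is of course site-specific and only an example):

def parse(self, response):
    item = PicItem()
    item["pic_title"] = response.css("title::text").get()
    item["pic_url"] = response.css("img::attr(src)").get()
    # the page that links to the image is used as the Referer
    item["referer"] = response.url
    yield item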
7. How to set the base folder where images are downloaded: change the pipeline's default storage directory to an absolute path pointing to a sub-directory of the project. This is done by defining IMAGES_STORE in settings.py; add the following to settings.py:
import os
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
8. To output only error-level log messages during testing, and write them to a specific file, add the following to settings.py:
LOG_LEVEL = 'ERROR'
# Write log messages to the given file instead of the terminal
LOG_FILE = 'log.txt'