When using scrapy crawling pictures today and found this error:
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: //images2015.cnblogs.com/news_topic/20161020185521154-1185360701.png
Obviously, scrapy in the pipeline in the process of downloading the picture, does not recognize the target url.
web to //
target the link at the beginning of "relative agreement", which is in front eliminating the http: or https: words, the benefits of doing so is a file browser can be automatically loaded on CDN hosting your site according to the protocol used .
Crawling linked files during expression should be changed to the correct protocol relative url by positive goals
before fixing:
front_image = response.meta.get("front_image_url", "")
article_item["front_image_url"] = [front_image] # pipeline下载图片一定要传list
Modified:
import re
front_image = response.meta.get("front_image_url", "")
if re.match("^(//).*", front_image):
front_image = "https:" + front_image
article_item["front_image_url"] = [front_image] # pipeline下载图片一定要传list