Preface
Just grind it out.
Example
Process and technical point analysis
- Create a new CHAHUA project with China Illustration Network (chahua.org) as the target website; the spider is named chahua, and start.py is the execution file
- settings.py (set ROBOTSTXT_OBEY to False, add request headers, enable the pipeline, set IMAGES_STORE)
- chahua.py
- pipeline.py
- items.py
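The settings.py items in the list above could look like the following sketch; the User-Agent string and the images folder name are illustrative placeholders, not taken from the original project:

```python
# settings.py (sketch): disable robots.txt checking, set request headers,
# enable the built-in images pipeline, and point IMAGES_STORE at a folder.
import os

ROBOTSTXT_OBEY = False  # the "protocol False" item in the notes

DEFAULT_REQUEST_HEADERS = {
    # Illustrative browser-like header
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Save downloaded images under <project>/images
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), "images")
```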
Key theory
1. Rule and LinkExtractor are mostly used for whole-site crawling
Rule defines the rules for extracting links.
follow is a Boolean that specifies whether links extracted from a response by this rule should themselves be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
When follow is True, the crawler takes every URL matching the rule out of the response and crawls it again; if the resulting responses contain more matching URLs, those are crawled as well, looping until no URL matching the rule remains.
When follow is False, the crawler only extracts and requests the matching URLs from the responses of start_urls.
2. LinkExtractor used alone
It can also be used on its own to extract complete, absolute URLs from a response.
Code example
chahua.py
1. Import
from scrapy.spiders.crawl import CrawlSpider,Rule
from scrapy.linkextractors import LinkExtractor
2. Rule definition
start_urls = ['http://chahua.org/']
rules = (
    # Rule(LinkExtractor(allow=r"http://www.chahua.org/"), follow=False),
    # Note: "." and "?" are regex metacharacters and need escaping; left
    # unescaped, the pattern cannot match the real detail URL, so the
    # callback never fires.
    Rule(LinkExtractor(allow=r"http://www\.chahua\.org/drawn/detail\.php\?id=554887&hid=3"),
         follow=False, callback="parse_detail"),
)
I ran into a problem here: the rule for this specific page has a callback parse function, but it does not return the desired result, as shown in the figure. I couldn't solve it for a long time; CrawlSpider simply wasn't finding the content the rule specified. So I turned around and created a new ZCOOL project, with zcool.com.cn as the target website and a spider named zcool, for the next step of learning image downloading, rather than stay hanged on the one tree that is the illustration site.
3.zcool.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import ImagedownloadItem

class ZcoolSpider(CrawlSpider):
    name = 'zcool'
    allowed_domains = ['zcool.com.cn']
    # start_urls = ['http://zcool.com.cn/']
    start_urls = ['https://www.zcool.com.cn/discover/0!0!0!0!0!!!!2!0!1']

    rules = (
        # pagination URLs
        Rule(LinkExtractor(allow=r'.+0!0!0!0!0!!!!2!0!\d+'), follow=True),
        # detail page URLs
        Rule(LinkExtractor(allow=r".+/work/.+html"), follow=False, callback="parse_detail"),
    )

    def parse_detail(self, response):
        image_urls = response.xpath("//div[@class='work-show-box']//img/@src").getall()
        title_list = response.xpath("//div[@class='details-contitle-box']/h2/text()").getall()
        title = "".join(title_list).strip()
        print(title)
        item = ImagedownloadItem(title=title, image_urls=image_urls)
        yield item
The saved item must have the fields image_urls and images (these names are required by ImagesPipeline).
Write items.py first, then import it in the spider with from ..items import ImagedownloadItem.
import scrapy

class ImagedownloadItem(scrapy.Item):
    title = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
Note that the class name here is ImagedownloadItem.
4. After finishing zcool.py comes the next difficulty: ITEM_PIPELINES
Enable the required pipeline in settings.py so the scraped data gets saved:
ITEM_PIPELINES = {
    # 'imagedownload.pipelines.ImagedownloadPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
This is Scrapy's built-in pipeline dedicated to image downloading.
Then set the image download path. The next problem, IMAGES_STORE, is getting the path of this project's images folder through the os module: first take the path of settings.py itself, then go up to the imagedownload folder, and finally build the path to images.
os.path.dirname(__file__) returns the directory containing the current file.
os.path.dirname(os.path.dirname(__file__)) returns the parent of that directory.
Then splice it with images using os.path.join:
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
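To make the two dirname calls concrete, here is the same logic run on a made-up settings.py path (the /home/user prefix is purely illustrative):

```python
import os

# Hypothetical location of settings.py inside a project named imagedownload
settings_file = "/home/user/imagedownload/imagedownload/settings.py"

pkg_dir = os.path.dirname(settings_file)          # .../imagedownload/imagedownload
project_dir = os.path.dirname(pkg_dir)            # .../imagedownload
images_dir = os.path.join(project_dir, "images")  # .../imagedownload/images
print(images_dir)
```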
pipeline.py looks like this:
class ImagedownloadPipeline(object):
    def process_item(self, item, spider):
        return item
Final Results
Questions and summary
1. Remember to add the domain to allowed_domains
2. Install the Pillow library (ImagesPipeline depends on it)
3. If an error follows the "Overridden settings" line in the log, it is definitely the settings file or the spider that is wrong; whoever is wrong carries the blame o(╥﹏╥)o
4. This image-download exercise took about five or six days and wasted a lot of time: at first I got no results because I didn't understand pipelines. After rewriting the pipeline functions I can now see results; the knowledge was corrected and the sample images downloaded successfully. I still haven't mastered how the overridden function defines the naming of saved files o(╥﹏╥)o
This problem needs to be solved before we move on to studying Selenium with Scrapy.