I. Scrapy log levels
- When you run a spider with scrapy crawl spiderName, Scrapy prints its log output to the terminal.
- Log message categories (from most to least severe):
ERROR: errors
WARNING: warnings
INFO: general information
DEBUG: debugging information
- Controlling which log messages are output:
In the settings.py configuration file, add
LOG_LEVEL = 'the desired log level'. Adding LOG_FILE = 'log.txt' writes the log to the specified file instead of the terminal.
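Scrapy's logging is built on Python's standard logging module, so the filtering effect of LOG_LEVEL can be seen in a standalone sketch (the logger name and handler here are arbitrary, not Scrapy internals):

```python
import logging

# Capture emitted messages in memory so the filtering is visible
records = []

class ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record.getMessage())

logger = logging.getLogger('demo')
logger.addHandler(ListHandler())
logger.setLevel(logging.ERROR)  # analogous to LOG_LEVEL = 'ERROR'

logger.debug('debug message')   # below ERROR: filtered out
logger.info('info message')     # below ERROR: filtered out
logger.error('error message')   # at ERROR: emitted

print(records)  # only the ERROR record survives
```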
II. Passing parameters with requests
- In some cases the data we crawl is not all on the same page. For example, when crawling a movie website, the movie name and score are on one page, while the other details of the film are on a secondary detail page. In that case we need to pass parameters along with the request.
- Case: crawl the movie site www.id97.com. The movie name, score, and type are on the list page; the actors, release time, and length are on each movie's detail page.
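The mechanism itself is simple: the dict passed as meta on a Request reappears as response.meta in the callback. A Scrapy-free sketch of the same hand-off pattern (FakeRequest/FakeResponse are stand-ins for illustration, not Scrapy classes):

```python
class FakeRequest:
    """Stand-in for scrapy.Request: carries url, callback, and meta."""
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    """Stand-in for a Scrapy response: exposes the request's meta."""
    def __init__(self, request):
        self.url = request.url
        self.meta = request.meta

def parse(item):
    # First callback fills part of the item, then hands it on via meta
    item['name'] = 'some movie'
    return FakeRequest('http://example.com/detail', parse_detail, meta={'item': item})

def parse_detail(response):
    # Second callback retrieves the partially filled item and completes it
    item = response.meta['item']
    item['actor'] = 'some actor'
    return item

req = parse({})
result = req.callback(FakeResponse(req))
print(result)  # {'name': 'some movie', 'actor': 'some actor'}
```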
Spider file:
```python
# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['www.id97.com']
    start_urls = ['http://www.id97.com/']

    def parse(self, response):
        div_list = response.xpath('//div[@class="col-xs-1-5 movie-item"]')
        for div in div_list:
            item = MovieproItem()
            item['name'] = div.xpath('.//h1/a/text()').extract_first()
            item['score'] = div.xpath('.//h1/em/text()').extract_first()
            # xpath('string(.)') extracts the text of all child nodes of the current node
            item['kind'] = div.xpath('.//div[@class="otherinfo"]').xpath('string(.)').extract_first()
            item['detail_url'] = div.xpath('./div/a/@href').extract_first()
            # Request the detail page; pass the item to the callback through meta
            yield scrapy.Request(url=item['detail_url'], callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        # Retrieve the item from response.meta
        item = response.meta['item']
        item['actor'] = response.xpath('//div[@class="row"]//table/tr[1]/a/text()').extract_first()
        item['time'] = response.xpath('//div[@class="row"]//table/tr[7]/td[2]/text()').extract_first()
        item['long'] = response.xpath('//div[@class="row"]//table/tr[8]/td[2]/text()').extract_first()
        # Submit the item to the pipeline
        yield item
```
Items file:
```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here
    name = scrapy.Field()
    score = scrapy.Field()
    time = scrapy.Field()
    long = scrapy.Field()
    actor = scrapy.Field()
    kind = scrapy.Field()
    detail_url = scrapy.Field()
```
Pipeline file:
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class MovieproPipeline(object):
    def __init__(self):
        self.fp = open('data.txt', 'w')

    def process_item(self, item, spider):
        dic = dict(item)
        print(dic)
        json.dump(dic, self.fp, ensure_ascii=False)
        return item

    def close_spider(self, spider):
        self.fp.close()
```
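One detail worth noting: ensure_ascii=False keeps non-ASCII text (e.g. Chinese movie names) readable in data.txt instead of escaping it to \uXXXX sequences. A quick standalone check (the sample item is illustrative):

```python
import json

item = {'name': '霸王别姬', 'score': '9.6'}

escaped = json.dumps(item)                       # default: non-ASCII is \u-escaped
readable = json.dumps(item, ensure_ascii=False)  # characters kept as-is

print(escaped)
print(readable)
```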
III. How to improve Scrapy's crawling efficiency
Increase concurrency:
By default Scrapy performs 16 concurrent requests, which can be increased as appropriate. In the settings.py configuration file, set
CONCURRENT_REQUESTS = 100 to raise the concurrency to 100.
Reduce the log level:
Running Scrapy produces a large amount of log output. To reduce CPU usage, restrict the log output to INFO or ERROR by writing in the configuration file: LOG_LEVEL = 'INFO'
Disable cookies:
Unless cookies are genuinely needed, disable them while crawling to reduce CPU usage and improve crawl efficiency. Write in the settings file: COOKIES_ENABLED = False
Disable retries:
Retrying failed HTTP requests slows crawling down, so if retries are not needed, disable them by writing in the configuration file:
RETRY_ENABLED = False
Reduce the download timeout:
If some links are very slow to crawl, lowering the download timeout lets stuck requests be abandoned quickly, improving efficiency. Write in the configuration file: DOWNLOAD_TIMEOUT = 10 (a 10-second timeout)
Test case: crawl pictures from the Xiaohua site www.521609.com
Spider file:
```python
# -*- coding: utf-8 -*-
import scrapy
from xiaohua.items import XiaohuaItem


class XiahuaSpider(scrapy.Spider):
    name = 'xiaohua'
    allowed_domains = ['www.521609.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    pageNum = 1
    url = 'http://www.521609.com/daxuemeinv/list8%d.html'

    def parse(self, response):
        li_list = response.xpath('//div[@class="index_img list_center"]/ul/li')
        for li in li_list:
            school = li.xpath('./a/img/@alt').extract_first()
            img_url = li.xpath('./a/img/@src').extract_first()
            item = XiaohuaItem()
            item['school'] = school
            item['img_url'] = 'http://www.521609.com' + img_url
            yield item

        if self.pageNum < 10:
            self.pageNum += 1
            url = format(self.url % self.pageNum)
            # print(url)
            yield scrapy.Request(url=url, callback=self.parse)
```
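The paging logic above is plain %-formatting of the URL template (the surrounding format() call is redundant but harmless). The URLs the spider would generate can be checked standalone:

```python
url_template = 'http://www.521609.com/daxuemeinv/list8%d.html'

# Pages 2..10, as generated by the spider's pageNum counter
urls = [url_template % n for n in range(2, 11)]

print(urls[0])   # http://www.521609.com/daxuemeinv/list82.html
print(len(urls)) # 9
```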
Items file:
```python
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class XiaohuaItem(scrapy.Item):
    # define the fields for your item here
    school = scrapy.Field()
    img_url = scrapy.Field()
```
pipelines.py
```python
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import os
import urllib.request


class XiaohuaPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        print('spider started')
        self.fp = open('./xiaohua.txt', 'w')

    def download_img(self, item):
        url = item['img_url']
        fileName = item['school'] + '.jpg'
        if not os.path.exists('./xiaohualib'):
            os.mkdir('./xiaohualib')
        filepath = os.path.join('./xiaohualib', fileName)
        urllib.request.urlretrieve(url, filepath)
        print(fileName + ' download success')

    def process_item(self, item, spider):
        obj = dict(item)
        json_str = json.dumps(obj, ensure_ascii=False)
        self.fp.write(json_str + '\n')
        # download the picture
        self.download_img(item)
        return item

    def close_spider(self, spider):
        print('spider finished')
        self.fp.close()
```
settings.py configuration file:
```python
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100
COOKIES_ENABLED = False
LOG_LEVEL = 'ERROR'
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 3

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
DOWNLOAD_DELAY = 3
```