Recursive crawling: scraping multi-page data
Requirement
Crawl the job data (titles and salaries) returned by a keyword search on the BOSS Zhipin site, across all result pages, and store it persistently.
Requirement analysis
Each result page has its own url, so the scrapy project must initiate a request for the url of every page number and then parse the needed content out of each response with the corresponding parsing method.
Implementation options
1. Store the url of every page in the crawler file's start_urls list. (Not recommended)
2. Manually initiate the requests with scrapy.Request. (Recommended)
Code display
```python
import scrapy
from bossPro.items import BossproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = [
        'https://www.zhipin.com/job_detail/?query=python%E5%BC%80%E5%8F%91&city=101010100&industry=&position='
    ]
    # generic url template (immutable)
    url = 'https://www.zhipin.com/c101010100/?query=python&page=%d'
    page = 2

    def parse(self, response):
        print(f'Crawling page {self.page}')
        li_list = response.xpath('//*[@id="main"]/div/div[3]/ul/li | //*[@id="main"]/div/div[2]/ul/li')
        for li in li_list:
            job_title = li.xpath('.//div[@class="info-primary"]/h3/a/div[1]/text()').extract_first()
            job_salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()').extract_first()
            item = BossproItem()
            item['job_title'] = job_title
            item['job_salary'] = job_salary
            yield item  # submit the item to the pipeline for persistent storage

        if self.page <= 5:
            # to crawl the data on another page, send a request manually
            new_url = format(self.url % self.page)
            self.page += 1
            # a manual request must be yielded; callback specifies the parsing method
            # recursive crawling: the response for new_url is handed back to parse
            yield scrapy.Request(new_url, callback=self.parse)
```
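The pagination logic above hinges on Python `%`-style formatting of the url template. A minimal standalone sketch of that expansion (outside Scrapy; the template and page range are illustrative):

```python
# Sketch of the url-template expansion used by the spider above.
url_template = 'https://www.zhipin.com/c101010100/?query=python&page=%d'

def page_urls(template, first, last):
    """Expand the template for each page number in [first, last]."""
    return [template % page for page in range(first, last + 1)]

urls = page_urls(url_template, 2, 5)
print(urls[0])    # url for page 2
print(len(urls))  # 4 urls: pages 2, 3, 4, 5
```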
Full-site data crawling
- Full-site data crawling is achieved with manually initiated requests.
- Depth crawling is also achieved with manually initiated requests.
- Manual requests:
  - yield scrapy.Request(url, callback)
  - yield scrapy.FormRequest(url, formdata, callback)
Crawling detail pages (following links)
On top of the requirement above, we also want the details of each job posting. That means first extracting each posting's detail-page url, then requesting that url and parsing the details out of it.
Code display
```python
# -*- coding: utf-8 -*-
import scrapy
from bossDeepPro.items import BossdeepproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = [
        'https://www.zhipin.com/job_detail/?query=python%E5%BC%80%E5%8F%91&city=101010100&industry=&position='
    ]
    # generic url template (immutable)
    url = 'https://www.zhipin.com/c101010100/?query=python&page=%d'
    page = 2

    def parse(self, response):
        print('Crawling page {}'.format(self.page))
        li_list = response.xpath('//*[@id="main"]/div/div[3]/ul/li | //*[@id="main"]/div/div[2]/ul/li')
        for li in li_list:
            job_title = li.xpath('.//div[@class="info-primary"]/h3/a/div[1]/text()').extract_first()
            salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()').extract_first()
            # instantiate the item object: it must be shared by parse and parse_detail
            item = BossdeepproItem()
            item['job_title'] = job_title
            item['salary'] = salary
            detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href').extract_first()
            # initiate a manual request for the detail page, passing the item via meta
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

        if self.page <= 5:
            # manually send requests for the other pages
            new_url = format(self.url % self.page)
            print(new_url)
            self.page += 1
            # manual request; callback handles the data parsing
            yield scrapy.Request(url=new_url, callback=self.parse)

    # parse the job description from the detail page
    def parse_detail(self, response):
        item = response.meta['item']
        job_desc = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()').extract()
        item['job_desc'] = ''.join(job_desc)
        yield item
```
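In `parse_detail`, the job description is scattered across many text nodes, so the `//text()` XPath returns a list of fragments that must be joined into one string. A small illustration of that step (the fragment list below is invented):

```python
# extract() on a `//text()` XPath returns a list of text fragments;
# these fragments are placeholder values for illustration only.
job_desc_fragments = ['Responsibilities:', ' build crawlers;', ' maintain pipelines.']

# The spider joins the fragments into a single string before storing them.
job_desc = ''.join(job_desc_fragments)
print(job_desc)
```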
Depth crawling
- Depth crawling is implemented with manual requests.
- Request parameter passing (meta): used for persistent storage when data parsed in different callbacks must be stored in the same item object; the item is passed along with the request.
  - Usage scenario: the data scrapy has to crawl is not all on one page.
  - How to pass: wrap the data in the meta dictionary and hand meta to the request: yield scrapy.Request(url, callback, meta)
  - How to receive: inside the specified callback, from the response: item = response.meta['item']
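How meta carries an item from one callback to the next can be sketched without running Scrapy, using plain classes to stand in for Request and Response. `FakeRequest` and `FakeResponse` are illustrative stand-ins, not real Scrapy classes:

```python
# Plain-Python sketch of Scrapy's request meta passing.
class FakeRequest:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

class FakeResponse:
    def __init__(self, request):
        # Scrapy copies the request's meta onto the response it produces.
        self.meta = request.meta

def parse(url):
    item = {'job_title': 'python developer'}  # partially filled in parse
    # pass the item to the detail callback through meta
    return FakeRequest(url, callback=parse_detail, meta={'item': item})

def parse_detail(response):
    item = response.meta['item']           # received in the callback
    item['job_desc'] = 'description text'  # completed in parse_detail
    return item

request = parse('https://example.com/detail/1')
item = request.callback(FakeResponse(request))
print(item)  # item now holds fields from both callbacks
```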
The five core components and their workflow
Engine (Scrapy)
Handles the data flow between all the other components and triggers events (the core of the framework).
Scheduler (Scheduler)
Receives the requests sent over by the engine, pushes them into a queue, and returns them when the engine asks again. Think of it as a priority queue of urls (of the pages or links to be crawled): it decides which url to crawl next, and it removes duplicate urls.
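The scheduler's queue-plus-deduplication behaviour can be sketched with a deque and a set. This is a deliberate simplification: the real scheduler works on request fingerprints and supports configurable queue types.

```python
from collections import deque

class ToyScheduler:
    """Simplified stand-in for Scrapy's scheduler: queues urls, drops duplicates."""
    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url in self.seen:   # duplicate removal
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        return self.queue.popleft() if self.queue else None

scheduler = ToyScheduler()
scheduler.enqueue('https://example.com/page1')
scheduler.enqueue('https://example.com/page2')
scheduler.enqueue('https://example.com/page1')  # duplicate, ignored
print(scheduler.next_url())  # first queued url comes out first
```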
Downloader (Downloader)
Downloads web page content and hands it back to the spiders (Scrapy's downloader is built on twisted, an efficient asynchronous model).
Spiders (Spiders)
The spiders do the main crawling work: they extract the required information, the so-called entities (Items), from specific web pages. Users can also extract links from them and let Scrapy continue on to the next page.
Item pipeline (Pipeline)
Responsible for processing the entities the spiders extract from web pages; its main jobs are persisting entities, validating them, and discarding unneeded data. When a page has been parsed by a spider, its items are sent to the pipeline and processed through a few specific steps in order.
POST requests
- Question: in the code so far we have never manually sent a request for the start urls stored in start_urls, yet requests for them were indeed sent. How is that achieved?
- Answer: the spider class in the crawler file inherits the start_requests(self) method from the Spider parent class, and that method initiates a request for every url in the start_urls list:
```python
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(url=u, callback=self.parse)
```
Note: by default this method sends a GET request for each start url. To initiate a POST request instead, override the method in the subclass.
- Method: override start_requests so that it initiates a POST request:
```python
def start_requests(self):
    # url to POST to
    post_url = 'http://fanyi.baidu.com/sug'
    # POST request parameters
    formdata = {
        'kw': 'wolf',
    }
    # send the POST request
    yield scrapy.FormRequest(url=post_url, formdata=formdata, callback=self.parse)
```
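Under the hood, FormRequest sends the formdata dict as a url-encoded request body. What that encoding looks like can be shown with the standard library alone (no network access involved; the dict mirrors the one above):

```python
from urllib.parse import urlencode

# The same formdata dict the spider passes to scrapy.FormRequest.
formdata = {'kw': 'wolf'}

# FormRequest url-encodes the dict into the POST body (sketch of the idea).
body = urlencode(formdata)
print(body)  # kw=wolf
```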
How to improve scrapy's crawling efficiency
- Increase concurrency: by default scrapy opens 16 concurrent requests, which can be raised as appropriate. In the settings configuration file, set CONCURRENT_REQUESTS = 100 to raise the concurrency to 100.
- Reduce the log level: running scrapy produces a lot of log output; to reduce CPU usage, restrict the log output to INFO or ERROR. In the configuration file: LOG_LEVEL = 'INFO'
- Disable cookies: if cookies are not actually needed, disable them while crawling to reduce CPU usage and improve efficiency. In the configuration file: COOKIES_ENABLED = False
- Disable retries: re-sending failed HTTP requests (retrying) slows crawling down, so retries can be disabled. In the configuration file: RETRY_ENABLED = False
- Reduce the download timeout: if a link is very slow, lowering the download timeout lets stuck links be abandoned quickly, improving efficiency. In the configuration file: DOWNLOAD_TIMEOUT = 10 sets the timeout to 10s.
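Collected in one place, the tweaks above would look like this in settings.py. The values are the examples from the text, not recommendations for every project:

```python
# settings.py fragment combining the efficiency tweaks described above
CONCURRENT_REQUESTS = 100    # raise concurrency from the default of 16
LOG_LEVEL = 'ERROR'          # cut log output to reduce CPU usage
COOKIES_ENABLED = False      # skip cookie handling when not needed
RETRY_ENABLED = False        # do not retry failed requests
DOWNLOAD_TIMEOUT = 10        # abandon slow links after 10 seconds
```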
Scrapy log levels
- When a program is run in the terminal with scrapy crawl spiderFileName, scrapy prints out log information.
- Log levels: ERROR: general errors; WARNING: warnings; INFO: general information; DEBUG: debug information.
- To restrict which log messages are output, add LOG_LEVEL = '<level>' to the settings.py configuration file. LOG_FILE = 'log.txt' writes the log information to the specified file instead of the terminal.