Scrapy framework: recursive parsing and POST requests

Recursive crawling to parse multi-page data

Requirement

  Crawl the job title and salary data for a search keyword from all pages of the xx直聘 (zhipin.com) site and store the results persistently.

Requirement analysis

  Each page has its own url, so the Scrapy project needs to issue a request for each page's url in turn and then parse the desired content out of each response with the corresponding parsing method.

Implementation

  1. Store the url of every page in the spider file's start_urls list. (Not recommended)

  2. Issue each request manually with scrapy.Request.

 

Code example

import scrapy
from bossPro.items import BossproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python%E5%BC%80%E5%8F%91&city=101010100&industry=&position=']

    # generic url template (the fixed part of the page url); the page number is filled in via %d
    url = 'https://www.zhipin.com/c101010100/?query=python开发&page=%d'
    page = 2

    def parse(self, response):
        print(f'正在爬取第{self.page}页的数据')  # "crawling the data of page {self.page}"
        li_list = response.xpath('//*[@id="main"]/div/div[3]/ul/li | //*[@id="main"]/div/div[2]/ul/li')
        for li in li_list:
            job_title = li.xpath('.//div[@class="info-primary"]/h3/a/div[1]/text()').extract_first()
            job_salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()').extract_first()

            item = BossproItem()
            item['job_title'] = job_title
            item['job_salary'] = job_salary
            yield item  # submit the item to the pipeline for persistent storage

        if self.page <= 5:
            # to crawl the data of the other pages, send the requests manually
            new_url = self.url % self.page
            self.page += 1
            # a manual request must be yielded; callback names the function that parses the response
            # recursive crawl: the data returned for new_url is handed to parse again
            yield scrapy.Request(new_url, callback=self.parse)
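
The spider above yields BossproItem objects, so the project's items.py has to declare matching fields. A minimal sketch of that file, assuming job_title and job_salary are the only fields needed:

import scrapy

class BossproItem(scrapy.Item):
    # fields assigned in the spider's parse method
    job_title = scrapy.Field()
    job_salary = scrapy.Field()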

Full-site data crawling

- Full-site data crawling
    - implemented through manual requests
        - crawl the data of the whole site
        - implement depth crawling
    - manual requests:
        - yield scrapy.Request(url, callback)
        - yield scrapy.FormRequest(url, formdata, callback)

Crawling the detail-page information behind each link

  On top of the requirement above, we also crawl the details of each job posting. To do that, we must first extract each posting's detail-page url, request it, and then parse the details out of that response.

Code example

# -*- coding: utf-8 -*-
import scrapy
from bossDeepPro.items import BossdeepproItem


class BossSpider(scrapy.Spider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python%E5%BC%80%E5%8F%91&city=101010100&industry=&position=']

    # generic url template (the fixed part of the page url)
    url = 'https://www.zhipin.com/c101010100/?query=python开发&page=%d'
    page = 2

    def parse(self, response):
        print('正在爬取第{}页的数据'.format(self.page))  # "crawling the data of page {}"
        li_list = response.xpath('//*[@id="main"]/div/div[3]/ul/li | //*[@id="main"]/div/div[2]/ul/li')
        for li in li_list:
            job_title = li.xpath('.//div[@class="info-primary"]/h3/a/div[1]/text()').extract_first()
            salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()').extract_first()

            # instantiate the item object: it must be shared by parse and parse_detail
            item = BossdeepproItem()
            item['job_title'] = job_title
            item['salary'] = salary

            detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href').extract_first()
            # issue a manual request for the detail page and pass the item along via meta
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={'item': item})

        if self.page <= 5:
            # manually send the requests for the other pages
            new_url = self.url % self.page
            print(new_url)
            self.page += 1
            # manual request; the callback parses the returned data
            yield scrapy.Request(url=new_url, callback=self.parse)

    # parse the job description on the detail page
    def parse_detail(self, response):
        item = response.meta['item']
        job_desc = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()').extract()
        job_desc = ''.join(job_desc)

        item['job_desc'] = job_desc

        yield item
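
As in the first example, the fields assigned in the spider must be declared in the project's items.py. A minimal sketch, assuming these three fields are all that BossdeepproItem needs:

import scrapy

class BossdeepproItem(scrapy.Item):
    # job_title and salary are filled in by parse, job_desc by parse_detail
    job_title = scrapy.Field()
    salary = scrapy.Field()
    job_desc = scrapy.Field()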

Depth crawling

- Depth crawling
    - implemented through manual requests
    - request parameter passing (a compact sketch follows this list): used for persistent storage when data parsed in different callbacks must be stored in the same item object; the item object is passed along with the request.
        - usage scenario: the data Scrapy has to crawl is not all on the same page
        - how to pass: wrap the data in the meta dictionary and hand meta to the callback
            yield scrapy.Request(url, callback, meta)
        - how to receive: in the callback function specified by callback, read it from the response:
            item = response.meta['key']
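
Reduced to its essentials, the meta hand-off looks like the sketch below. The spider name, urls, and field names here are placeholders, not part of the project above. (Scrapy 1.7+ also offers cb_kwargs as an alternative to meta for the same purpose.)

import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['https://example.com/list']

    def parse(self, response):
        item = {'title': 'placeholder'}              # data parsed on the list page
        detail_url = 'https://example.com/detail'    # placeholder detail-page url
        # hand the item to the next callback through the meta dictionary
        yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response):
        item = response.meta['item']                 # receive the item in the callback
        item['desc'] = response.xpath('//body//text()').get()
        yield item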

Workflow of the five core components

Engine (Scrapy)
    Handles the data flow of the whole system and triggers events (the core of the framework).
Scheduler (Scheduler)
    Receives requests sent over by the engine, pushes them into a queue, and returns them when the engine asks again. It can be pictured as a priority queue of URLs (the pages or links to be crawled): it decides which URL is crawled next and removes duplicate URLs.
Downloader (Downloader)
    Downloads web page content and returns it to the spider (the downloader is built on twisted, an efficient asynchronous model).
Spiders (spiders)
    Do the main work: extract the information they need, the so-called items, from specific web pages. They can also extract links and let Scrapy continue crawling the next page.
Item Pipeline (Pipeline)
    Handles the items extracted by the spiders. Its main jobs are persisting items, validating them, and removing unneeded information. When a page has been parsed by a spider, the items are sent to the pipeline and processed in a specific order (a minimal sketch follows below).
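
Persistence itself happens in the item pipeline. A minimal sketch of what a pipelines.py for the first project might look like, assuming the class name BossproPipeline and a plain text file as the storage target; it also has to be enabled through ITEM_PIPELINES in settings.py:

class BossproPipeline:
    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.fp = open('boss.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item yielded by the spider: write one line per item
        self.fp.write(str(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.fp.close()

# in settings.py:
# ITEM_PIPELINES = {'bossPro.pipelines.BossproPipeline': 300}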

POST requests

- Question: in the code so far we never manually sent requests for the starting urls stored in the start_urls list, yet requests for them were indeed sent. How is that achieved?

- Answer: because the spider class in the crawler file inherits the start_requests(self) method from its Spider parent class, and that method issues a request for every url in the start_urls list:

  def start_requests(self):
      for u in self.start_urls:
          yield scrapy.Request(url=u, callback=self.parse)

Note: the default implementation sends GET requests for the starting urls. To send POST requests instead, this method must be overridden in the subclass.

  - Method: override the start_requests method so that it sends a POST request:

def start_requests(self):
    # url to request
    post_url = 'http://fanyi.baidu.com/sug'
    # post request parameters
    formdata = {
        'kw': 'wolf',
    }
    # send the post request
    yield scrapy.FormRequest(url=post_url, formdata=formdata, callback=self.parse)
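
The callback then receives the endpoint's response. A minimal sketch of a matching parse method, assuming the response body is JSON (with import json added at the top of the spider file):

    # inside the same spider class as start_requests above
    def parse(self, response):
        # decode the response body; the sug endpoint is assumed to return JSON
        result = json.loads(response.text)
        print(result)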

How to improve Scrapy crawling efficiency

Increase concurrency:
    By default Scrapy limits the number of concurrent requests; it can be raised where appropriate. In the settings file, CONCURRENT_REQUESTS = 100 raises the concurrency to 100.

Reduce the log level:
    Running Scrapy produces a great deal of log output. To reduce CPU usage, set the log level to INFO or ERROR. In the settings file: LOG_LEVEL = 'INFO'

Disable cookies:
    If cookies are not actually needed, disable them while crawling to reduce CPU usage and improve efficiency. In the settings file: COOKIES_ENABLED = False

Disable retries:
    Retrying failed HTTP requests slows the crawl down, so retries can be disabled. In the settings file: RETRY_ENABLED = False

Reduce the download timeout:
    If some links are very slow, a lower download timeout lets stuck requests be abandoned quickly, which improves efficiency. In the settings file: DOWNLOAD_TIMEOUT = 10 sets the timeout to 10 s.
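
Collected in one place, these tweaks amount to a few lines in settings.py (the values are the ones suggested above and should be adapted per project):

# settings.py - efficiency-related tweaks suggested above
CONCURRENT_REQUESTS = 100      # raise the number of concurrent requests
LOG_LEVEL = 'ERROR'            # only log errors to cut log-related CPU cost
COOKIES_ENABLED = False        # do not track cookies if they are not needed
RETRY_ENABLED = False          # do not retry failed requests
DOWNLOAD_TIMEOUT = 10          # abandon requests that take longer than 10 s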

Scrapy log levels

  - When a program is run with scrapy crawl spiderFileName in the terminal, Scrapy prints its log information.

  - Types of log information:

        ERROR: general errors

        WARNING: warnings

        INFO: general information

        DEBUG: debug information

  - Specifying which log information is output:

    add the following to the settings.py configuration file:

        LOG_LEVEL = 'the desired log level'

        LOG_FILE = 'log.txt' writes the log information to the specified file for storage.

 
