Chapter 7: Scrapy middleware + full-site data crawling based on CrawlSpider

Scrapy middleware

There are two kinds of middleware in Scrapy:
    Downloader middleware: sits between the engine and the downloader
        Function: while requests and responses travel between the engine and the downloader, this middleware can intercept every request and response issued in the project
    Spider middleware: sits between the engine and the spiders

What can we do with an intercepted request?
    Set a proxy IP: request.meta['proxy'] = 'http://ip:port' (see the sketch below)
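A minimal sketch of the request side, assuming a hypothetical middleware class name and placeholder proxy addresses (neither appears in the original post):

import random

class ProxyDownloaderMiddleware(object):
    # placeholder proxy pool; real addresses would come from a proxy provider
    PROXY_POOL = ['http://1.2.3.4:8080', 'http://5.6.7.8:3128']

    def process_request(self, request, spider):
        # set the proxy before the request reaches the downloader
        request.meta['proxy'] = random.choice(self.PROXY_POOL)
        return None  # returning None lets Scrapy continue processing the request normally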

What can we do with an intercepted response?
    Tamper with the response data (in practice the data itself is rarely modified this way)
    Replace the response object with a new one (this is what the Selenium example below does)

Using Selenium in Scrapy
  Define an attribute bro on the spider class (an instantiated Selenium browser object)
  Override closed(self, spider) in the spider and close the browser object inside that method
  In the middleware, obtain the spider's bro attribute through the spider parameter of process_response
  In the middleware, drive the browser (with any automation needed) to obtain the page source
  Use that page source as the response data of a new response object
  Return the new response object


Full-site data crawling based on CrawlSpider

The relationship between CrawlSpider and Spider:
    CrawlSpider is a subclass of Spider
Create a CrawlSpider-based spider file:
    scrapy genspider -t crawl PCPRO www.xxx.com
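For reference, the skeleton this command generates looks roughly like the following (Scrapy's stock crawl template; the class name, spider name, and URLs come from the command arguments):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PcproSpider(CrawlSpider):
    name = 'PCPRO'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        # one Rule per link pattern; follow=True keeps extracting links from followed pages
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item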

Example code:

Spider file code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunShinePro.items import SunshineproItem,sunConetent

# http://wz.sun0769.com/index.php/question/questionType?type=4&page=30
class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']

    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']
    # Link extractor
    #   Function: extracts links from the page source according to the specified rule (allow=regex)
    link = LinkExtractor(allow=r'type=4&page=\d+')
    link_detail = LinkExtractor(allow=r'question/\d+/\d+\.shtml')
    rules = (
        # Rule parser: sends requests to the extracted links and parses the
        # corresponding page source with the specified callback
        Rule(link, callback='parse_item', follow=False),
        Rule(link_detail, callback='parse_detail'),
    )

    def parse_detail(self, response):
        content = response.xpath('/html/body/div[9]/table[2]//tr[1]/td/div[2]//text()').extract()
        content = ''.join(content)

        item = sunConetent()
        item['content'] = content

        yield item

    def parse_item(self, response):
        # Note: if a tbody tag appears when positioning with xpath, the tbody must be skipped
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()
            item = SunshineproItem()
            item['title'] = title
            item['status'] = status

            yield item
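The two item classes imported at the top (SunshineproItem and sunConetent) are not shown in the post; a minimal items.py consistent with the fields used by the spider and pipeline would be:

import scrapy

class SunshineproItem(scrapy.Item):
    title = scrapy.Field()
    status = scrapy.Field()

class sunConetent(scrapy.Item):
    content = scrapy.Field()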

In pipelines.py:

class SunshineproPipeline(object):
    def process_item(self, item, spider):
        if item.__class__.__name__ == 'SunshineproItem':
            print(item['title'], item['status'])

        else:
            print(item['content'])
        return item

Remember to enable the pipeline in settings.py, for example:
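A typical entry (300 is just the conventional priority value from Scrapy's project template):

ITEM_PIPELINES = {
    'sunShinePro.pipelines.SunshineproPipeline': 300,
}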

Using the spider together with Selenium

Example code:

Spider .py file code:

import scrapy

from selenium import webdriver

class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/world/']

    # instantiate a browser object as a class attribute
    bro = webdriver.Chrome(executable_path=r'...')  # chromedriver path elided in the original post

    def parse(self, response):
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('.//div[@class="news_title"]//a/text()').extract_first()
            detail_url = div.xpath('.//div[@class="news_title"]//a/@href').extract_first()
            yield scrapy.Request(detail_url, callback=self.parse_detail)
            print(title, detail_url)

    def parse_detail(self, response):
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        print(content)

    def closed(self, spider):
        # close the browser when the spider finishes
        self.bro.quit()

In middlewares.py:

from time import sleep
from scrapy.http import HtmlResponse

class WangyiproDownloaderMiddleware(object):
    # intercept all responses
    def process_response(self, request, response, spider):
        # the original if-condition was truncated; checking against start_urls is an
        # assumption that matches the intent (only the news-list page needs Selenium rendering)
        if request.url in spider.start_urls:
            # grab the browser object defined on the spider
            bro = spider.bro
            bro.get('https://news.163.com/world/')
            sleep(2)
            # scroll to the bottom so that lazily-loaded news items are rendered
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            page_text = bro.page_source

            # wrap the Selenium-rendered page source in a new response object
            new_response = HtmlResponse(url=bro.current_url, body=page_text, encoding='utf-8', request=request)

            return new_response
        else:
            # must return the original response, otherwise only part of the data can be obtained
            return response

In settings.py we have to configure the following parameters:

# downloader middleware (between the engine and the downloader)
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}
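With the middleware enabled, the project is run from the project directory in the usual way:

scrapy crawl wangyi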

 
