Scrapy Middleware
Scrapy middleware comes in two kinds. Downloader middleware sits between the engine and the downloader: it can intercept every request and response issued by the project. Spider middleware sits between the engine and the spiders. What can we do with an intercepted request? Apply a proxy IP: request.meta['proxy']. What can we do with an intercepted response? Tamper with the response data (in practice we usually do not modify the data itself), or replace the response object entirely (the common case, e.g. for dynamically loaded pages).
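As a minimal sketch of request interception (the class name and proxy address below are placeholders, not from the original notes), a downloader middleware that applies a proxy could look like this:

```python
class ProxyDownloaderMiddleware(object):
    # Placeholder proxy pool; real proxies would go here.
    PROXIES = ['http://127.0.0.1:8888']

    def process_request(self, request, spider):
        # Intercept the request and route it through a proxy IP.
        # Note: in Scrapy it is request.meta['proxy'], not request.META.
        request.meta['proxy'] = self.PROXIES[0]
        return None  # continue with normal downloading

    def process_response(self, request, response, spider):
        # Intercept the response; here we return it unchanged.
        return response
```

Returning `None` from `process_request` tells Scrapy to keep processing the request through the remaining middlewares and the downloader.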
Using Selenium in Scrapy
1. Define a bro attribute in the spider class (an instantiated Selenium browser object).
2. Override closed(self, spider) in the spider and close the browser object in that method.
3. In the middleware's process_response, get the spider's bro attribute through the spider parameter.
4. In the middleware, use the browser to automate fetching the page source.
5. Use that page source as the body of a new response object.
6. Return the new response object.
Full-site crawling based on CrawlSpider
The relationship between CrawlSpider and Spider:
CrawlSpider is a subclass of Spider.
Creating a CrawlSpider-based spider file:
scrapy genspider -t crawl PCPRO www.xxx.com
Example code:
Spider file code:
```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from sunShinePro.items import SunshineproItem, sunConetent

# http://wz.sun0769.com/index.php/question/questionType?type=4&page=30
class SunSpider(CrawlSpider):
    name = 'sun'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/questionType?type=4&page=']

    # Link extractor: extracts links from the page source according to the
    # specified rule (the `allow` regex).
    link = LinkExtractor(allow=r'type=4&page=\d+')
    link_detail = LinkExtractor(allow=r'question/\d+/\d+\.shtml')

    rules = (
        # Rule parser: sends the responses of the extracted links to the
        # specified callback for data parsing.
        Rule(link, callback='parse_item', follow=False),
        Rule(link_detail, callback='parse_detail'),
    )

    def parse_detail(self, response):
        content = response.xpath('/html/body/div[9]/table[2]//tr[1]/td/div[2]//text()').extract()
        content = ''.join(content)
        item = sunConetent()
        item['content'] = content
        yield item

    def parse_item(self, response):
        # Note: if a tbody tag appears in the xpath, it must be skipped.
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]//tr/td/table//tr')
        for tr in tr_list:
            title = tr.xpath('./td[2]/a[2]/text()').extract_first()
            status = tr.xpath('./td[3]/span/text()').extract_first()
            item = SunshineproItem()
            item['title'] = title
            item['status'] = status
            yield item
```
In pipelines.py:
```python
class SunshineproPipeline(object):
    def process_item(self, item, spider):
        # Distinguish the two item types by class name.
        if item.__class__.__name__ == 'SunshineproItem':
            print(item['title'], item['status'])
        else:
            print(item['content'])
        return item
```
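The branching on item.__class__.__name__ can be exercised outside Scrapy; this standalone sketch uses plain-dict stand-ins for the two item classes (the real ones would subclass scrapy.Item in items.py):

```python
# Stand-ins for the scrapy.Item classes, just to demonstrate the dispatch.
class SunshineproItem(dict):
    pass

class sunConetent(dict):
    pass

def route_item(item):
    # Mirror the pipeline's branching on the item's class name.
    if item.__class__.__name__ == 'SunshineproItem':
        return ('list', item['title'], item['status'])
    return ('detail', item['content'])
```

This pattern lets one pipeline handle several item types yielded by the same spider without isinstance checks on imported item classes.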
Remember to enable the pipeline in settings.py.
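For reference, enabling the pipeline in settings.py looks roughly like this (300 is the priority used in Scrapy's generated settings template; lower values run earlier):

```python
# settings.py (sketch) -- enable the item pipeline
ITEM_PIPELINES = {
    'sunShinePro.pipelines.SunshineproPipeline': 300,
}
```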
Using Selenium together with a Spider
Example code:
Spider file code:
```python
import scrapy
from selenium import webdriver

class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/world/']
    # Instantiate a browser object as a class attribute.
    bro = webdriver.Chrome(executable_path=r'...')  # chromedriver path (elided in the original)

    def parse(self, response):
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('.//div[@class="news_title"]//a/text()').extract_first()
            detail_url = div.xpath('.//div[@class="news_title"]//a/@href').extract_first()
            yield scrapy.Request(detail_url, self.parse_detail)
            print(title, detail_url)

    def parse_detail(self, response):
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        print(content)

    def closed(self, spider):
        # Close the browser when the spider closes.
        self.bro.quit()
```
In middlewares.py:
```python
from time import sleep
from scrapy.http import HtmlResponse

class WangyiproDownloaderMiddleware(object):
    def process_response(self, request, response, spider):
        bro = spider.bro
        # Only the dynamically loaded list page needs Selenium.
        if request.url == 'https://news.163.com/world/':
            bro.get('https://news.163.com/world/')
            sleep(2)
            # Scroll to the bottom several times to trigger lazy loading.
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            page_text = bro.page_source
            new_response = HtmlResponse(url=bro.current_url, body=page_text,
                                        encoding='utf-8', request=request)
            return new_response
        else:
            # Must return the original response, otherwise only part of
            # the data can be obtained.
            return response
```
In settings.py, configure the following parameter:
```python
# Downloader middleware (between the engine and the downloader).
# 543 is the priority used in Scrapy's generated settings template.
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}
```