Case study:
- Requirement: crawl the news data from the Domestic section of NetEase News (news.163.com).
- Requirements analysis: clicking the hyperlink into the Domestic section reveals that the news data on the page is dynamically loaded; if the program requests the URL directly, it cannot obtain the dynamically loaded news data. We therefore use selenium to instantiate a browser object, request the URL through that object, and obtain the dynamically loaded news data from the rendered page.
Analysis of how selenium is used inside scrapy:
- When the engine submits the request for the Domestic section URL to the downloader, the downloader fetches the page data, packages it into a response object, and hands it back to the engine, which forwards the response to the spider. The page data stored in the response object the spider receives does not contain the dynamically loaded news. To get that data, we intercept the response object in the downloader middleware before it reaches the engine, tamper with the page data stored inside it so that it carries the dynamically loaded news, and then pass the tampered response object on to the spider for parsing.
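The interception idea above can be sketched independently of Scrapy with simplified stand-in classes (a toy model only; these class and function names are illustrative, not Scrapy's real API, which appears in the middleware code below):

```python
# Toy model of the response-tampering flow: the middleware swaps the
# static response for one built from browser-rendered page source.
class Response:
    """Simplified stand-in for a downloader response object."""
    def __init__(self, url, body):
        self.url = url
        self.body = body

def process_response(request_url, response, rendered_pages, start_urls):
    """If the response belongs to a dynamically loaded section page,
    replace its body with the browser-rendered page source; otherwise
    pass the original response through unchanged."""
    if request_url in start_urls:
        return Response(request_url, rendered_pages[request_url])
    return response

start_urls = ["https://news.163.com/domestic/"]
# what a plain HTTP fetch would return vs. what the browser renders
static = Response(start_urls[0], "<html>no news here</html>")
rendered = {start_urls[0]: "<html>dynamically loaded news</html>"}

fixed = process_response(start_urls[0], static, rendered, start_urls)
print(fixed.body)  # the tampered response now carries the rendered HTML
```

In the real middleware, `rendered_pages[url]` corresponds to `spider.browser.page_source` after selenium has loaded the URL.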
Process for using selenium in scrapy:
- Override the spider file's constructor and instantiate a browser object with selenium inside it (the browser object only needs to be instantiated once).
- Override the spider file's closed(self, spider) method and close the browser object inside it; this method is invoked when the crawler finishes.
- Override the downloader middleware's process_response method so that it intercepts the response object and tampers with the page data stored in it.
- Enable the downloader middleware in the configuration file.
Code:
Spider file
```python
import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem

"""Crawl NetEase domestic and international news headlines and content"""


class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.163.com']
    start_urls = ['https://news.163.com/domestic/', 'https://news.163.com/world/']

    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument('--window-position=0,0')   # initial position of the Chrome window
        options.add_argument('--window-size=1080,800')  # initial size of the Chrome window
        self.browser = webdriver.Chrome(executable_path='C://xx//chromedriver.exe',
                                        chrome_options=options)

    def parse(self, response):
        div_list = response.xpath('//div[@class="ndi_main"]/div')
        for div_item in div_list:
            title = div_item.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div_item.xpath('./div/div[1]/h3/a/@href').extract_first()
            item = WangyiproItem()
            item['title'] = title
            # issue a request for the news detail page; pass the item along via meta
            yield scrapy.Request(url=new_detail_url, callback=self.parse_detail,
                                 meta={'item': item})

    # parse the news content
    def parse_detail(self, response):
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content.strip()
        yield item

    def closed(self, spider):
        self.browser.quit()
```
Middleware file
```python
from scrapy import signals
from time import sleep
from scrapy.http import HtmlResponse


class WangyiproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    # intercept the response object and tamper with it
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Pick out the response objects that need tampering:
        # the request identifies the URL, the spider carries the browser.
        bro = spider.browser  # get the browser object defined on the spider
        if request.url in spider.start_urls:
            # tamper with the response: instantiate a new response object
            # (containing the dynamically loaded news data) to replace the
            # original one; selenium obtains the dynamic data easily
            bro.get(request.url)
            sleep(3)
            bro.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            sleep(1)
            page_text = bro.page_source  # contains the dynamically loaded content
            new_response = HtmlResponse(url=request.url, body=page_text,
                                        encoding='utf-8', request=request)
            return new_response
        else:
            # other requests
            # Must either:
            # - return a Response object
            # - return a Request object
            # - or raise IgnoreRequest
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
```
Pipeline file
```python
import pymysql


class WangyiproPipeline(object):
    # constructor
    def __init__(self):
        self.conn = None  # define instance attributes
        self.cursor = None
        self.num = 0

    # The following methods override the parent class.
    # Executed once when the crawler starts:
    def open_spider(self, spider):
        self.conn = pymysql.Connect(host='192.168.xx.xx', port=3306,
                                    user='root', password='xx',
                                    db='xx_db', charset='utf8')
        print('crawler database started')

    # Handle each item object. Because this method is called many times,
    # the open and close operations are written in the other two methods,
    # each of which runs only once.
    def process_item(self, item, spider):
        author = item['title']
        content = item['content']
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute('insert into qiubai values(%s,%s)',
                                (author, content))
            self.conn.commit()
        except Exception as e:
            print(e, content[0:20])
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('crawler database finished')
        self.cursor.close()
        self.conn.close()
```
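For the pipeline's INSERT to work, the `qiubai` table must already exist with two columns. A possible schema is sketched below; the column names and types are assumptions inferred from the insert statement, not given in the original:

```sql
-- Hypothetical schema matching the pipeline's INSERT; adjust to taste.
CREATE TABLE qiubai (
    author  VARCHAR(255),  -- receives item['title']
    content TEXT           -- receives item['content']
) DEFAULT CHARSET = utf8;
```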
Items file
```python
import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
```
Settings file
```python
# disguise the identity of the request carrier
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36')

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False  # ignore the robots protocol

# show only the specified type of log information
LOG_LEVEL = 'ERROR'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#     'wangyiPro.middlewares.WangyiproSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wangyiPro.pipelines.WangyiproPipeline': 300,
}
```