Crawling all live room information on Douyu with Selenium

First, the general process:

  • First, open the page in Chrome (or capture packets) and inspect its elements; the URL is: https://www.douyu.com/directory/all
  • All the room information is kept in `li` elements inside an unordered list, so we can first get the list of `li` objects and then process each element one by one
  • Inspecting the Douyu page, there is a Next button, an `li` with class="dy-Pagination-item-custom"; when we reach the last page, the class becomes "dy-Pagination-disabled dy-Pagination-next". We use Selenium to simulate clicking this button, and we should locate it with the plural find_elements_by_xpath(): on the last page it simply returns an empty list, so we can terminate the program. With the singular find_element, a missing element raises an exception instead
  • The rest is the usual routine: send the request, get the response, extract the data and the next-page element, save the data, click the next-page element, and loop...
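The find_elements trick described above can be sketched without a real browser. `FakeDriver` and `get_next_button` here are hypothetical stand-ins for illustration, not part of Selenium: the point is that the plural lookup returns a list, which is empty when nothing matches, so the absence of a Next button is detected without catching an exception.

```python
class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver; returns canned results."""
    def __init__(self, elements):
        self._elements = elements

    def find_elements_by_xpath(self, xpath):
        # find_elements (plural) always returns a list, possibly empty
        return list(self._elements)


def get_next_button(driver):
    # Empty list on the last page -> return None instead of raising
    candidates = driver.find_elements_by_xpath('//li[@class=" dy-Pagination-next"]')
    return candidates[0] if len(candidates) > 0 else None


print(get_next_button(FakeDriver(['<next button>'])))  # the button element
print(get_next_button(FakeDriver([])))                 # None on the last page
```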

Two pitfalls encountered:

  • We need time.sleep() to force-wait until the page has loaded completely before getting the elements, otherwise an error occurs; adjust the number of seconds to your network speed
  • When locating by XPath, some class attributes on the site look like class=" abc" or class="abc ", with a space before or after the name. The XPath must include the same space, otherwise nothing is matched
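The whitespace pitfall above can be demonstrated with the standard library's ElementTree instead of Selenium (the HTML snippet here is a made-up example): an `@class="…"` predicate compares against the literal attribute string, so a trailing space in the HTML must appear in the XPath too.

```python
import xml.etree.ElementTree as ET

# Minimal document whose img element has a trailing space in its class attribute
html = '<ul><li><img class="DyImg-content is-normal " src="cover.jpg"/></li></ul>'
root = ET.fromstring(html)

# XPath reproduces the trailing space exactly -> element found
with_space = root.findall('.//img[@class="DyImg-content is-normal "]')
# Space omitted -> the literal comparison fails, nothing matches
without_space = root.findall('.//img[@class="DyImg-content is-normal"]')

print(len(with_space))     # 1
print(len(without_space))  # 0
```

In full XPath engines, `contains(@class, "DyImg-content")` is a looser match that sidesteps the problem, at the cost of also matching other classes that contain the substring.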

Code:

 

import time
from selenium import webdriver


class DouyuSpider(object):
    def __init__(self):
        self.start_url = 'https://www.douyu.com/directory/all'
        self.driver = webdriver.Chrome()

    def get_content_list(self):
        time.sleep(10)  # force-wait for the page to finish loading, otherwise locating elements may fail
        li_list = self.driver.find_elements_by_xpath('//ul[@class="layout-Cover-list"]/li')
        content_list = []
        for li in li_list:
            item = {}
            item['room_img'] = li.find_element_by_xpath('.//img[@class="DyImg-content is-normal "]').get_attribute('src')
            item['room_title'] = li.find_element_by_xpath('.//h3[@class="DyListCover-intro"]').text
            item['root_category'] = li.find_element_by_xpath('.//span[@class="DyListCover-zone"]').text
            item['author_name'] = li.find_element_by_class_name('DyListCover-user').text
            item['watch_num'] = li.find_element_by_class_name('DyListCover-hot').text
            content_list.append(item)
            print(item)  # print each room's information as it is collected
        # Locate the next-page element with find_elements (plural): on the last
        # page the button is absent, so we get an empty list instead of an exception
        next_url = self.driver.find_elements_by_xpath('//li[@class=" dy-Pagination-next"]')
        next_url = next_url[0] if len(next_url) > 0 else None
        return content_list, next_url

    def save_content_list(self, content_list):
        pass  # saving the data is not demonstrated here

    def run(self):  # main logic
        # 1. start_url
        # 2. send the request, get the response
        self.driver.maximize_window()
        self.driver.get(self.start_url)
        # 3. extract the data and the next-page element
        content_list, next_url = self.get_content_list()
        # 4. save the data
        self.save_content_list(content_list)
        # 5. click the next-page element and loop
        while next_url is not None:
            next_url.click()
            content_list, next_url = self.get_content_list()
            self.save_content_list(content_list)


if __name__ == '__main__':
    douyu = DouyuSpider()
    douyu.run()
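The fixed time.sleep(10) in get_content_list works but always pays the full wait. Selenium itself offers WebDriverWait for explicit waits; the general idea can be sketched as a plain polling helper (poll_until is a hypothetical name, not a Selenium API):

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    Unlike a fixed sleep, this returns as soon as the page is ready.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError('condition not met within %.1fs' % timeout)
```

In the spider this could replace the sleep with something like `poll_until(lambda: self.driver.find_elements_by_xpath('//ul[@class="layout-Cover-list"]/li'))`, waiting only as long as the room list actually takes to appear.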

 


Origin www.cnblogs.com/springionic/p/11140982.html