As usual, let's first analyze the overall process:
- First, use Chrome's developer tools or a packet capture to analyze the page elements and find the URL: https://www.douyu.com/directory/all
- The information for every room is kept in the li items of an unordered list, so we can first get a list of the li element objects and then process each element one by one
- Analyzing the Douyu page's pagination: there is a Next button, an li with class="dy-Pagination-item-custom", but on the last page its class becomes "dy-Pagination-disabled dy-Pagination-next". Since we want Selenium to simulate a click on this button, we should use the plural lookup find_elements_by_xpath() (written find_elements(By.XPATH, ...) in Selenium 4): on the last page it finds nothing and returns an empty list, so the program can terminate. The reason for using the plural form is that the singular find_element lookup raises an error when the element does not exist
- After that it is the usual routine: send the request and get the response, extract the data and the next-page element, save the data, click the next-page element, and repeat
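The stop condition in the steps above can be sketched without a browser. This is only an illustration: `FakePager`, `first_or_none`, and `crawl` are hypothetical names that do not appear in the spider, and `FakePager` merely imitates a plural lookup that returns an empty list on the last page.

```python
def first_or_none(elements):
    """Return the first item of a plural lookup, or None when the list is empty."""
    return elements[0] if elements else None


class FakePager:
    """Hypothetical stand-in for the browser: a list of pages and a current index."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def find_rooms(self):
        # Rooms on the current page.
        return list(self.pages[self.index])

    def find_next_buttons(self):
        # Imitates a plural lookup: an empty list on the last page, no exception.
        return ["next"] if self.index < len(self.pages) - 1 else []

    def click_next(self):
        self.index += 1


def crawl(pager):
    """Collect rooms page by page until no clickable Next button is found."""
    collected = []
    while True:
        collected.extend(pager.find_rooms())
        next_button = first_or_none(pager.find_next_buttons())
        if next_button is None:
            break  # last page reached
        pager.click_next()
    return collected


rooms = crawl(FakePager([["room-1", "room-2"], ["room-3"], ["room-4"]]))
print(rooms)
```

Because the lookup returns a list, the last page is an ordinary "nothing found" case rather than an exception that has to be caught.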
Two pitfalls I ran into:
- time.sleep() is needed to force a wait until the page has fully loaded before getting the elements; otherwise an error is raised. How many seconds to sleep depends on your network speed
- When locating with XPath, some class attributes on the site carry a leading or trailing space, such as class=" abc" or class="abc ". The XPath expression must include the same spaces, otherwise nothing is matched
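The whitespace pitfall comes from XPath attribute equality being an exact string comparison. A minimal sketch using the standard library's xml.etree.ElementTree (which supports a small XPath subset) shows the effect; the markup here is made up for illustration and is not copied from the Douyu page.

```python
import xml.etree.ElementTree as ET

# Made-up markup with a class attribute that starts with a space.
SNIPPET = """
<ul>
  <li class="dy-Pagination-item">1</li>
  <li class=" dy-Pagination-next">Next</li>
</ul>
"""

root = ET.fromstring(SNIPPET)

# The leading space is part of the attribute value, so it must be in the query.
with_space = root.findall(".//li[@class=' dy-Pagination-next']")
without_space = root.findall(".//li[@class='dy-Pagination-next']")

print(len(with_space), len(without_space))
```

Only the query that includes the leading space finds the element. When a lookup unexpectedly returns nothing, copying the class value verbatim from the developer tools is a quick first check.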
Code:
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By


    class DouyuSpider(object):
        def __init__(self):
            self.start_url = 'https://www.douyu.com/directory/all'
            self.driver = webdriver.Chrome()

        def get_content_list(self):
            time.sleep(10)  # force a 10-second wait, otherwise the lookups may fail
            li_list = self.driver.find_elements(By.XPATH, '//ul[@class="layout-Cover-list"]/li')
            content_list = []
            for li in li_list:
                item = {}
                # The class strings must match the page's attributes exactly, spaces included
                item['room_img'] = li.find_element(By.XPATH, './/img[@class="DyImg-content is-normal "]').get_attribute('src')
                item['room_title'] = li.find_element(By.XPATH, './/h3[@class="DyListCover-intro"]').text
                item['root_category'] = li.find_element(By.XPATH, './/span[@class="DyListCover-zone "]').text
                item['author_name'] = li.find_element(By.CLASS_NAME, 'DyListCover-user').text
                item['watch_num'] = li.find_element(By.CLASS_NAME, 'DyListCover-hot').text
                content_list.append(item)
                print(item)  # print the information of each live room as it is collected
            # Get the next-page element. The plural find_elements is used so that the
            # last page yields an empty list instead of raising an error.
            next_url = self.driver.find_elements(By.XPATH, '//li[@class=" dy-Pagination-next"]')
            next_url = next_url[0] if len(next_url) > 0 else None
            return content_list, next_url

        def save_content_list(self, content_list):
            pass  # saving the data is not demonstrated here

        def run(self):  # main logic
            # 1. start_url
            # 2. send the request and get the response
            self.driver.maximize_window()
            self.driver.get(self.start_url)
            # 3. extract the data and the next-page element
            content_list, next_url = self.get_content_list()
            # 4. save the data
            self.save_content_list(content_list)
            # 5. click the next-page element and loop
            while next_url is not None:
                next_url.click()
                content_list, next_url = self.get_content_list()
                self.save_content_list(content_list)


    if __name__ == '__main__':
        douban = DouyuSpider()
        douban.run()