Day02: crawling the Douban Top 250 movies + selenium

I. Crawling the Douban Top 250

First, decide what to crawl:

  Movie name, movie URL, director, starring actors,
  year and genre, score, number of reviews, one-line introduction

1. Analyze the URLs of all the list pages
First page: https://movie.douban.com/top250?start=0&filter=
Second page: https://movie.douban.com/top250?start=25&filter=
Third page: https://movie.douban.com/top250?start=50&filter=
Each page lists 25 movies, so the start parameter grows by 25 from one page to the next.
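The pattern in the URLs above can be sketched as a small helper. This is only an illustration: the names BASE_URL, PAGE_SIZE and page_urls are made up here, and the page count of 10 is simply 250 movies divided by 25 per page.

```python
BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25  # movies per list page, inferred from the start values 0, 25, 50


def page_urls(pages=10):
    """Return the list-page URLs, one per page (hypothetical helper)."""
    return [f"{BASE_URL}?start={page * PAGE_SIZE}&filter=" for page in range(pages)]


urls = page_urls()
print(urls[0])   # first page, start=0
print(urls[-1])  # last page, start=225
```

Building the list up front keeps the URL arithmetic in one place instead of mixing it into the request loop.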

The three steps of a crawler

1. Send the request

import requests
import re


def get_page(url):
    # Send a GET request and return the response object
    response = requests.get(url)
    return response

2. Parse the data

Write one regular expression that captures every piece of information about a movie, in this order:

Ranking, movie URL, movie name, director, starring actors, year and genre, score, number of reviews, one-line introduction

The Chinese literals 导演 (director), 主演 (starring) and 人评价 (people rated) are text that appears on the page, so they must stay in the pattern unchanged.

# <div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>
# .*? 导演: (.*?)主演: (.*?)<br>(.*?)</p>.*?<span class="rating_num"
# .*?>(.*?)</span>.*? <span>(.*?)人评价</span>.*? <span class="inq">(.*?)</span>

Parsing function

def parse_index(html):
    # re.S lets "." match newlines, so one pattern can span a whole movie entry.
    # Each match is a tuple of nine captured strings.
    movie_list = re.findall('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*? 导演: (.*?)主演: (.*?)<br>(.*?)</p>.*?<span class="rating_num" .*?>(.*?)</span>.*? <span>(.*?)人评价</span>.*? <span class="inq">(.*?)</span>',html,re.S)
    return movie_list
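To see how the non-greedy groups and re.S work together, here is a minimal sketch on a made-up fragment. SAMPLE_HTML and the example.com URLs are invented for illustration and are much shorter than the real Douban markup; the pattern is a cut-down version of the one above with only four groups.

```python
import re

# Hypothetical, heavily simplified fragment (not the real page source).
SAMPLE_HTML = '''
<div class="item">
    <em class="">1</em>
    <a href="https://example.com/movie/1">
        <span class="title">Sample Movie</span>
    </a>
    <span class="rating_num" property="v:average">9.7</span>
</div>
<div class="item">
    <em class="">2</em>
    <a href="https://example.com/movie/2">
        <span class="title">Another Movie</span>
    </a>
    <span class="rating_num" property="v:average">9.6</span>
</div>
'''

# Adjacent string literals are concatenated, which keeps a long pattern readable.
PATTERN = (
    '<div class="item">.*?<em class="">(.*?)</em>'
    '.*?<a href="(.*?)">'
    '.*?<span class="title">(.*?)</span>'
    '.*?<span class="rating_num" .*?>(.*?)</span>'
)

# With re.S, "." also matches newlines, so the pattern can cross line breaks.
movies = re.findall(PATTERN, SAMPLE_HTML, re.S)
for rank, url, title, score in movies:
    print(rank, url, title, score)

# Without re.S nothing matches, because every field sits on its own line.
without_flag = re.findall(PATTERN, SAMPLE_HTML)
print(without_flag)
```

Each ".*?" skips as little text as possible, which is what stops one match from swallowing the next movie entry.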

3. Save the data

def save_data(movie):
    # Unpack the nine fields in the order the regular expression captures them
    top, m_url, name, daoyan, actor, year_type, point, commit, desc = movie
    # Remove the newlines and whitespace surrounding the year/genre text
    year_type = year_type.strip()
    data = f'''
        ======== Welcome ========
        movie ranking: {top}
        movie name: {name}
        movie url: {m_url}
        film director: {daoyan}
        movie starring: {actor}
        year and genre: {year_type}
        film score: {point}
        number of reviews: {commit}
        film synopsis: {desc}
        =========================
        \n
    '''
    print(data)
    # Mode 'a' appends, so every movie is added to the same file
    with open('douban_top250.txt', 'a', encoding='utf-8') as f:
        f.write(data)
    print(f'movie: {name} successfully written')
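The year/genre field usually needs more cleaning than a strip. The sketch below assumes the captured text looks like the sample string, with "&nbsp;/&nbsp;" between the parts; that shape is an assumption about the page markup, and clean_year_type is a hypothetical helper, not part of the original code.

```python
import html


def clean_year_type(raw):
    """Split a raw year/genre string into clean parts (hypothetical helper)."""
    # html.unescape turns "&nbsp;" into the non-breaking space "\xa0"
    text = html.unescape(raw).replace("\xa0", " ")
    return [part.strip() for part in text.split("/")]


# Made-up sample in the assumed shape of the captured group
sample = "\n        1994&nbsp;/&nbsp;USA&nbsp;/&nbsp;Crime Drama\n    "
parts = clean_year_type(sample)
print(parts)
```

Splitting the field this way would let the year and the genre be written on separate lines of the output file.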

The main function:

if __name__ == '__main__':
    # Build the URL of every list page
    num = 0
    for line in range(10):
        url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(url)

        # 1. Send a request to each list page
        index_res = get_page(url)

        # 2. Parse the page to extract the movie information
        movie_list = parse_index(index_res.text)

        for movie in movie_list:
            # print(movie)

            # 3. Save the data
            save_data(movie)

 

II. Basic use of selenium

from selenium import webdriver  # drives the browser
# from selenium.webdriver import ActionChains  # for dragging the image when cracking slider CAPTCHAs
from selenium.webdriver.common.by import By  # how to locate elements: By.ID, By.CSS_SELECTOR, ...
from selenium.webdriver.common.keys import Keys  # keyboard operations
from selenium.webdriver.support import expected_conditions as EC  # used together with WebDriverWait below
from selenium.webdriver.support.wait import WebDriverWait  # waits for certain elements to load
import time

# Option 1: open the browser by passing the driver path
# driver = webdriver.Chrome(r'absolute path to the driver/webdriver.exe')

# Option 2: put the webdriver .exe driver into the Scripts folder of the Python interpreter's installation directory
# Add that Scripts folder to the environment variables
# Add the Python interpreter's installation directory to the environment variables
# Note: passing the driver path as a positional argument is the Selenium 3 style used in this post
driver = webdriver.Chrome(r'D:\programming\Python\Scripts\chromedriver.exe')
try:
    driver.get('https://www.jd.com/')
    # Create a wait object with a 10-second timeout
    # It waits up to ten seconds for a given tag to load
    wait = WebDriverWait(driver, 10)

    # Find the element whose id is key
    input_tag = wait.until(EC.presence_of_element_located(
        (By.ID, 'key')
    ))
    time.sleep(5)

    # Type a product name into the search box
    input_tag.send_keys('doll')

    # Press the Enter key
    input_tag.send_keys(Keys.ENTER)

    time.sleep(20)

finally:
    # Close the browser and release operating system resources
    driver.close()
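The wait.until(...) call above keeps re-checking a condition until it returns something truthy or the timeout expires. The sketch below imitates that polling loop in plain Python so it runs without a browser. wait_until, WaitTimeout and element_present are invented names; the real WebDriverWait raises selenium's TimeoutException and also ignores "element not found" errors while polling, which this simplified version does not model.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (stand-in for TimeoutException)."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Fake condition: the "element" only shows up on the third check
attempts = {"count": 0}


def element_present():
    attempts["count"] += 1
    return "input#key" if attempts["count"] >= 3 else None


found = wait_until(element_present, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

Unlike a fixed time.sleep, this returns as soon as the condition holds, so a fast page does not pay the full waiting time.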

Selenium selectors

from selenium import webdriver  # web driver
from selenium.webdriver.common.keys import Keys  # keyboard operations
import time

# Note: the find_element_by_* helpers below are the Selenium 3 API;
# Selenium 4.3 and later removed them in favour of find_element(By.<STRATEGY>, value)
driver = webdriver.Chrome()

try:

    # Implicit wait: set it before calling get
    # Waits up to 10 seconds for any element to load
    driver.implicitly_wait(10)

    driver.get('https://www.baidu.com/')

    # Explicit wait: called after get (here just a fixed time.sleep)
    time.sleep(5)

    '''
    =============== All methods ===============
        find_element...  finds a single tag
        find_elements... finds all matching tags
    '''
    # Automatic Baidu login: start
    # 1. find_element_by_link_text: find by the link's text
    login_link = driver.find_element_by_link_text('login')
    login_link.click()  # click the login link

    time.sleep(1)

    # 2. find_element_by_id: find by id
    user_login = driver.find_element_by_id('TANGRAM__PSP_10__footerULoginBtn')
    user_login.click()

    time.sleep(1)

    # 3. find_element_by_class_name
    user = driver.find_element_by_class_name('pass-text-input-userName')
    user.send_keys('*****')

    # 4. find_element_by_name
    pwd = driver.find_element_by_name('password')
    pwd.send_keys('*****')

    submit = driver.find_element_by_id('TANGRAM__PSP_10__submit')
    submit.click()
    # Automatic Baidu login: end

    # 5. find_element_by_partial_link_text
    # Find by part of the link's text
    login_link = driver.find_element_by_partial_link_text('')
    login_link.click()

    # 6. find_element_by_css_selector
    # Find an element with a CSS selector
    # . selects by class
    # # selects by id
    login2_link = driver.find_element_by_css_selector('.tang-pass-footerBarULogin')
    login2_link.click()

    # 7. find_elements_by_tag_name: returns a list of every matching tag
    div = driver.find_elements_by_tag_name('div')
    print(div)

    time.sleep(20)

finally:
    # Close the browser and release operating system resources
    driver.close()
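On Selenium 4.3 or later the find_element_by_* helpers used above no longer exist; the same lookups are written as driver.find_element(strategy, value). The sketch below maps each legacy suffix to its locator strategy string, which is the value behind the matching By constant (By.ID is "id", By.CSS_SELECTOR is "css selector", and so on). find_one, find_all and RecordingDriver are invented for illustration; RecordingDriver only records calls so the example runs without a browser, and a real driver object could be passed in its place.

```python
# Legacy method suffix -> locator strategy string (the values of Selenium's By constants)
LEGACY_TO_STRATEGY = {
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "id": "id",
    "name": "name",
    "class_name": "class name",
    "css_selector": "css selector",
    "tag_name": "tag name",
}


def find_one(driver, legacy_suffix, value):
    """Equivalent of the old driver.find_element_by_<legacy_suffix>(value)."""
    return driver.find_element(LEGACY_TO_STRATEGY[legacy_suffix], value)


def find_all(driver, legacy_suffix, value):
    """Equivalent of the old driver.find_elements_by_<legacy_suffix>(value)."""
    return driver.find_elements(LEGACY_TO_STRATEGY[legacy_suffix], value)


class RecordingDriver:
    """Stand-in driver that records lookups instead of launching a browser."""

    def __init__(self):
        self.calls = []

    def find_element(self, by, value):
        self.calls.append(("find_element", by, value))
        return f"<element {by}={value}>"

    def find_elements(self, by, value):
        self.calls.append(("find_elements", by, value))
        return [f"<element {by}={value}>"]


demo = RecordingDriver()
find_one(demo, "id", "TANGRAM__PSP_10__submit")
find_one(demo, "css_selector", ".tang-pass-footerBarULogin")
find_all(demo, "tag_name", "div")
for call in demo.calls:
    print(call)
```

In real code the usual spelling is driver.find_element(By.ID, 'TANGRAM__PSP_10__submit') with By imported from selenium.webdriver.common.by, as in the first selenium example.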

 


Origin www.cnblogs.com/tanknb/p/11123359.html