First, crawling the Douban Top 250
First of all, decide what we want to crawl:
movie name, movie URL, director, starring actors,
year, genre, score, number of ratings, and a one-line synopsis.
1. Analyze the index-page URLs
First page: https://movie.douban.com/top250?start=0&filter=
Second page: https://movie.douban.com/top250?start=25&filter=
Third page: https://movie.douban.com/top250?start=50&filter=
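The pattern is easy to generate in code: each page lists 25 movies and the `start` query parameter grows by 25 per page. A quick sketch (the helper name `page_urls` is mine, not from the post):

```python
# Each index page lists 25 movies; `start` grows by 25 from page to page.
def page_urls(pages=10, step=25):
    """Build the URL of every Top 250 index page."""
    return [f'https://movie.douban.com/top250?start={n * step}&filter='
            for n in range(pages)]

for url in page_urls(3):
    print(url)
```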
A crawler has three steps: send the request, parse the data, save the data.

1. Send the request
import requests
import re

def get_page(url):
    response = requests.get(url)
    return response
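In practice, Douban may reject requests that carry the default `requests` User-Agent, so it is safer to send a browser-like header. A hedged variant of `get_page` (the exact UA string below is just an example; any current browser User-Agent works):

```python
import requests

# Example browser-like header; the exact value is an assumption,
# not something from the original post.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def get_page(url):
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response
```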
2. Parse the data
Write a regular-expression rule that matches every field of each movie:
movie name, movie URL, director, starring actors, year, genre, score, number of ratings, synopsis
# <div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?<span class="title">(.*?)</span>
# .*? 导演: (.*?)主演: (.*?)<br>(.*?)</p>.*?<span class="rating_num"
# .*?>(.*?)</span>.*? <span>(.*?)人评价</span>.*? <span class="inq">(.*?)</span>
The parsing function:
def parse_index(html):
    movie_list = re.findall(
        '<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">'
        '.*?<span class="title">(.*?)</span>.*? 导演: (.*?)主演: (.*?)<br>(.*?)</p>'
        '.*?<span class="rating_num" .*?>(.*?)</span>.*? <span>(.*?)人评价</span>'
        '.*? <span class="inq">(.*?)</span>',
        html, re.S)
    return movie_list
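To see how the pattern works without hitting the network, it can be run against a stripped-down fragment that mimics one entry of the list page (the HTML below is made up for illustration; the real page carries much more markup). `re.S` makes `.` match newlines, which is why the pattern can span multiple lines:

```python
import re

PATTERN = ('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">'
           '.*?<span class="title">(.*?)</span>.*? 导演: (.*?)主演: (.*?)<br>(.*?)</p>'
           '.*?<span class="rating_num" .*?>(.*?)</span>.*? <span>(.*?)人评价</span>'
           '.*? <span class="inq">(.*?)</span>')

# Simplified, hand-written fragment of one list entry (not real page source).
html = '''
<div class="item">
  <em class="">1</em>
  <a href="https://movie.douban.com/subject/1292052/">
  <span class="title">肖申克的救赎</span>
  <p> 导演: 弗兰克 主演: 蒂姆<br>1994 / 犯罪</p>
  <span class="rating_num" property="v:average">9.7</span>
  <span>2000000人评价</span>
  <span class="inq">希望让人自由。</span>
</div>
'''

movie = re.findall(PATTERN, html, re.S)[0]
print(movie[0], movie[2], movie[6])  # rank, title, score
```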
3. Save the data
def save_data(movie):
    top, m_url, name, daoyan, actor, year_type, point, commit, desc = movie
    year_type = year_type.strip('\n')
    data = f'''
========== Welcome to ==========
Rank: {top}
Movie name: {name}
Movie url: {m_url}
Director: {daoyan}
Starring: {actor}
Year / genre: {year_type}
Score: {point}
Number of ratings: {commit}
Synopsis: {desc}
================================
\n
'''
    print(data)
    with open('douban_top250.txt', 'a', encoding='utf-8') as f:
        f.write(data)
    print(f'Movie: {name} written successfully')
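A plain text file is hard to process later; the same tuples could just as easily go into a CSV. A minimal sketch (the function name `save_csv` and the column names are mine; the tuple order matches `parse_index`'s capture groups):

```python
import csv

# Hypothetical column names, in the order parse_index returns the fields.
FIELDS = ['rank', 'url', 'name', 'director', 'actors',
          'year_genre', 'score', 'ratings', 'synopsis']

def save_csv(movies, path='douban_top250.csv'):
    """Write every movie tuple into one CSV file with a header row."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        writer.writerows(movies)
```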
The main function is:
if __name__ == '__main__':
    # 1. build every index-page URL
    num = 0
    for line in range(10):
        url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(url)

        # 2. send a request to each index page
        index_res = get_page(url)

        # parse the index page for movie information
        movie_list = parse_index(index_res.text)

        for movie in movie_list:
            # print(movie)
            # 3. save the data
            save_data(movie)
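The main loop above can be generalized so the three steps are swappable, with a pause between requests to stay polite (the function name `crawl` is mine; `fetch`, `parse` and `save` stand in for `get_page`, `parse_index` and `save_data`):

```python
import time

def crawl(urls, fetch, parse, save, delay=1.0):
    """Fetch each page, parse it, save every record, and sleep
    between requests so the site is less likely to throttle us."""
    for url in urls:
        html = fetch(url)
        for record in parse(html):
            save(record)
        time.sleep(delay)
```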
Second, the basic use of Selenium
from selenium import webdriver  # drives the browser
# from selenium.webdriver import ActionChains  # drag an element, e.g. for slider captchas
from selenium.webdriver.common.by import By  # how to look elements up: By.ID, By.CSS_SELECTOR, ...
from selenium.webdriver.common.keys import Keys  # keyboard operations
from selenium.webdriver.support import expected_conditions as EC  # used together with WebDriverWait below
from selenium.webdriver.support.wait import WebDriverWait  # wait for certain elements to load
import time

# Option 1: pass the driver's absolute path when opening the browser
# driver = webdriver.Chrome(r'absolute path to the driver/webdriver.exe')

# Option 2: drop webdriver.exe into the Scripts folder of the Python
# install directory and add that folder to the PATH environment variable
driver = webdriver.Chrome(r'D:\programming\Python\Scripts\chromedriver.exe')
try:
    driver.get('https://www.jd.com/')

    # get a wait object: wait up to 10 seconds for an element to load
    wait = WebDriverWait(driver, 10)

    # look up the element whose id is "key"
    input_tag = wait.until(EC.presence_of_element_located(
        (By.ID, 'key')
    ))
    time.sleep(5)

    # type the product name into the search box
    input_tag.send_keys('doll')

    # press the Enter key
    input_tag.send_keys(Keys.ENTER)

    time.sleep(20)

finally:
    # close the browser and release OS resources
    driver.close()
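What `WebDriverWait(driver, 10).until(...)` does is conceptually simple: it re-checks a condition on a short interval until it returns something truthy or the timeout expires. A simplified, browser-free model of that idea (my own sketch, not Selenium's actual implementation):

```python
import time

def wait_until(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns something truthy or `timeout`
    expires -- a simplified model of WebDriverWait(driver, 10).until(...)."""
    end = time.time() + timeout
    while time.time() < end:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError(f'condition not met within {timeout} seconds')
```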
Selenium selectors
from selenium import webdriver  # web driver
from selenium.webdriver.common.keys import Keys  # keyboard operations
import time

driver = webdriver.Chrome()

try:
    # implicit wait: must be set before calling get()
    # waits up to 10 seconds for any element to load
    driver.implicitly_wait(10)

    driver.get('https://www.baidu.com/')

    # explicit wait: used after calling get()
    time.sleep(5)

    '''
    ============= all the find methods =============
    element:  find one tag
    elements: find all matching tags
    '''
    # auto-login to Baidu -- start
    # 1. find_element_by_link_text  # find by link text
    login_link = driver.find_element_by_link_text('登录')
    login_link.click()  # click the login link

    time.sleep(1)

    # 2. find_element_by_id  # find by id
    user_login = driver.find_element_by_id('TANGRAM__PSP_10__footerULoginBtn')
    user_login.click()

    time.sleep(1)

    # 3. find_element_by_class_name
    user = driver.find_element_by_class_name('pass-text-input-userName')
    user.send_keys('*****')

    # 4. find_element_by_name
    pwd = driver.find_element_by_name('password')
    pwd.send_keys('*****')

    submit = driver.find_element_by_id('TANGRAM__PSP_10__submit')
    submit.click()
    # auto-login to Baidu -- end

    # 5. find_element_by_partial_link_text
    # find by partial link text
    login_link = driver.find_element_by_partial_link_text('登')
    login_link.click()

    # 6. find_element_by_css_selector
    # find an element by CSS selector
    # .  means class
    # #  means id
    login2_link = driver.find_element_by_css_selector('.tang-pass-footerBarULogin')
    login2_link.click()

    # 7. find_element_by_tag_name
    div = driver.find_elements_by_tag_name('div')
    print(div)

    time.sleep(20)

finally:
    # close the browser and release OS resources
    driver.close()