Day 02 class notes

Yesterday's review:
    1. Crawler fundamentals
        - the full crawling process:
        1. Send the request
        2. Receive the response data
        3. Parse and extract the valuable data
        4. Save the data

    2. The requests request library
        - GET
            URL
            headers
            Cookies

        - POST
            URL
            headers
            Cookies
            Data
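The parameters listed above map directly onto the requests API. The sketch below builds (but does not send) one request of each kind so the pieces stay visible; the URL, cookie, and form values are placeholders, not real ones from class:

```python
import requests

# Build (but do not send) a GET request carrying headers and cookies.
get_req = requests.Request(
    'GET',
    'http://example.com/page',                      # placeholder URL
    headers={'User-Agent': 'Mozilla/5.0'},
    cookies={'sessionid': 'abc123'},                # placeholder cookie
).prepare()

# Build a POST request: the extra piece is the form data in the body.
post_req = requests.Request(
    'POST',
    'http://example.com/login',                     # placeholder URL
    headers={'User-Agent': 'Mozilla/5.0'},
    data={'username': 'tank', 'password': '123'},   # placeholder form data
).prepare()

print(get_req.headers['Cookie'])   # sessionid=abc123
print(post_req.body)               # username=tank&password=123
```

Preparing the request this way shows exactly what would go over the wire; sending it is just `requests.Session().send(post_req)`.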
    3. Crawling videos from the Xiaohua site
        1. Parse the home page and extract the detail-page URLs
        2. Extract the video URL from each detail page
        3. Fetch the video as a binary stream and write it to a local file
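Steps 2 and 3 above can be sketched with re and requests. The `<source>` markup in the sample is an assumption for illustration (the site's real detail-page HTML may differ), and save_video is defined but not called here since it needs a live URL:

```python
import re
import requests

# Step 2: pull the video URL out of a detail page.
# The <source src="..."> markup is assumed for illustration.
def parse_detail(html):
    result = re.findall('<source src="(.*?)"', html, re.S)
    return result[0] if result else None

# Step 3: fetch the binary stream and write it to a local file.
# stream=True avoids loading the whole video into memory at once.
def save_video(video_url, path):
    response = requests.get(video_url, stream=True)
    with open(path, 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            f.write(chunk)

sample = '<video><source src="http://example.com/1.mp4"></video>'
print(parse_detail(sample))  # http://example.com/1.mp4
```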

    4. Automatically logging in to GitHub
        1. Analyze the request headers and the request body
            - username
            - password
            - token
            - miscellaneous fields

        2. token
            - extracted by parsing the login page

        3. Send the request to the session URL
            - headers:
                - User-Agent

            - Cookies:
                - Cookies from the login page

            - Data:
                - form_data
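The flow above can be sketched with a requests.Session, which carries the login-page cookies into the login request automatically. GitHub embeds the token in a hidden `authenticity_token` input; the other form-field names here are illustrative, and login() is defined but not called:

```python
import re
import requests

# Step 2: extract the token from the login page by parsing.
def parse_token(html):
    return re.findall('name="authenticity_token" value="(.*?)"', html, re.S)[0]

# Step 3: send the login request through a session so cookies carry over.
# Not called here; form-field names besides the token are illustrative.
def login(username, password):
    session = requests.Session()
    login_page = session.get('https://github.com/login',
                             headers={'User-Agent': 'Mozilla/5.0'})
    token = parse_token(login_page.text)
    form_data = {
        'login': username,
        'password': password,
        'authenticity_token': token,
        'commit': 'Sign in',
    }
    session.post('https://github.com/session',
                 headers={'User-Agent': 'Mozilla/5.0'},
                 data=form_data)
    return session

sample = '<input type="hidden" name="authenticity_token" value="abc123==" />'
print(parse_token(sample))  # abc123==
```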
**************** ****************************

Today's contents:

1. Crawling Douban movie information with the requests library
    - request URL
        https://movie.douban.com/top250

    - request method
        GET

    - request headers
        User-Agent
        Cookies

1. Crawling Douban movies

Douban information to extract:
movie rank, movie name, movie URL, movie director, movie starring actors,
movie year, genre, movie rating, number of reviews, movie synopsis
1. Analyze the URLs of all the pages
First page:
https://movie.douban.com/top250?start=0&filter=
Second page:
https://movie.douban.com/top250?start=25&filter=
Third page:
https://movie.douban.com/top250?start=50&filter=
...
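Since the start parameter grows by 25 per page, all ten page URLs can be built from the page index:

```python
# Each page shifts start by 25, so the ten Top-250 page URLs
# follow directly from the page index.
urls = [
    f'https://movie.douban.com/top250?start={page * 25}&filter='
    for page in range(10)
]
for url in urls:
    print(url)
```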
import requests
import re
The three steps of a crawler:
1. Send the request
def get_page(url):
    response = requests.get(url)
    # print(response.text)
    return response
2. Parse the data
Fields, in regex-group order: movie rank, movie URL, movie name, director, starring actors, year/genre, rating, number of reviews, synopsis
<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演: (.*?)主演:(.*?)<br>(.*?)</p>.*?<span class="rating_num" property="v:average">(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>
def parse_index(html):
    movie_list=re.findall('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">.*?<span class="title">(.*?)</span>.*?导演: (.*?)主演:(.*?)<br>(.*?)</p>.*?<span class="rating_num" property="v:average">(.*?)</span>.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>',html,re.S)
    return movie_list
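The pattern can be sanity-checked against a hand-made snippet. The HTML below is a cut-down imitation of one Douban list item, not the real page:

```python
import re

# A cut-down imitation of one list item from the Douban Top 250 page,
# just enough for the parse_index pattern to match.
sample = '''
<div class="item"><em class="">1</em>
<a href="https://movie.douban.com/subject/1292052/">
<span class="title">The Shawshank Redemption</span>
导演: Frank Darabont 主演:Tim Robbins<br>1994 / USA / crime</p>
<span class="rating_num" property="v:average">9.6</span>
<span>1469489人评价</span>
<span class="inq">Hope sets people free.</span>
'''

pattern = ('<div class="item">.*?<em class="">(.*?)</em>.*?<a href="(.*?)">'
           '.*?<span class="title">(.*?)</span>.*?导演: (.*?)主演:(.*?)<br>(.*?)</p>'
           '.*?<span class="rating_num" property="v:average">(.*?)</span>'
           '.*?<span>(.*?)人评价</span>.*?<span class="inq">(.*?)</span>')

# re.S lets .*? cross line breaks; each match is a 9-tuple of groups.
movie = re.findall(pattern, sample, re.S)[0]
print(movie[0])  # 1
print(movie[2])  # The Shawshank Redemption
print(movie[6])  # 9.6
```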
3. Save the data
def save_data(movie):
    top, m_url, name, daoyan, actor, year_type, point, commit, desc = movie
    year_type = year_type.strip('\n')
    data = f'''
    ========= Welcome to watch the movie ===========
            movie rank: {top}
            movie URL: {m_url}
            movie name: {name}
            movie director: {daoyan}
            movie starring: {actor}
            year / genre: {year_type}
            movie rating: {point}
            movie reviews: {commit}
            movie synopsis: {desc}
    ========= Thank you for watching ===========
    \n
    \n
    '''
    print(data)
    with open('douban_top250.txt', 'a', encoding='utf-8') as f:
        f.write(data)
    print(f'Movie: {name} written successfully ...')
if __name__ == '__main__':
    # build all the page URLs
    num = 0
    for line in range(10):
        url = f'https://movie.douban.com/top250?start={num}&filter='
        num += 25
        print(url)
        # 1. send a request for each page
        index_res = get_page(url)
        # 2. parse the page for the movie information
        movie_list = parse_index(index_res.text)
        for movie in movie_list:
            # 3. save the data
            save_data(movie)

(Sample output, for reference only:)

========= Welcome to watch the movie ===========
movie rank: 1
movie URL: https://movie.douban.com/subject/1292052/
movie name: The Shawshank Redemption
movie director: Frank Darabont&nbsp;&nbsp;&nbsp;
movie starring: Tim Robbins / ...
year / genre: 1994&nbsp;/&nbsp;USA&nbsp;/&nbsp;crime drama
movie rating: 9.6
movie reviews: 1,469,489
movie synopsis: Hope sets people free.
========= Thank you for watching ===========

2. The selenium request library
    1. What is selenium?
        Selenium is an automated testing tool. It works by driving a
        browser to perform a set of predefined operations. Since crawling
        is essentially simulating a browser, selenium can be used for
        crawling too.

    2. Why use selenium?
        Advantages:
            - executes JS code
            - no need to analyze complex request flows
            - can interact with the browser: pop-ups, scrolling, etc.
            - ***** fetches dynamically loaded data
            - *** gets past login authentication

        Disadvantages:
            - low efficiency

    3. Installation and usage
        1. Install the selenium library:
            pip3 install selenium

        2. A browser must be installed:
            Chrome or Firefox

        3. Install the browser driver:
            http://npm.taobao.org/mirrors/chromedriver/2.38/
            Windows:
                download the win32 driver

2. Basic use of selenium

from selenium import webdriver  # web driver
from selenium.webdriver.common.by import By  # ways to locate elements: By.ID, By.CSS_SELECTOR
from selenium.webdriver.common.keys import Keys  # keyboard operations
from selenium.webdriver.support import expected_conditions as EC  # used together with WebDriverWait
from selenium.webdriver.support.wait import WebDriverWait  # wait for certain elements to load
import time

# open a browser through the driver
driver = webdriver.Chrome()
try:
    driver.get('http://www.jd.com')
    # get an explicit-wait object that waits
    # up to 10 seconds for a tag to load
    wait = WebDriverWait(driver, 10)
    # find the element whose id is "key"
    input_tag = wait.until(EC.presence_of_element_located(
        (By.ID, 'key')
    ))
    time.sleep(5)
    # type the product name into the input box
    input_tag.send_keys('doll')
    # press the Enter key
    input_tag.send_keys(Keys.ENTER)
    time.sleep(20)
finally:
    # close the browser and release OS resources
    driver.close()
3. selenium selectors
from selenium import webdriver  # web driver
from selenium.webdriver.common.keys import Keys  # keyboard operations
import time

driver = webdriver.Chrome()
try:
    # implicit wait: must be set before calling get()
    # waits up to 10 seconds for any element to load
    driver.implicitly_wait(10)

    driver.get('https://www.baidu.com/')

    # explicit wait: used after calling get()
    time.sleep(5)

    # =============== all locator methods ===================
    # find_element  finds one tag
    # find_elements finds all matching tags
    # start logging in to Baidu automatically
    # the locator methods are as follows:

    # 1. find_element_by_link_text: find by link text
    login_link = driver.find_element_by_link_text('登录')
    login_link.click()  # click the login link

    time.sleep(1)

    # 2. find_element_by_id: find by id
    user_login = driver.find_element_by_id('TANGRAM__PSP_10__footerULoginBtn')
    user_login.click()

    time.sleep(1)

    # 3. find_element_by_class_name
    user = driver.find_element_by_class_name('pass-text-input-userName')
    user.send_keys('*****')

    # 4. find_element_by_name
    pwd = driver.find_element_by_name('password')
    pwd.send_keys('*****')

    submit = driver.find_element_by_id('TANGRAM__PSP_10__submit')
    submit.click()

    # 5. find_element_by_partial_link_text: find by partial link text
    login_link = driver.find_element_by_partial_link_text('')
    login_link.click()

    # 6. find_element_by_css_selector: find by CSS selector
    # .  -> class
    # #  -> id
    login2_link = driver.find_element_by_css_selector('.tang-pass-footerBarULogin')
    login2_link.click()

    # 7. find_element_by_tag_name
    div = driver.find_elements_by_tag_name('div')
    print(div)

    time.sleep(20)
finally:
    # close the browser and release OS resources
    driver.close()

 


Origin www.cnblogs.com/sde12138/p/11123376.html