Understanding the basic principles and applications of crawling most website data with selenium+phantomjs

1, Preface: Why can selenium+phantomjs obtain most website data?

Reason: For many websites, the response obtained by the ordinary requests module is almost entirely JS code, so the page data cannot be extracted with xpath or similar methods.

Selenium+phantomjs can handle this. It loads the page in a browser and runs the JS code in the response, so the fully rendered data can be obtained.
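
A minimal sketch of the problem, using only the standard library. The HTML string is a hypothetical stand-in for what a plain HTTP request might return from a JS-rendered page; the `book-title` class, the `books` id and the script contents are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical raw response: the list stays empty until a browser runs the script.
RAW_RESPONSE = """
<html><body>
  <ul id="books"></ul>
  <script>
    var titles = ["Fluent Python", "Python Cookbook"];
    // a browser would run code here that inserts one list item per title
  </script>
</body></html>
"""


class TitleCollector(HTMLParser):
    """Collect the text of every <li class="book-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "book-title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    """Parse html and return the book titles found in the markup."""
    parser = TitleCollector()
    parser.feed(html)
    return parser.titles


# The data exists only inside the script text, so nothing is extracted.
print(extract_titles(RAW_RESPONSE))
```

The same extractor does find titles once the markup has been rendered, which is the state a browser-driven tool hands back.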

2, Understanding selenium and phantomjs

Selenium: a web automation testing tool. It has no browser functionality of its own, so it needs a browser to drive.

phantomjs: headless browser

What is a browser?

(1) Load the page; (2) Display html; (3) View xml.

3, Installing and using selenium and phantomjs:

    Download and install chromedriver: search for a chromedriver mirror on Baidu and download the version that matches your Chrome browser.
    Download and install phantomjs: search for a phantomjs mirror on Baidu.
    The installation is very simple: find the two exe files and copy them to: C:\Anaconda3\Scripts
    Check whether it succeeded: open cmd and enter each command in turn; if no error is reported, the setup is successful.
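
The check can also be done from Python. This is a small sketch that assumes the two executables are named `chromedriver` and `phantomjs`; adjust the names if your downloads differ.

```python
import shutil


def missing_executables(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names, not taken from the original text.
missing = missing_executables(["chromedriver", "phantomjs"])
print("missing:", missing)
```

An empty list means both programs are reachable from the command line.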

4, Using selenium:
    
    import time

    from selenium import webdriver

    # 1, define a browser driver
    driver = webdriver.Chrome()
    # 2, request the url
    url = 'https://book.douban.com/subject_search?search_text=python&cat=1001'
    driver.get(url)
    # 3, wait so the page's JS has time to load the data
    time.sleep(3)
    # 4, get the rendered page content
    print(driver.page_source)
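
The four steps above can be wrapped in a small helper that always closes the browser. This is a sketch, not the author's code: `fetch_rendered_html` and `make_headless_chrome` are hypothetical names. Headless Chrome is used here as an alternative to PhantomJS, which is no longer actively developed, and the `--headless` flag is assumed to be supported by your Chrome version.

```python
import time


def fetch_rendered_html(driver, url, settle_seconds=3):
    """Load url in the given driver and return the rendered page source.

    The driver is always closed afterwards, even if loading fails.
    """
    try:
        driver.get(url)
        time.sleep(settle_seconds)  # fixed pause so the page's JS can run
        return driver.page_source
    finally:
        driver.quit()


def make_headless_chrome():
    """Create a Chrome driver without a visible window (needs chromedriver)."""
    from selenium import webdriver  # imported here so the helper above has no hard dependency

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)
```

A typical call would be `fetch_rendered_html(make_headless_chrome(), url)`.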

5, The main methods of locating UI elements are:
        find_element_by_id
        find_elements_by_xpath
        find_elements_by_css_selector

    driver.find_element_by_id('inp-query') returns a WebElement object; the find_elements_* methods return a list of WebElement objects.
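
Newer Selenium releases (4.x) replace the `find_element_by_*` helpers with `find_element(by, value)` and `find_elements(by, value)`. The sketch below shows how the three methods above map onto that form. `find_one`, `find_all` and the `STRATEGY` table are hypothetical helpers, and the strategy strings are assumed to match the values of `By.ID`, `By.XPATH` and `By.CSS_SELECTOR` in your installed version.

```python
# Locator strategy strings; in selenium these are the values of
# By.ID, By.XPATH and By.CSS_SELECTOR.
STRATEGY = {"id": "id", "xpath": "xpath", "css": "css selector"}


def find_one(driver, kind, value):
    """Return the first matching element, like the old find_element_by_* calls."""
    return driver.find_element(STRATEGY[kind], value)


def find_all(driver, kind, value):
    """Return a list of matching elements, like the old find_elements_by_* calls."""
    return driver.find_elements(STRATEGY[kind], value)
```

For example, `find_one(driver, "id", "inp-query")` corresponds to `driver.find_element_by_id('inp-query')`.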

Document: Summary of common methods of selenuim.note
    Link: http://note.youdao.com/noteshare?id=0142a95cf23fadbaea95809ccb5674b2&sub=02896A50836E4995997A821419D9A063

6, Wait

Definition: What is wait? Why use wait?

In a Selenium crawler, waiting means pausing until the requested elements have appeared in the web page, so that the next step of the program can obtain the desired data. Most web pages now use AJAX, so after the browser loads a page, its elements may appear at different times. This makes locating elements difficult, because they are not all loaded at once. If an element is not yet in the current DOM, a NoSuchElementException is raised when locating it. Using waits solves this problem.

Category:
(1) Forced waiting:
    time.sleep(3)
    The program is forced to wait here for 3 seconds. Whether or not the page has finished loading, the following code runs only after the 3 seconds are up.
(2) Implicit wait
    driver.implicitly_wait(30)
    The 30 here is the maximum waiting time in seconds.
    Implicit waiting means that whenever an element is looked up and is not immediately present, the driver keeps polling the DOM for up to 30 seconds; as soon as the element appears, the following code is executed.
    If the element has still not appeared within this time, the lookup fails with NoSuchElementException.
    Disadvantages of implicit waiting:
        It is set once and applies to every element lookup for the life of the driver, and it only waits for an element to be present, so it cannot wait for a specific condition such as an element becoming clickable.
(3) Explicit wait
    WebDriverWait(driver, 20).until(a) -- wait until condition a is satisfied
    WebDriverWait(driver, 20).until(EC.presence_of_element_located(locator))
        EC -- expected_conditions: Selenium's built-in module of condition checks

    EC.presence_of_element_located((By.CSS_SELECTOR, '.ui-page > wrap'))
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.ui-page'))
        Both conditions check whether elements are present in the DOM. The parameter passed in is a locator tuple, such as (By.ID, 'kw').
        The first passes as soon as one matching element is present and returns that element;
        the second passes once at least one matching element is present and returns a list of all matching elements.
    EC.element_to_be_clickable
        This condition checks whether the element is visible and enabled, so that it can be clicked. Pass in a locator.
    Modules that need to be imported for explicit waits:
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By
    
    Steps for an explicit wait:
        (1) wait = WebDriverWait(
                    driver,  # the browser driver
                    20,      # the maximum waiting time in seconds
                    0.5,     # the polling interval in seconds; the default is 0.5
                )
        (2) wait.until(EC.presence_of_element_located(locator)) ---> WebElement object ---> corresponds to the element specified by locator

            locator = (type, value)
                The type can be given in three ways: By.ID, By.CSS_SELECTOR, By.XPATH

        Implicit waiting and explicit waiting can be used at the same time, but note that the longest waiting time is then roughly the larger of the two, and the Selenium documentation warns that mixing them can cause unpredictable wait times.
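
To make the polling behaviour concrete, here is a pure-Python sketch of the idea behind `wait.until(...)`: call a condition every `poll_frequency` seconds until it returns something truthy or `timeout` seconds pass. `poll_until` is a hypothetical helper, not Selenium's implementation (the real `WebDriverWait` also ignores `NoSuchElementException` while polling and raises its own `TimeoutException`), and the "page" is simulated with a counter.

```python
import time


def poll_until(condition, timeout=20, poll_frequency=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)


# Simulated page: the element "appears" on the third check.
attempts = {"count": 0}


def element_is_present():
    attempts["count"] += 1
    return "<div id='kw'>" if attempts["count"] >= 3 else None


print(poll_until(element_is_present, timeout=2, poll_frequency=0.1))
```

Unlike a forced wait, the loop returns as soon as the condition holds, so the maximum time is only spent when the element never appears.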
    
    
    


Origin blog.csdn.net/Smile_Lai/article/details/101715205