Exercise scenario: Some fields are of interest to us on a page, we hope extirpated, other operations. However, these fields may be different parts of the web page. For example, we require the removal of all the mailbox on the page on Baidu.
Split thinking:
1. First, you need to get source contents of the current page, for example, to open a page, right - to view the page source code.
2. find out the law, by the removal of a regular expression to match the field, stored in a dictionary or list.
3. Cycle dictionary or print the contents of the list, Python used for statement to achieve.
Technically implement relevant methods:
1. view the source code of the page, there is a method in Selenium in this drive.page_source obtained;
2.Python with Regular, need REVIEW re module;
3.for email in emails:
print email
First, the specific code:
Selenium the webdriver Import from Import Re Driver = webdriver.Chrome () driver.maximize_window () driver.implicitly_wait (. 6) driver.get ( "http://home.baidu.com/contact.html") # get page source doc driver.page_source = emails = the re.findall (R & lt '[\ W] + @ [\ W \ .-] +', DOC) # regular expressions, @ .XXX.XXX find XXX for in emails in Email: # circulation print matching mailbox print (email)
Explanation:
In the regular expression syntax python, Python string preceded by r represents the original string, by \ w represents the matching alphanumeric and underlined. The re module findall method returns a list of matching substring.
operation result:
Second, the advanced version
from selenium import webdriver import re,time,pprint,xlwt wb = xlwt.Workbook() ws = wb.add_sheet('E-mails') driver = webdriver.Chrome() driver.maximize_window() time.sleep(2) driver.get('https://www.baidu.com/') driver.find_element_by_xpath("//*[@class='lh']/a[text()='关于百度']").click() print(driver.current_window_handle) handles = driver.window_handles for handle in handles: if handle != driver.current_window_handle: driver.close() print("马上切换到新页面", handle) driver.switch_to.window(handle) driver.find_element_by_xpath("//*[@id='indexAdmin']/div[1]/div/div/div/div[2]/ul/li[4]/a").click() # / HTML / body / div [. 1] / div / div / div / div [2] / UL / Li [. 4] / A print(driver.current_window_handle) = driver.window_handles Handles Print (Handles) for handle in Handles: IF = driver.current_window_handle handle:! driver.close () # close the first window print ( 'immediately switch to the new tab', handle) driver.switch_to. window (handle) # switch to the second window # page source code obtained DOC = driver.page_source emiles the re.findall = (R & lt '[\ W] + @ [\ W \ .-] +', DOC) for index, in the enumerate Emile (emiles): ws.write (index, 0, Emile) Print (Emile) wb.save ( 'Baidu Contact email .xls') Print ( 'extraction is complete') driver.close ()
operation result:
Reference article: https://blog.csdn.net/u011541946/article/details/68485981