Python crawlers: headless (no-interface) crawling with PhantomJS (Quick Start)

One: Basics

1. PhantomJS: a browser with no interface

PhantomJS is a WebKit-based "headless" (no-interface) browser: it loads a site into memory and
executes the JavaScript on the page, and because it displays no graphical interface, it runs more
efficiently than a full browser.
If we combine Selenium with PhantomJS, we get a very powerful web crawler, one that can handle
JavaScript, cookies, headers, and anything else a real user would need to do.
Note: PhantomJS can be downloaded from its official website (http://phantomjs.org/download.html). Because
PhantomJS is a fully functional (although interface-less) browser rather than a Python library, it is not
installed like other Python packages; instead, Selenium calls the PhantomJS executable directly.
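To see why "executes JavaScript" matters for crawling, consider a page whose content is injected by a script. A plain HTTP client (urllib, requests) only ever sees the static markup the server sends; only a browser engine such as PhantomJS runs the script and fills the page in. A tiny self-contained illustration (no network, the HTML string is made up for the example):

```python
# The markup a server might send for a JavaScript-driven page.
# A plain HTTP fetch returns exactly this string, script unexecuted.
static_html = """
<html><body>
  <div id="content"></div>
  <script>
    document.getElementById("content").textContent = "rendered by JS";
  </script>
</body></html>
"""

# In the raw source the div is empty, so a non-browser crawler scrapes
# nothing from it; a headless browser would see the filled-in text.
div_is_empty = '<div id="content"></div>' in static_html
print("static div is empty:", div_is_empty)
```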

2. Selenium

Selenium is a web automation testing tool, originally developed for automated site testing. It is a bit
like the button-wizard macro tools we use in games: it can carry out specified commands automatically. The
difference is that Selenium runs directly in the browser, and it supports all major browsers (including
interface-less ones such as PhantomJS).
Following our instructions, Selenium can make the browser automatically load pages and fetch the data we
need, take page screenshots, or check whether certain actions have occurred on a site.
Selenium itself ships without a browser and has no browser functionality; it must be used together with a
third-party browser. But sometimes we need it to run embedded in our code, in which case we can use a tool
such as PhantomJS in place of a real browser.
The Selenium library can be downloaded from PyPI (https://pypi.python.org/simple/selenium), or installed
with the pip package manager: pip install selenium==2.48.0
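Before running the examples below, it can help to confirm that the package actually imports. A small stdlib-only check (it makes no Selenium calls, so it works even before installation):

```python
import importlib.util

def selenium_available() -> bool:
    """Return True if the 'selenium' package can be imported."""
    return importlib.util.find_spec("selenium") is not None

print("selenium installed:", selenium_available())
```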

 

Two: Getting Started

The Selenium library includes an API called WebDriver. WebDriver is a bit like a browser in that it can load websites, but it can also look up page elements much as BeautifulSoup or other selector objects do, interact with elements on the page (send text, click, etc.), and perform other actions to drive a web crawler.
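The core lookup idea used throughout the walkthrough below, find a node by its id and read its text, can be sketched without any browser at all, using only the standard-library HTML parser (a conceptual illustration, not Selenium's implementation):

```python
from html.parser import HTMLParser

class TextById(HTMLParser):
    """Collect the text inside the first element whose id matches."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0          # > 0 while we are inside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1     # nested tag inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1      # entered the target element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

page = '<div id="wrapper"><p>hello</p> world</div>'
parser = TextById("wrapper")
parser.feed(page)
print("".join(parser.parts).strip())  # -> hello world
```

With Selenium, `driver.find_element_by_id("wrapper").text` does the same job, but against the live, JavaScript-rendered page.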

# Import webdriver
from selenium import webdriver
import time  # used to leave enough time for the server to respond

# The Keys class is needed to simulate keyboard key presses
from selenium.webdriver.common.keys import Keys

# Create a browser object using the PhantomJS binary found via the
# environment's PATH
driver = webdriver.PhantomJS()

# If the PhantomJS location is not on the PATH, point at the executable:
# driver = webdriver.PhantomJS(executable_path=r'd:\Desktop\phantomjs-2.1.1-windows\bin\phantomjs.exe')


# get() waits until the page is fully loaded before the program continues
# (in testing, a time.sleep(2) is often added here as well)
driver.get("http://www.baidu.com")

# Get the text of the element whose id is "wrapper"
data = driver.find_element_by_id("wrapper").text

# Print the data
print(data)

# Print the page title: "百度一下,你就知道"
print(driver.title)

# Save a snapshot of the current page
driver.save_screenshot('01.png')

# id="kw" is Baidu's search box; type the string "长城" (Great Wall)
driver.find_element_by_id("kw").send_keys("长城")

# id="su" is Baidu's search button; click() simulates a click
driver.find_element_by_id("su").click()

# time.sleep(2)  # allow time for the server to respond
# Take a snapshot of the new page
driver.save_screenshot('02.png')

# Print the page source after rendering
# print(driver.page_source)

# Get the cookies of the current page
print(driver.get_cookies())

# Ctrl+A: select all content in the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'a')
# driver.save_screenshot('03.png')

# Ctrl+X: cut the content of the input box
driver.find_element_by_id("kw").send_keys(Keys.CONTROL, 'x')

# Type new content into the input box
driver.find_element_by_id("kw").send_keys("科比")
driver.save_screenshot("04.png")

# Simulate pressing the Enter key
# time.sleep(6)  # add a delay, otherwise this may raise:
# "Element is not currently interactable and may not be manipulated"
# driver.find_element_by_id("su").send_keys(Keys.RETURN)
# Note: pressing Enter here has no effect, because the input box is
# inside a form — submit the form to send the request instead
driver.find_element_by_id("su").submit()
time.sleep(2)
driver.save_screenshot('05.png')

# Close the browser and release the PhantomJS process
driver.quit()

Origin blog.csdn.net/weixin_43567965/article/details/90244665