Web crawlers explained: the PhantomJS virtual browser and operating PhantomJS with the selenium module

PhantomJS virtual browser

PhantomJS is a headless browser built on the WebKit engine with JavaScript support; headless means it has no display interface. With this software you can obtain any content that a website loads through JavaScript, that is, content the browser loads asynchronously.

Download PhantomJS, unzip the archive, and move the extracted folder into the Python installation folder.

Then add the bin folder inside the PhantomJS folder to the system PATH environment variable.

Open cmd and enter the command: phantomjs. If PhantomJS starts instead of reporting an unknown command, the installation was successful.
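The PATH setup can also be checked from Python before starting a crawl. The following is a minimal sketch using only the standard library; the helper name find_phantomjs is hypothetical, and the executable name "phantomjs" is assumed to match the file in the bin folder.

```python
import shutil


def find_phantomjs(executable="phantomjs"):
    """Return the full path of the executable if it is on PATH, else None."""
    # shutil.which searches the directories listed in the PATH variable
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_phantomjs()
    if path is None:
        print("phantomjs was not found on PATH; check the environment variable")
    else:
        print("phantomjs found at:", path)
```

If the function returns None, the bin folder is most likely missing from PATH or the terminal was opened before the variable was changed.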

selenium is a Python module that can operate the PhantomJS software

selenium methods for operating PhantomJS

webdriver.PhantomJS() instantiates the PhantomJS browser object
get('url') visits the URL
find_element_by_xpath('xpath expression') finds the element that matches the XPath expression
clear() empties the contents of the input box
send_keys('content') types the content into the input box
click() clicks the element
get_screenshot_as_file('path and file name') takes a screenshot of the page and saves it to that path
page_source returns the HTML source of the page
quit() closes the PhantomJS browser

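#!/usr/bin/env python
# -*- coding:utf8 -*-
# Note: webdriver.PhantomJS and find_element_by_xpath exist only in older selenium releases (both were removed in selenium 4)
from selenium import webdriver  # import the selenium module to operate PhantomJS
import time
import re

llqdx = webdriver.PhantomJS()  # instantiate the PhantomJS browser object
llqdx.get("https://www.baidu.com/")  # visit the URL

# time.sleep(3)   # wait 3 seconds
# llqdx.get_screenshot_as_file('H:/py/17/img/123.png')  # save a screenshot of the page to this path

# simulate user actions
llqdx.find_element_by_xpath('//*[@id="kw"]').clear()                    # find the input box by XPath and empty it with clear()
llqdx.find_element_by_xpath('//*[@id="kw"]').send_keys('叫卖录音网')     # find the input box by XPath and type the search term with send_keys()
llqdx.find_element_by_xpath('//*[@id="su"]').click()                    # find the search button by XPath and click it with click()

time.sleep(3)   # wait 3 seconds for the results to load
llqdx.get_screenshot_as_file('H:/py/17/img/123.png')  # save a screenshot of the page to this path (selenium writes PNG data)

neir = llqdx.page_source   # get the page source
print(neir)
llqdx.quit()    # close the browser

pat = "<title>(.*?)</title>"
title = re.compile(pat).findall(neir)  # match the page title with a regular expression
print(title)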
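The title pattern used above only matches when the tag is lowercase, has no attributes, and sits on one line. The following is a minimal sketch of a more tolerant version that can be tried on a static string without starting a browser; the helper name extract_title and the sample HTML are made up for illustration.

```python
import re

# re.I ignores tag case; re.S lets "." match newlines inside the title
TITLE_PATTERN = re.compile(r"<title\b[^>]*>(.*?)</title>", re.I | re.S)


def extract_title(html):
    """Return the stripped text of the first <title> element, or None."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return None
    return match.group(1).strip()


# Hypothetical sample page; in the crawler this would be llqdx.page_source
sample = "<html><head><TITLE>\n  Example page\n</TITLE></head><body></body></html>"
print(extract_title(sample))  # prints: Example page
```

For anything beyond a single tag, an HTML parser is usually a safer choice than a regular expression.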

Disguising the PhantomJS browser and scrolling to load data

Some sites load their data dynamically, so you need to move the scroll bar down before the data is loaded

Implementation code

DesiredCapabilities the object used to disguise the browser
execute_script() executes JavaScript code
current_url returns the current URL
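Before the full script, here is a minimal sketch of how the disguise settings might be assembled into a capabilities dict. The helper build_phantomjs_caps is hypothetical, and the base dict is a stand-in assumed to resemble DesiredCapabilities.PHANTOMJS so that the sketch runs without selenium installed. The key phantomjs.page.settings.userAgent is the one used in this article; phantomjs.page.settings.loadImages follows the same naming scheme and should be treated as an assumption to verify against your PhantomJS version.

```python
def build_phantomjs_caps(base_caps, user_agent, load_images=True):
    """Return a copy of base_caps with the disguise settings applied."""
    caps = dict(base_caps)  # copy, so the shared defaults are not modified
    caps["phantomjs.page.settings.userAgent"] = user_agent
    # Assumed setting: skip image downloads to speed up page loads
    caps["phantomjs.page.settings.loadImages"] = load_images
    return caps


# Stand-in for DesiredCapabilities.PHANTOMJS (assumption, see text above)
base = {"browserName": "phantomjs"}
caps = build_phantomjs_caps(
    base,
    "Mozilla/5.0 (Windows NT 10.0; WOW64)",
    load_images=False,
)
print(caps)
```

In real use the returned dict would be passed as desired_capabilities when instantiating the browser, as the script below does with dcap.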

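#!/usr/bin/env python
# -*- coding:utf8 -*-
# Note: webdriver.PhantomJS exists only in older selenium releases (it was removed in selenium 4)
from selenium import webdriver  # import the selenium module to operate PhantomJS
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities   # import the browser disguise module
import time
import re

dcap = dict(DesiredCapabilities.PHANTOMJS)  # copy the default PhantomJS capabilities
# replace the default user agent with that of an ordinary desktop browser
dcap['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0')
print(dcap)
llqdx = webdriver.PhantomJS(desired_capabilities=dcap)  # instantiate the PhantomJS browser object

llqdx.get("https://www.jd.com/")  # visit the URL

# simulate user actions: scroll down one 1280px step per second so the dynamic data loads
for j in range(20):
    js3 = 'window.scrollTo(0,' + str((j + 1) * 1280) + ')'  # x stays 0; y is the vertical offset
    llqdx.execute_script(js3)  # run the JavaScript to move the scroll bar
    time.sleep(1)

llqdx.get_screenshot_as_file('H:/py/17/img/123.png')  # save a screenshot of the page to this path (selenium writes PNG data)

url = llqdx.current_url  # get the current URL
print(url)

neir = llqdx.page_source   # get the page source
print(neir)
llqdx.quit()    # close the browser

pat = "<title>(.*?)</title>"
title = re.compile(pat).findall(neir)  # match the page title with a regular expression
print(title)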
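The loop above always scrolls 20 times, whether or not the page has more data. As an alternative, here is a minimal sketch that keeps scrolling until the page height stops growing. The helper scroll_to_bottom and the FakeDriver class are hypothetical; FakeDriver only imitates the execute_script() method so the sketch runs without a browser, and a real selenium browser object such as llqdx would be passed in its place.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing or max_rounds is hit.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        rounds += 1
        time.sleep(pause)  # give the page time to load more data
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the bottom was reached
        last_height = new_height
    return rounds


class FakeDriver:
    """Stand-in for a selenium driver: the page grows on the first 3 scrolls."""

    def __init__(self):
        self.height = 1000
        self.grow_left = 3

    def execute_script(self, script):
        if script.startswith("return"):
            return self.height
        if self.grow_left > 0:
            self.height += 1000
            self.grow_left -= 1
        return None


print(scroll_to_bottom(FakeDriver(), pause=0))  # prints: 4
```

The max_rounds limit is kept so that a page with endless scrolling cannot trap the crawler in the loop.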


Origin blog.51cto.com/14510224/2435245