Crawler learning 08: Image lazy loading, Selenium, and PhantomJS for Python web crawlers

Introduction

Today's overview

  • Image lazy loading
  • Selenium
  • PhantomJS
  • Headless Chrome

Knowledge review

  • Verification code (CAPTCHA) processing flow

Today's details

Handling dynamically loaded data

A. Image lazy loading

  • What is image lazy loading?

    • Case study: crawl the image data from the Webmaster Material site http://sc.chinaz.com/

      #!/usr/bin/env python
      # -*- coding:utf-8 -*-
      import requests
      from lxml import etree

      if __name__ == "__main__":
          url = 'http://sc.chinaz.com/tupian/gudianmeinvtupian.html'
          headers = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
          # Fetch the page text
          response = requests.get(url=url, headers=headers)
          response.encoding = 'utf-8'
          page_text = response.text
          # Parse the page data (extract the image links in the page)
          # Create the etree object
          tree = etree.HTML(page_text)
          div_list = tree.xpath('//div[@id="container"]/div')
          # Extract each image's address and name
          for div in div_list:
              image_url = div.xpath('.//img/@src')
              image_name = div.xpath('.//img/@alt')
              print(image_url)   # print the image link
              print(image_name)  # print the image name

    • Looking at the results of the run, we can get the image names, but the links come back empty. On inspection the XPath expression is not the problem, so where does the cause lie?

    • The concept of image lazy loading:

      • Image lazy loading is a web page optimization technique. Images are a network resource, and like any ordinary static resource, requesting them consumes network resources. Loading every image on the page at once greatly increases the first-screen load time. To solve this, the front end and back end cooperate so that an image is loaded only when it appears in the current browser window. This technique of reducing the number of first-screen image requests is called "image lazy loading".
    • How do websites generally implement image lazy loading?

      • In the page source, the img tag first uses a "pseudo-attribute" (usually src2, original, ...) to store the real image link, rather than storing it directly in the src attribute. When the image enters the visible area of the page, the pseudo-attribute is dynamically swapped into the src attribute and the image is downloaded.
    • Follow-up analysis of the Webmaster Material case: careful inspection of the page structure shows that the image links are stored in the pseudo-attribute src2

      #!/usr/bin/env python
      # -*- coding:utf-8 -*-
      import requests
      from lxml import etree

      if __name__ == "__main__":
          url = 'http://sc.chinaz.com/tupian/gudianmeinvtupian.html'
          headers = {
              'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
          }
          # Fetch the page text
          response = requests.get(url=url, headers=headers)
          response.encoding = 'utf-8'
          page_text = response.text
          # Parse the page data (extract the image links in the page)
          # Create the etree object
          tree = etree.HTML(page_text)
          div_list = tree.xpath('//div[@id="container"]/div')
          # Extract each image's address and name
          for div in div_list:
              image_url = div.xpath('.//img/@src2')  # src2 is the pseudo-attribute
              image_name = div.xpath('.//img/@alt')
              print(image_url)   # print the image link
              print(image_name)  # print the image name
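
Since different sites use different pseudo-attribute names, the lookup can be made a little more tolerant. The following is only a sketch: the helper name pick_image_url and the attribute list are assumptions for illustration (only src2 and original are mentioned above), and the example.com URLs are placeholders.

```python
# Hypothetical helper: candidate pseudo-attribute names to try, in order.
# 'src2' and 'original' come from the text above; the others are assumptions.
LAZY_ATTRIBUTES = ('src2', 'original', 'data-original', 'data-src')


def pick_image_url(attributes):
    """Return the real image link from an <img> tag's attribute mapping."""
    for name in LAZY_ATTRIBUTES:
        value = attributes.get(name)
        if value:
            return value
    # No pseudo-attribute present: fall back to the ordinary src attribute
    return attributes.get('src')


# Plain dicts stand in for the attributes of two <img> tags
lazy_img = {'src2': 'http://example.com/real.jpg', 'alt': 'demo'}
plain_img = {'src': 'http://example.com/plain.jpg', 'alt': 'demo'}
print(pick_image_url(lazy_img))
print(pick_image_url(plain_img))
```

With lxml, the attrib mapping of each img element found by tree.xpath('.//img') can be passed to the helper in place of the dicts used here.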

Selenium: a brief introduction

Selenium was originally an automated testing tool. Crawlers use it mainly to solve the problem that requests cannot execute JavaScript code. In essence, Selenium drives a real browser and fully simulates browser operations such as page jumps, typing, clicking, and pulling down lists, and then obtains the rendered page. It supports multiple browsers.

Environment setup

  • Install selenium: pip install selenium
  • Download the browser driver:
    • http://chromedriver.storage.googleapis.com/index.html
  • Check the mapping between driver and browser versions:
    • http://blog.csdn.net/huilan_same/article/details/51896672

Simple usage / demonstration

from selenium import webdriver
from time import sleep

# The argument is the path to your browser driver; the r prefix makes it a raw string so backslashes are not treated as escapes
driver = webdriver.Chrome(r'path/to/your/chromedriver')  # placeholder path
# Open the Baidu page with get
driver.get("http://www.baidu.com")
# Find the "设置" (settings) option on the page and click it
driver.find_elements_by_link_text('设置')[0].click()
sleep(2)
# After opening the settings, find the "搜索设置" (search settings) option, used here to show 50 results per page
driver.find_elements_by_link_text('搜索设置')[0].click()
sleep(2)

# Select "show 50 results per page" (the third option of the 'nr' drop-down)
m = driver.find_element_by_id('nr')
sleep(2)
m.find_element_by_xpath('.//option[3]').click()
sleep(2)

# Click "save settings"
driver.find_elements_by_class_name("prefpanelgo")[0].click()
sleep(2)

# Handle the alert that pops up: accept() confirms, dismiss() cancels
driver.switch_to.alert.accept()
sleep(2)
# Find Baidu's input box and type 美女
driver.find_element_by_id('kw').send_keys('美女')
sleep(2)
# Click the search button
driver.find_element_by_id('su').click()
sleep(2)
# In the results page, find the link "美女_百度图片" and open it
driver.find_elements_by_link_text('美女_百度图片')[0].click()
sleep(3)

# Close the browser
driver.quit()

Creating a browser

Selenium supports many browsers, such as Chrome, Firefox, and Edge, as well as mobile browsers on Android and BlackBerry. It also supports the headless (no-interface) browser PhantomJS.

from selenium import webdriver
  
browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()
browser = webdriver.Safari()

Locating elements

webdriver provides a range of methods for locating elements. The commonly used ones are:

find_element_by_id()
find_element_by_name()
find_element_by_class_name()
find_element_by_tag_name()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_xpath()
find_element_by_css_selector()

Note

1. find_element_by_xxx finds the first tag that matches; find_elements_by_xxx finds all tags that match.

2. Whether an element is located by ID, CSS selector, or XPath, the returned result is exactly the same.

3. In addition, Selenium provides a general method find_element(), which takes two arguments: the lookup strategy By and the value. It is in effect the generic version of find_element_by_id(): for example, find_element_by_id(id) is equivalent to find_element(By.ID, id), and both return the same result.
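
The equivalence in note 3 can be sketched as a small lookup table. This matters in practice because the find_element_by_* helpers were removed in Selenium 4.3, leaving only the generic form. The helper find_by and the RecordingDriver stand-in are hypothetical names used for illustration. The strategy strings mirror the values of the constants in selenium.webdriver.common.by.By, so By.ID can be used wherever 'id' appears.

```python
# Legacy method suffix -> locator strategy string (same values as the By constants)
LEGACY_TO_BY = {
    'id': 'id',
    'name': 'name',
    'class_name': 'class name',
    'tag_name': 'tag name',
    'link_text': 'link text',
    'partial_link_text': 'partial link text',
    'xpath': 'xpath',
    'css_selector': 'css selector',
}


def find_by(driver, strategy, value, first_only=True):
    """Call the generic find_element()/find_elements() for a legacy strategy name."""
    by = LEGACY_TO_BY[strategy]
    if first_only:
        return driver.find_element(by, value)
    return driver.find_elements(by, value)


class RecordingDriver:
    """Hypothetical stand-in for a real driver; it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def find_element(self, by, value):
        self.calls.append(('find_element', by, value))
        return 'first match'

    def find_elements(self, by, value):
        self.calls.append(('find_elements', by, value))
        return ['first match', 'second match']


demo = RecordingDriver()
find_by(demo, 'id', 'kw')                               # like find_element_by_id('kw')
find_by(demo, 'class_name', 's_btn', first_only=False)  # like find_elements_by_class_name('s_btn')
print(demo.calls)
```

Passing a real webdriver instance instead of RecordingDriver should perform the same lookups against the live page.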

Node interaction

Selenium can drive the browser to perform operations, that is, make the browser simulate user actions. The most common ones are send_keys() for entering text, clear() for emptying a text field, and click() for clicking a button. For example:

from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
# Locate the search box (named search_box to avoid shadowing the built-in input())
search_box = browser.find_element_by_id('q')
search_box.send_keys('MAC')
time.sleep(1)
# Empty the box and type a different keyword
search_box.clear()
search_box.send_keys('IPhone')
# Locate the search button and click it
button = browser.find_element_by_class_name('btn-search')
button.click()
browser.quit()

Action Chain

In the example above, the interactions are performed on a specific node. For an input box, we call its text-input and clear methods; for a button, we call its click method. Some other operations have no specific target object, such as dragging with the mouse or pressing keyboard keys. These operations are performed another way: with an action chain.

For example, dragging one node onto another can be implemented like this:

from selenium import webdriver
from selenium.webdriver import ActionChains
import time
browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
# The demo elements live inside an iframe, so switch into it first
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
# Alternative: drag the source onto the target in a single step
# actions.drag_and_drop(source, target)
# actions.perform()  # run the action chain
# Queue a press-and-hold on the source; it runs with the first perform() below
actions.click_and_hold(source)
time.sleep(3)
for i in range(5):
    actions.move_by_offset(xoffset=17, yoffset=0).perform()
    time.sleep(0.5)

# Queued actions only run when perform() is called, so release the mouse explicitly
actions.release().perform()

JavaScript execution

The Selenium API does not provide every operation. Scrolling the page down, for example, has no dedicated method, but it can be simulated by running JavaScript directly with the execute_script() method, as follows:

from selenium import webdriver
 
browser = webdriver.Chrome()
browser.get('https://www.jd.com/')
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("123")')
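
When a page keeps loading data as it is scrolled, a fixed number of scrollTo calls may stop too early or too late. One possible refinement, sketched below, is to keep scrolling until document.body.scrollHeight stops growing. The function name scroll_to_bottom and its pause and max_rounds parameters are hypothetical; the only driver method used is execute_script(), as introduced above.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script('return document.body.scrollHeight')
    for round_number in range(1, max_rounds + 1):
        # Jump to the current bottom of the page, which may trigger more loading
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            return round_number  # nothing new was loaded, so stop
        last_height = new_height
    return max_rounds  # gave up after max_rounds scrolls
```

It would be called as scroll_to_bottom(browser) after browser.get(...). The max_rounds limit is there because some pages never stop loading new content.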

Getting the page source

The page_source attribute returns the source code of the web page, which can then be parsed with a parsing library (for example regular expressions, Beautiful Soup, or pyquery) to extract information.
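
As a minimal illustration of the regular-expression route, the sketch below pulls the title out of an HTML string. The sample_source string stands in for browser.page_source, and extract_title is a hypothetical helper name.

```python
import re


def extract_title(page_source):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r'<title[^>]*>(.*?)</title>', page_source,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


# Stand-in for browser.page_source
sample_source = '<html><head><title> Demo page </title></head><body></body></html>'
print(extract_title(sample_source))
```

For anything beyond simple patterns, a real HTML parser such as lxml (used earlier in this post) is usually more robust than regular expressions.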

Forward and backward

# Simulate the browser's back and forward navigation
import time
from selenium import webdriver
 
browser=webdriver.Chrome()
browser.get('https://www.baidu.com')
browser.get('https://www.taobao.com')
browser.get('http://www.sina.com.cn/')
 
browser.back()
time.sleep(10)
browser.forward()
browser.close()

Cookie Handling

With Selenium, cookies can also be handled easily, including getting, adding, and deleting them. For example:

from selenium import webdriver
 
browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
print(browser.get_cookies())
browser.add_cookie({'name': 'name', 'domain': 'www.zhihu.com', 'value': 'germey'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())
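
get_cookies() returns a list of dictionaries, each with at least a name and a value key. A common follow-up, sketched here under that assumption, is to flatten the list into a plain name-to-value dict that a requests call can reuse. The helper name cookies_to_dict and the sample data are hypothetical.

```python
def cookies_to_dict(selenium_cookies):
    """Flatten Selenium's list of cookie dicts into a {name: value} mapping."""
    return {cookie['name']: cookie['value'] for cookie in selenium_cookies}


# Sample data shaped like the return value of browser.get_cookies()
sample_cookies = [
    {'name': 'name', 'value': 'germey', 'domain': 'www.zhihu.com'},
    {'name': 'token', 'value': 'abc', 'path': '/'},
]
print(cookies_to_dict(sample_cookies))
```

The result can then be passed along as requests.get(url, cookies=cookies_to_dict(browser.get_cookies())), so that a login done in the browser carries over to ordinary requests.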

Exception Handling

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException, NoSuchFrameException

# Create the browser outside the try block so that 'browser' always exists in finally
browser = webdriver.Chrome()
try:
    browser.get('http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable')
    # The frame name appears to be misspelled on purpose, to trigger NoSuchFrameException
    browser.switch_to.frame('iframssseResult')

except TimeoutException as e:
    print(e)
except NoSuchFrameException as e:
    print(e)
finally:
    browser.close()

PhantomJS

PhantomJS is a headless (no-interface) browser. Automating it works the same way as automating Chrome. Because it has no interface, PhantomJS provides a screenshot function so that the automated process can still be observed; it is used through the save_screenshot function.

from selenium import webdriver
import time

# Path to the PhantomJS executable (placeholder)
path = r'path/to/your/phantomjs'
browser = webdriver.PhantomJS(path)

# Open Baidu
url = 'http://www.baidu.com/'
browser.get(url)

time.sleep(3)

browser.save_screenshot(r'phantomjs\baidu.png')

# Find the input box
my_input = browser.find_element_by_id('kw')
# Type text into the box
my_input.send_keys('美女')
time.sleep(3)
# Take a screenshot
browser.save_screenshot(r'phantomjs\meinv.png')

# Find the search button
button = browser.find_elements_by_class_name('s_btn')[0]
button.click()

time.sleep(3)

browser.save_screenshot(r'phantomjs\show.png')

time.sleep(3)

browser.quit()

Headless Chrome

Because PhantomJS is no longer updated or maintained, Headless Chrome is recommended instead. It is Google Chrome running without an interface.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

# Create an options object that makes Chrome start in headless (no-interface) mode
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Driver path
path = r'C:\Users\ZBLi\Desktop\1801\day05\ziliao\chromedriver.exe'

# Create the browser object (options= replaces the deprecated chrome_options= keyword)
browser = webdriver.Chrome(executable_path=path, options=chrome_options)

# Open the page
url = 'http://www.baidu.com/'
browser.get(url)
time.sleep(3)

browser.save_screenshot('baidu.png')

browser.quit()
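
On Selenium 4 the driver path is passed through a Service object rather than the executable_path keyword. The sketch below assumes Selenium 4 is installed. The function names headless_chrome_arguments and create_headless_chrome are hypothetical, and the Selenium imports sit inside the function so that the argument list can be inspected without a browser.

```python
def headless_chrome_arguments():
    """Command-line switches used above to start Chrome without a window."""
    return ['--headless', '--disable-gpu']


def create_headless_chrome(driver_path):
    """Build a headless Chrome driver, assuming the Selenium 4 style API."""
    # Imported here so the rest of the module works without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)
    # Selenium 4 takes the driver location via Service instead of executable_path
    return webdriver.Chrome(service=Service(driver_path), options=options)


print(headless_chrome_arguments())
```

Usage would look like browser = create_headless_chrome(path), followed by the same get, save_screenshot, and quit calls as in the example above.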

Log in to Qzone and crawl data

import requests
from selenium import webdriver
from lxml import etree
import time

driver = webdriver.Chrome(executable_path='/Users/bobo/Desktop/chromedriver')
driver.get('https://qzone.qq.com/')
# Web applications often nest pages inside frames. WebDriver can only locate elements in one page at a time,
# so elements inside a nested frame cannot be located directly. Use switch_to.frame() to switch the
# current context into the frame first.
driver.switch_to.frame('login_frame')
driver.find_element_by_id('switcher_plogin').click()

#driver.find_element_by_id('u').clear()
driver.find_element_by_id('u').send_keys('328410948')  # enter your QQ number here
#driver.find_element_by_id('p').clear()
driver.find_element_by_id('p').send_keys('xxxxxx')  # enter your QQ password here

driver.find_element_by_id('login_button').click()
time.sleep(2)
# Scroll to the bottom three times so that more feed items are loaded
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(2)
page_text = driver.page_source

tree = etree.HTML(page_text)
# Parse the page
li_list = tree.xpath('//ul[@id="feed_friend_list"]/li')
for li in li_list:
    text_list = li.xpath('.//div[@class="f-info"]//text()|.//div[@class="f-info qz_info_cut"]//text()')
    text = ''.join(text_list)
    print(text+'\n\n\n')

driver.close()

Crawl more movie information from Douban

from selenium import webdriver
import time

if __name__ == '__main__':
    url = 'https://movie.douban.com/typerank?type_name=%E6%81%90%E6%80%96&type=20&interval_id=100:90&action='
    # The page at this url can be made to load more data dynamically before its source is read
    path = r'C:\Users\Administrator\Desktop\爬虫授课\day05\ziliao\phantomjs-2.1.1-windows\bin\phantomjs.exe'
    # Create the headless browser object
    bro = webdriver.PhantomJS(path)
    # Request the url
    bro.get(url)
    time.sleep(3)
    # Take a screenshot
    bro.save_screenshot('1.png')

    # Run JS code that scrolls to the bottom of the page (effect: more movie information is loaded dynamically)
    js = 'window.scrollTo(0,document.body.scrollHeight)'
    bro.execute_script(js)  # executes JS code given as a string
    time.sleep(2)

    bro.execute_script(js)  # scroll again to load another batch
    time.sleep(2)
    bro.save_screenshot('2.png')
    time.sleep(2)
    # Grab the content of the current url
    html_source = bro.page_source  # the HTML source of the browser's current page
    with open('./source.html', 'w', encoding='utf-8') as fp:
        fp.write(html_source)
    bro.quit()

Evading detection of Selenium

Many large websites now have monitoring mechanisms that detect Selenium. For example, when a site such as Taobao is visited in a normal browser, window.navigator.webdriver is undefined. When the site is visited through Selenium, the value is true. So how can this be solved?

You only need to set a startup parameter for Chromedriver. Before starting Chromedriver, set Chrome's experimental option excludeSwitches to ['enable-automation']. The complete code is as follows:

from selenium.webdriver import Chrome
from selenium.webdriver import ChromeOptions

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
driver = Chrome(options=option)
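
Whether the option actually hides the flag can depend on the Chrome version, so it may be worth checking the value from inside the browser. The sketch below reads window.navigator.webdriver through execute_script(); the helper names automation_flag and looks_automated are hypothetical.

```python
def automation_flag(driver):
    """Return the value of window.navigator.webdriver as the page sees it."""
    return driver.execute_script('return window.navigator.webdriver')


def looks_automated(driver):
    """True only when the page reports navigator.webdriver as true."""
    # A JavaScript undefined comes back to Python as None, so compare with True explicitly
    return automation_flag(driver) is True
```

After driver.get(...) on any page, print(looks_automated(driver)) shows whether a site could still detect the session through this particular property.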

Homework

  • Crawl the news headlines and news content in the domestic section of NetEase News

Origin www.cnblogs.com/bky20061005/p/12173567.html