pyppeteer: a powerful tool for Python crawlers

Introduction

pyppeteer is an unofficial Python port of Puppeteer, a browser automation library, developed by a Japanese engineer.

Puppeteer is a Node.js tool developed by Google. It drives Chrome through its API, using JavaScript code to control the browser for web crawling and automated testing of web applications.

pyppeteer is built on Python's asynchronous coroutine library asyncio and can be integrated with Scrapy for distributed crawling.

Note that pyppeteer is not well maintained. (On the name: a puppet is the figure being manipulated; a puppeteer is the person who manipulates the puppets.)


Installation

1. Install pyppeteer
pip install pyppeteer

 

2. Install Chromium

pyppeteer-install

 

Note: Chromium is downloaded automatically the first time pyppeteer runs (Chromium is the open-source browser that Chrome is built on; the download is about 150 MB).

If the automatic Chromium download fails, you can download it manually.
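You can also drive the downloader that pyppeteer uses internally from Python. A minimal sketch, assuming pyppeteer's chromium_downloader module (its check_chromium() and download_chromium() helpers):

from pyppeteer import chromium_downloader

# Check whether the pinned Chromium revision is already on disk
if not chromium_downloader.check_chromium():
    # Download and unpack Chromium into pyppeteer's local-chromium folder
    chromium_downloader.download_chromium()

# Path of the executable pyppeteer will launch
print(chromium_downloader.chromium_executable())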


3. View Chromium storage path

import pyppeteer

print(pyppeteer.__chromium_revision__)  # View the Chromium revision number
print(pyppeteer.executablePath())  # View the Chromium storage path

# 588429
# C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32\chrome.exe

If you downloaded Chromium manually, unzip it to C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\ and rename the chrome-win folder to chrome-win32.


For configuration details, see Pyppeteer Environment Variables.
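As a quick illustration, the variables documented there (PYPPETEER_HOME, PYPPETEER_DOWNLOAD_HOST, PYPPETEER_CHROMIUM_REVISION) must be set before pyppeteer is imported; the values below are only examples:

import os

# Set these before importing pyppeteer, since the module reads them at import time
os.environ['PYPPETEER_HOME'] = r'D:\pyppeteer'  # where Chromium is stored (example path)
os.environ['PYPPETEER_DOWNLOAD_HOST'] = 'https://npm.taobao.org/mirrors'  # example download mirror
os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'  # Chromium revision to use

import pyppeteer
print(pyppeteer.executablePath())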

First test

Open Baidu and take a screenshot

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=False)  # headless=False shows the browser window
    page = await browser.newPage()
    await page.goto('https://www.baidu.com/')  # navigate
    await page.screenshot({'path': 'example.png'})  # take a screenshot
    await browser.close()  # close the browser

asyncio.get_event_loop().run_until_complete(main())

Specify browser path

Specify the executablePath parameter:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=False, executablePath=r'C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32\chrome.exe')  # launch a specific Chromium binary
    page = await browser.newPage()
    await page.goto('https://www.baidu.com/')  # navigate
    await page.screenshot({'path': 'example.png'})  # take a screenshot
    await browser.close()  # close the browser

asyncio.get_event_loop().run_until_complete(main())

Remove the "Chrome is being controlled by automated test software" infobar

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=False, ignoreDefaultArgs=['--enable-automation'])  # drop the --enable-automation default argument
    input()
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())


Full screen

import tkinter
import asyncio
from pyppeteer import launch

def screen_size():
    tk = tkinter.Tk()
    width = tk.winfo_screenwidth()
    height = tk.winfo_screenheight()
    tk.quit()
    return {'width': width, 'height': height}

async def main():
    browser = await launch(headless=False, args=['--start-maximized'])  # maximize the browser window
    page = await browser.newPage()
    await page.setViewport(screen_size())  # make the page content fill the screen
    await page.goto('https://www.baidu.com/')
    input()
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Page content

Use page.content() or page.evaluate() to get the page content.

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=False)
    page = await browser.newPage()
    url = 'https://www.baidu.com/'
    await page.goto(url)
    # content = await page.content()  # full HTML source of the page
    content = await page.evaluate('document.body.textContent', force_expr=True)  # evaluate a JS expression in the page
    print(content)
    input()
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Run asynchronously

Use asyncio.wait() or asyncio.gather() to crawl several pages concurrently. This is recommended only for pages that need to be read once; it is not recommended for pages that require scrolling.

import asyncio
from pyppeteer import launch

async def crawl(url):
    browser = await launch(headless=False)
    page = await browser.newPage()
    await page.goto(url)
    title = await page.title()
    print(title)
    await browser.close()

async def main():
    urls = [
        crawl('https://www.baidu.com/'),
        crawl('https://www.bing.com/')
    ]
    await asyncio.wait(urls)
    # await asyncio.gather(*urls)

asyncio.get_event_loop().run_until_complete(main())
# 百度一下,你就知道
# 微软 Bing 搜索 - 国内版
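Note: on newer Python versions asyncio.wait() no longer accepts bare coroutines (deprecated in 3.8, removed in 3.11), so wrap them in tasks or prefer asyncio.gather(). A minimal adjustment of main() above:

async def main():
    urls = ['https://www.baidu.com/', 'https://www.bing.com/']
    # asyncio.gather() accepts coroutines directly on all Python versions
    await asyncio.gather(*(crawl(url) for url in urls))
    # or: await asyncio.wait([asyncio.ensure_future(crawl(url)) for url in urls])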

Error: OSError: Unable to remove Temporary User Data

When launching the browser, specify the userDataDir parameter to set where the cache and user data are stored; make sure the disk has plenty of space and is preferably not the system drive.

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=False, userDataDir='./cache/')  # store browser user data in ./cache/
    input()
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Error: pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.
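A common workaround, sketched below under the assumption that the page simply needs more than the default 30 seconds to load, is to raise or disable the navigation timeout:

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=False, userDataDir='./cache/')
    page = await browser.newPage()
    # Raise the default navigation timeout from 30 s to 60 s (pass 0 to disable it)
    page.setDefaultNavigationTimeout(60000)
    # The timeout (in milliseconds) can also be passed per navigation
    await page.goto('https://www.baidu.com/', {'timeout': 60000})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())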



 

Putting it together

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch(headless=False, ignoreDefaultArgs=['--enable-automation'], userDataDir='./cache/')  # hide the automation infobar and persist user data
    page = await browser.newPage()
    await page.setViewport({'width': 1366, 'height': 768})  # make the content fill the viewport
    await page.goto('https://www.baidu.com/')  # navigate
    input('Press Enter to exit')
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

References

  1. pyppeteer/pyppeteer: Headless chrome/chromium automation library
  2. Pyppeteer Documentation
  3. Chromium - The Chromium Projects
  4. Pyppeteer Environment Variables
  5. Pyppeteer environment setup, common parameters, and two examples
  6. Pyppeteer bugs encountered and solutions
  7. pyppeteer sample code for crawling JD.com and Taobao: getting cookies and crawling search results
  8. pyppeteer tutorial
  9. pyppeteer: Solve the problem of OSError: Unable to remove Temporary User Data
  10. Solve the problem of pyppeteer navigation timeout: pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.
  11. Python asynchronous programming with asyncio (millions of concurrent connections)
  12. Getting started with Python asynchronous programming

Original article: blog.csdn.net/zhangge3663/article/details/108201064