Introduction
pyppeteer is an unofficial Python port of Puppeteer, a browser automation library. The port was developed by Japanese engineers.
Puppeteer is a Node.js library developed by Google. It drives Chrome through the browser's API, using JavaScript code to carry out tasks such as web crawling and automated testing of web applications.
pyppeteer is built on asyncio, Python's asynchronous coroutine library, and can be combined with Scrapy for distributed crawling.
Note that pyppeteer is not well maintained. On the names: a puppet is a marionette, and a puppeteer is the person who manipulates it.
Installation
1. Install pyppeteer
pip install pyppeteer
2. Install Chromium
pyppeteer-install
Note: Chromium (the open-source browser project behind Chrome, about 150MB) is downloaded automatically the first time pyppeteer runs.
If the Chromium installation fails, you can download it manually.
3. View the Chromium storage path
import pyppeteer

print(pyppeteer.__chromium_revision__)  # Chromium revision number
print(pyppeteer.executablePath())  # path where the Chromium executable is stored
# 588429
# C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32\chrome.exe
If you downloaded Chromium manually, unzip it to C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\ and rename the extracted folder from chrome-win to chrome-win32 so that it matches the path printed above.
For configuration details, see Pyppeteer Environment Variables.
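As a rough sketch of the kind of configuration that page covers: pyppeteer is documented to read the PYPPETEER_HOME, PYPPETEER_CHROMIUM_REVISION and PYPPETEER_DOWNLOAD_HOST environment variables, and they need to be set before pyppeteer is imported. The helper name configure_pyppeteer_env and the example directory below are assumptions made for illustration; only the revision number is taken from the output above.

```python
import os


def configure_pyppeteer_env(home=None, revision=None, download_host=None):
    """Set pyppeteer's environment variables; call before `import pyppeteer`.

    Hypothetical helper: only the variable names come from pyppeteer's docs.
    """
    settings = {
        'PYPPETEER_HOME': home,  # directory where Chromium is stored
        'PYPPETEER_CHROMIUM_REVISION': revision,  # Chromium revision to use
        'PYPPETEER_DOWNLOAD_HOST': download_host,  # mirror for the download
    }
    applied = {}
    for name, value in settings.items():
        if value is not None:  # leave unset variables untouched
            os.environ[name] = str(value)
            applied[name] = str(value)
    return applied


# Example values are placeholders; adjust them to your own machine.
applied = configure_pyppeteer_env(home='./pyppeteer_home', revision=588429)
print(applied)
```

Returning the applied settings makes it easy to log what was changed before the browser is launched.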
First test
Open Baidu and take a screenshot.
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(headless=False)  # headless=False shows the browser window
    page = await browser.newPage()
    await page.goto('https://www.baidu.com/')  # navigate to the page
    await page.screenshot({'path': 'example.png'})  # save a screenshot
    await browser.close()  # close the browser


asyncio.get_event_loop().run_until_complete(main())
Specify browser path
Pass the executablePath parameter to launch().
import asyncio
from pyppeteer import launch


async def main():
    # headless=False shows the browser window; executablePath selects the browser binary
    browser = await launch(headless=False, executablePath=r'C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32\chrome.exe')
    page = await browser.newPage()
    await page.goto('https://www.baidu.com/')  # navigate to the page
    await page.screenshot({'path': 'example.png'})  # save a screenshot
    await browser.close()  # close the browser


asyncio.get_event_loop().run_until_complete(main())
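A hard-coded path stops working as soon as the script runs on another machine. The sketch below passes executablePath only when the file actually exists, and otherwise leaves the key out so that pyppeteer falls back to its own Chromium. build_launch_options is a hypothetical helper, not part of pyppeteer.

```python
from pathlib import Path


def build_launch_options(executable_path=None, headless=False):
    """Return keyword arguments for pyppeteer's launch().

    Hypothetical helper: executablePath is included only if the file exists,
    so a stale path does not stop the browser from starting.
    """
    options = {'headless': headless}
    if executable_path and Path(executable_path).is_file():
        options['executablePath'] = str(executable_path)
    return options


# A path that does not exist is dropped from the options.
print(build_launch_options('no/such/chrome.exe'))  # {'headless': False}
```

Inside main() the result would be used as browser = await launch(**build_launch_options(path)).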
Remove the "Chrome is being controlled by automated test software" banner
import asyncio
from pyppeteer import launch


async def main():
    # Dropping the default --enable-automation argument hides the banner
    browser = await launch(headless=False, ignoreDefaultArgs=['--enable-automation'])
    input()  # keep the browser open until Enter is pressed
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
Full screen
import asyncio
import tkinter
from pyppeteer import launch


def screen_size():
    """Return the screen resolution as a viewport dict."""
    tk = tkinter.Tk()
    width = tk.winfo_screenwidth()
    height = tk.winfo_screenheight()
    tk.destroy()  # destroy the temporary Tk root window
    return {'width': width, 'height': height}


async def main():
    browser = await launch(headless=False, args=['--start-maximized'])  # maximize the browser window
    page = await browser.newPage()
    await page.setViewport(screen_size())  # make the page content fill the screen
    await page.goto('https://www.baidu.com/')
    input()  # keep the browser open until Enter is pressed
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
Page content
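Use Page.content() to get the page's HTML, or Page.evaluate() to run a JavaScript expression in the page.
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(headless=False)
    page = await browser.newPage()
    url = 'https://www.baidu.com/'
    await page.goto(url)
    # content = await page.content()  # full HTML of the page
    # force_expr=True evaluates the string as an expression rather than a function
    content = await page.evaluate('document.body.textContent', force_expr=True)
    print(content)
    input()  # keep the browser open until Enter is pressed
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())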
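Page.content() returns the HTML as an ordinary string, so it can be post-processed with plain Python. The following sketch uses the standard library's html.parser on a hard-coded sample string that stands in for the result of page.content(). TitleParser, extract_title and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Return the page title with surrounding whitespace removed."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


# Stand-in for: content = await page.content()
sample = '<html><head><title> Example page </title></head><body>Hi</body></html>'
print(extract_title(sample))  # Example page
```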
Run asynchronously
Use asyncio.wait() or asyncio.gather() to run several crawls concurrently. This is recommended only for pages that need to be read once, not for pages that require scrolling.
import asyncio
from pyppeteer import launch


async def crawl(url):
    browser = await launch(headless=False)
    page = await browser.newPage()
    await page.goto(url)
    title = await page.title()
    print(title)
    await browser.close()


async def main():
    # Wrap the coroutines in tasks: asyncio.wait() rejects bare coroutines since Python 3.11
    tasks = [
        asyncio.ensure_future(crawl('https://www.baidu.com/')),
        asyncio.ensure_future(crawl('https://www.bing.com/')),
    ]
    await asyncio.wait(tasks)
    # await asyncio.gather(*tasks)


asyncio.get_event_loop().run_until_complete(main())
# 百度一下,你就知道
# 微软 Bing 搜索 - 国内版
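Each crawl() above launches its own browser, so starting many of them at once can exhaust memory. One common way to cap this is to wrap each coroutine in an asyncio.Semaphore. In the sketch below, fake_crawl is a stand-in that sleeps instead of opening a browser, and the limit of 2 and the third URL are arbitrary assumptions.

```python
import asyncio


async def fake_crawl(url, state):
    """Stand-in for crawl(url): sleeps instead of driving a browser."""
    state['running'] += 1
    state['peak'] = max(state['peak'], state['running'])
    await asyncio.sleep(0.01)
    state['running'] -= 1
    return url


async def bounded(semaphore, coro_func, *args):
    """Run coro_func(*args) while holding a slot in the semaphore."""
    async with semaphore:
        return await coro_func(*args)


async def crawl_all(urls, limit=2):
    semaphore = asyncio.Semaphore(limit)  # at most `limit` crawls at a time
    state = {'running': 0, 'peak': 0}  # tracks how many crawls overlap
    results = await asyncio.gather(
        *(bounded(semaphore, fake_crawl, url, state) for url in urls)
    )
    return results, state['peak']


urls = ['https://www.baidu.com/', 'https://www.bing.com/', 'https://example.com/']
results, peak = asyncio.run(crawl_all(urls))
print(results)  # gather() returns results in the same order as `urls`
print(peak)     # never exceeds the limit
```

To use this with a real browser, fake_crawl would be replaced by a coroutine like crawl() that returns the title instead of printing it.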
Error: OSError: Unable to remove Temporary User Data
Pass the userDataDir parameter when launching the browser so that the cache is stored in a directory you choose. Make sure that directory is on a disk with plenty of free space, preferably not the system disk.
import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(headless=False, userDataDir='./cache/')  # store browser data in ./cache/
    input()  # keep the browser open until Enter is pressed
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
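A relative path such as './cache/' depends on the directory the script is started from. A small helper can create the directory up front and hand launch() an absolute path. prepare_user_data_dir is a hypothetical name, and the example uses a temporary directory so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def prepare_user_data_dir(base_dir, name='cache'):
    """Create base_dir/name if needed and return its absolute path as a string."""
    path = Path(base_dir).resolve() / name
    path.mkdir(parents=True, exist_ok=True)
    return str(path)


with tempfile.TemporaryDirectory() as tmp:
    user_data_dir = prepare_user_data_dir(tmp)
    print(Path(user_data_dir).is_dir())  # True
    # With pyppeteer installed, this would be passed as:
    # browser = await launch(headless=False, userDataDir=user_data_dir)
```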
Error: pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.
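The 30000 ms in this message is the default navigation timeout. page.goto() accepts a timeout option in milliseconds (0 disables it), so one possible workaround is to retry with a longer limit. The sketch below is an illustration under assumptions, not necessarily the fix from the referenced article: goto_with_retry and FakePage are hypothetical, and FakePage stands in for a real page so the example runs without a browser. It catches asyncio.TimeoutError, which pyppeteer.errors.TimeoutError is believed to subclass. If that does not hold for your version, pass the pyppeteer class through timeout_errors.

```python
import asyncio


async def goto_with_retry(page, url, timeouts_ms=(30000, 60000),
                          timeout_errors=(asyncio.TimeoutError,)):
    """Try page.goto(url) with each timeout in turn; re-raise the last failure."""
    last_error = None
    for timeout in timeouts_ms:
        try:
            return await page.goto(url, timeout=timeout)
        except timeout_errors as error:
            last_error = error  # try again with the next, longer timeout
    raise last_error


class FakePage:
    """Stand-in for a pyppeteer page: only 'loads' when given >= 60000 ms."""

    def __init__(self):
        self.attempts = []

    async def goto(self, url, timeout=30000):
        self.attempts.append(timeout)
        if timeout < 60000:
            raise asyncio.TimeoutError('Navigation Timeout Exceeded')
        return 'loaded ' + url


page = FakePage()
result = asyncio.run(goto_with_retry(page, 'https://www.baidu.com/'))
print(result)         # loaded https://www.baidu.com/
print(page.attempts)  # [30000, 60000]
```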
Putting it together
import asyncio
from pyppeteer import launch


async def main():
    # Hide the automation banner and keep browser data in ./cache/
    browser = await launch(headless=False, ignoreDefaultArgs=['--enable-automation'], userDataDir='./cache/')
    page = await browser.newPage()
    await page.setViewport({'width': 1366, 'height': 768})  # make the page content fill the window
    await page.goto('https://www.baidu.com/')  # navigate to the page
    input('Press Enter to exit')
    await browser.close()


asyncio.get_event_loop().run_until_complete(main())
References
- pyppeteer/pyppeteer: Headless chrome/chromium automation library
- Pyppeteer Documentation
- Chromium - The Chromium Projects
- Pyppeteer Environment Variables
- Pyppeteer environment construction, common parameters and 2 cases
- Pyppeteer bugs encountered and solutions
- pyppeteer crawls Jingdong Mall and Taobao sample code|Get cookie crawl search content
- pyppeteer tutorial
- pyppeteer: Solve the problem of OSError: Unable to remove Temporary User Data
- Solve the problem of pyppeteer navigation timeout: pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 30000 ms exceeded.
- asyncio (million concurrency) of python asynchronous programming
- Getting started with Python asynchronous programming