Getting started with Pyppeteer, a powerful Python crawling tool

This article introduces Pyppeteer, a Python crawling tool, and walks through how to get started with it and use it, with example code. It may be a useful reference for study or work.

Introduction

Most readers are presumably familiar with selenium. As a well-known web automated testing framework, selenium supports a variety of mainstream browsers and provides a feature-rich API, and it is often used as a crawling tool. Its drawbacks are just as obvious: it is slow, it is strict about matching versions, and keeping the corresponding browser driver up to date is often the most troublesome part.

Today I introduce another web automated testing tool, Pyppeteer. It supports fewer browsers than selenium, but it compares favorably in ease of installation and configuration and in runtime efficiency.

01. Introduction to Pyppeteer

Before introducing Pyppeteer, a word about Puppeteer. Puppeteer is a Node.js tool from Google whose API is mainly used to control the Chrome browser. Chrome is driven by JavaScript code to carry out tasks such as data crawling and web automated testing.

Pyppeteer is essentially an unofficial Python port of Puppeteer. Below is a brief look at two of its characteristics: the chromium browser and the asyncio framework.

1). chromium

Chromium is an independent browser and the open-source project Google started in order to develop its own browser, Google Chrome. It is roughly an experimental version of Chrome: less stable than Chrome but richer in features. It is also updated very quickly, with a new development build usually appearing every few hours.

Pyppeteer implements web automation on top of chromium. Thanks to certain characteristics of chromium, installing and configuring Pyppeteer is very simple, as described in detail later.

2). asyncio

asyncio is Python's asynchronous coroutine library. It was added to the standard library in Python 3.4, which gave the language built-in support for asynchronous IO. It has been called Python's most ambitious library, and the official documentation covers it in great detail.
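
To make the idea of coroutines concrete, here is a minimal sketch that uses only the standard library. The helper name `fake_io` and the delays are made up for illustration, and `asyncio.run` assumes Python 3.7 or later. Three simulated waits are scheduled together, so the total time is close to the longest single wait rather than the sum.

```python
import asyncio
import time


async def fake_io(name, delay):
    """Pretend to wait on a slow network call for `delay` seconds."""
    await asyncio.sleep(delay)  # hands control back to the event loop while waiting
    return name


async def main():
    start = time.perf_counter()
    # gather() runs the three coroutines concurrently and returns their
    # results in the same order the coroutines were passed in.
    results = await asyncio.gather(
        fake_io("a", 0.2),
        fake_io("b", 0.2),
        fake_io("c", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```
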
02. Installation and use

1). Minimal installation

Run pip install pyppeteer to install the pyppeteer library. As for the chromium browser, a single pyppeteer-install command automatically downloads the chromium version that corresponds to the installed pyppeteer into pyppeteer's default location.

If you do not run pyppeteer-install, chromium is downloaded and installed automatically the first time pyppeteer is used, with the same result. Overall, pyppeteer spares you the driver configuration step that selenium requires.

Of course, the automatic chromium installation may fail for one reason or another. In that case, consider installing it manually. First, download the chromium archive for your system from one of the following URLs:

'Linux': 'https://storage.googleapis.com/chromium-browser-snapshots/Linux_x64/575458/chrome-linux.zip'
'Mac': 'https://storage.googleapis.com/chromium-browser-snapshots/Mac/575458/chrome-mac.zip'
'Win32': 'https://storage.googleapis.com/chromium-browser-snapshots/Win/575458/chrome-win32.zip'
'Win64': 'https://storage.googleapis.com/chromium-browser-snapshots/Win_x64/575458/chrome-win32.zip'
Then extract the archive into the directory where pyppeteer expects to find chromium. (Figure: pyppeteer's default directory on Windows and on other systems.)
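
As a small aid for the manual route, the sketch below picks the matching download URL for the current machine. It only rebuilds the four URLs listed above; the function names `current_platform` and `chromium_url` are made up for this example and are not part of pyppeteer's API.

```python
import sys

BASE_URL = "https://storage.googleapis.com/chromium-browser-snapshots"
REVISION = "575458"  # revision that appears in the URLs listed in this article

# (snapshot folder, archive name) for each platform key
ARCHIVES = {
    "linux": ("Linux_x64", "chrome-linux.zip"),
    "mac": ("Mac", "chrome-mac.zip"),
    "win32": ("Win", "chrome-win32.zip"),
    "win64": ("Win_x64", "chrome-win32.zip"),
}


def current_platform():
    """Map sys.platform to one of the keys in ARCHIVES."""
    if sys.platform.startswith("linux"):
        return "linux"
    if sys.platform == "darwin":
        return "mac"
    if sys.platform.startswith("win"):
        # A 64-bit Python build implies a 64-bit Windows install.
        return "win64" if sys.maxsize > 2**32 else "win32"
    raise OSError(f"unsupported platform: {sys.platform}")


def chromium_url(platform_key=None, revision=REVISION):
    """Build the snapshot URL for a platform (defaults to this machine)."""
    folder, archive = ARCHIVES[platform_key or current_platform()]
    return f"{BASE_URL}/{folder}/{revision}/{archive}"


if __name__ == "__main__":
    print(chromium_url())
```

If placing the files in the default directory is inconvenient, pyppeteer's `launch()` also accepts an `executablePath` option that points at an extracted chromium binary.
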
2). Usage

With the installation done, let's try it out. In the example, the main function first creates a browser object, opens a new tab, visits the Baidu home page, takes a screenshot of the current page and saves it as "example.png", and finally closes the browser. As mentioned earlier, pyppeteer is built on asyncio, so it has to be used with the async/await structure.
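
The steps just described can be sketched as follows. This is a reconstruction rather than the article's original listing: the `headless` parameter on `main` is added for illustration, and the import sits inside the function only so the file can be loaded before pyppeteer is installed.

```python
import asyncio


async def main(headless=True):
    # Imported lazily so this module loads even if pyppeteer is not installed yet.
    from pyppeteer import launch

    # Start chromium; headless=True means no visible browser window.
    browser = await launch(headless=headless)
    try:
        page = await browser.newPage()                   # open a new tab
        await page.goto("https://www.baidu.com")         # visit the Baidu home page
        await page.screenshot({"path": "example.png"})   # save a screenshot
    finally:
        await browser.close()                            # always shut the browser down
```

Run it with `asyncio.run(main())`, or with `asyncio.run(main(headless=False))` to watch the browser window while it works.
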
Running the example does not pop up a browser window, because Pyppeteer uses a headless browser by default. To display the browser, set the parameter headless=False in the launch function. After the program ends, the captured page image is in the same directory.

03. Asynchronous crawling in practice: fund data

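We have been saying that Pyppeteer is a very efficient web automated testing tool. The underlying reason is that Pyppeteer is built on asyncio: almost all of its attributes and methods are coroutine objects, so it is very convenient for building asynchronous programs and supports asynchronous execution by design.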

Let's compare how sequential execution and asynchronous execution actually perform:

1). Crawling fund data

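The experimental task is to crawl the net asset value (NAV) data of open-end funds from the Tiantian Fund site (天天基金网). The page showing one fund's historical NAV data is loaded by JavaScript, so its content cannot be fetched directly with requests. We can therefore scrape it by simulating browser operations. (In fact, fund NAV data is available through an API; this task is only a demonstration and has no practical value.) (Figure: historical NAV data of a single fund.)

To make the effect more visible, we crawl the NAV data of the last 20 trading days for the first 50 funds on the fund list page. (Figure: fund list page.)

2). Sequential execution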

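The basic idea is to create one browser and one page, then visit each fund's NAV page in turn and scrape the data.

In the program, the get_data() function parses the NAV page and converts the data, and the get_all_codes() function fetches the codes of all open-end funds (more than 6,000 in total). Although the program also uses the async/await structure, the NAV data for the funds is fetched sequentially inside the callurl_and_getdata() function. It is written this way because the methods in pyppeteer are all coroutine objects, so the program has to be built in this form.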
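
The sequential structure can be sketched roughly as below. The original listing is not available, so the signatures are assumptions: `get_data` here is a stand-in that just returns the page HTML instead of parsing NAV rows, `callurl_and_getdata` is assumed to take a page and a list of URLs, and the fund page URLs are left to the caller.

```python
import asyncio


async def get_data(page):
    """Stand-in for the article's get_data(): only returns the raw HTML."""
    return await page.content()


async def callurl_and_getdata(page, urls):
    """Visit each URL in turn on the same page and collect the scraped data."""
    results = []
    for url in urls:
        await page.goto(url)                  # wait for this page before moving on
        results.append(await get_data(page))  # scrape, then continue with the next URL
    return results


async def main(urls):
    # Imported lazily so this module loads even if pyppeteer is not installed yet.
    from pyppeteer import launch

    browser = await launch()
    try:
        page = await browser.newPage()        # one shared tab for every fund
        return await callurl_and_getdata(page, urls)
    finally:
        await browser.close()
```

Every `await` inside the loop finishes before the next iteration starts, which is why this version gains nothing from asyncio even though it uses async/await throughout.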

To exclude the time spent opening the browser, we time only the page visits and the data scraping. The result: 12.08 seconds.

3). Asynchronous execution

Now let's rework the program. The functions do the same job as before; the main change is that the loop over fundlist is converted into asynchronous task objects.

The timing again starts after the browser has opened. The run takes 2.18 seconds, about 5.5 times as fast as sequential execution. Imagine a large crawling job that needs 10 hours when run sequentially: run asynchronously, it might take less than two hours. The optimization effect is very clear.
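
A minimal sketch of that transformation follows. It is not the original listing: `fetch_one` and `fetch_all` are hypothetical names, `get_data` is again a stand-in that returns the page HTML, and opening one tab per fund is an assumption about how the tasks share the browser.

```python
import asyncio


async def get_data(page):
    """Stand-in for the article's get_data(): only returns the raw HTML."""
    return await page.content()


async def fetch_one(browser, url):
    """Open a dedicated tab for one URL, scrape it, then close the tab."""
    page = await browser.newPage()
    try:
        await page.goto(url)
        return await get_data(page)
    finally:
        await page.close()


async def fetch_all(browser, urls):
    """Turn the loop over URLs into tasks that run concurrently."""
    tasks = [asyncio.ensure_future(fetch_one(browser, url)) for url in urls]
    # gather() waits for all tasks and returns results in input order.
    return await asyncio.gather(*tasks)
```

While one tab is waiting for its page to load, the event loop moves on to the other tabs, which is where the speedup comes from. For a much longer fund list it would probably be wise to cap the number of open tabs, for example with an `asyncio.Semaphore`.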

Origin blog.csdn.net/haoxun05/article/details/104383049