Python web crawler -- Pyppeteer

Pyppeteer

Basic usage of the pyppeteer module

Introduction

Using Selenium involves some hassle around environment configuration: you need a browser such as Chrome or Firefox installed, you have to download the matching driver from the official site, and you must also install the corresponding Python Selenium library. For large-scale deployment, these environment problems become a real headache. This section introduces a similar alternative called Pyppeteer.

Pyppeteer Overview

Note that the module described in this section is Pyppeteer, not Puppeteer. Puppeteer is a tool developed by Google on top of Node.js; with it we can control the Chrome browser through JavaScript, so it can of course also be used for web crawling, and its API is complete and very powerful. So what is Pyppeteer? It is a Python implementation of Puppeteer, but it is not developed by Google: it is an unofficial version written by a Japanese engineer based on some of Puppeteer's features.

Behind the scenes, Pyppeteer drives a Chromium browser, which is very similar to Chrome, to render pages and perform actions. Let's first look at the relationship between Chrome and Chromium.

  Chromium is the project Google started in order to develop Chrome, and it is fully open source. The two are built from the same source code: every new Chrome feature is implemented in Chromium first and only ported once it is verified stable, so Chromium updates more frequently and includes many newer features, but as a standalone browser its user base is much smaller. The two browsers share the same roots and the same logo, only in different colors: Chrome's is made of blue, red, green, and yellow, while Chromium's is made of different shades of blue.

Pyppeteer depends on the Chromium browser to run. With Pyppeteer we can avoid the tedious environment configuration described above: on the first run, if Chromium is not installed, the program downloads and configures it automatically, eliminating that setup work. In addition, Pyppeteer is built on Python's async features, so parts of its execution support asynchronous operation, which improves efficiency relative to Selenium.
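To illustrate the async efficiency claim without starting a browser, here is a minimal stdlib-only sketch: three simulated page loads, with asyncio.sleep standing in for the time page.goto would spend waiting on the network, run concurrently via asyncio.gather instead of one after another. The URLs and timings are made up for the demonstration.

```python
import asyncio
import time

async def fake_fetch(url):
    # Pretend network/render delay; a real crawler would await page.goto here
    await asyncio.sleep(0.1)
    return 'source of ' + url

async def main():
    urls = ['http://example.com/page/%d' % i for i in range(3)]
    # Run all three fetches concurrently and collect their results
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(len(results))  # 3
```

Run sequentially, the three sleeps would take about 0.3 s; run concurrently, the whole thing finishes in roughly 0.1 s.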

Environment Installation

  • Since Pyppeteer uses Python's async mechanism, it requires Python 3.5 or above
  • pip install pyppeteer

Quick Start

 

- Crawl the full page data of http://quotes.toscrape.com/js/ (a page rendered by JavaScript)

import asyncio
from pyppeteer import launch
from lxml import etree

async def main():
    # Launch a Chromium instance (downloaded automatically on first run)
    browser = await launch()
    # Open a new blank tab
    page = await browser.newPage()
    await page.goto('http://quotes.toscrape.com/js/')
    # Source of the page after JavaScript rendering
    page_text = await page.content()
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@class="quote"]')
    print(len(div_list))
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
import asyncio
from pyppeteer import launch
from lxml import etree

async def main():
    # Instantiate a browser object (Chromium)
    bro = await launch()
    # Create a new blank page
    page = await bro.newPage()
    await page.goto('http://quotes.toscrape.com/js/')

    # Get the source of the currently displayed page
    page_text = await page.content()

    return page_text

def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@class="quote"]')
    for div in div_list:
        content = div.xpath('./span[1]/text()')
        print(content)

c = main()
task = asyncio.ensure_future(c)
task.add_done_callback(parse)
loop = asyncio.get_event_loop()
loop.run_until_complete(task)
Binding a callback to the coroutine
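The ensure_future / add_done_callback wiring used above can be seen in isolation with a stdlib-only sketch, where a trivial coroutine stands in for the real page fetch (no browser involved; the HTML string and names here are made up):

```python
import asyncio

async def produce():
    # Stands in for the real page fetch; returns a value like main() does
    await asyncio.sleep(0)
    return '<html>fake page source</html>'

collected = []

def parse_demo(task):
    # The callback receives the finished Task and reads its return value
    # with task.result(), exactly as parse() does above
    collected.append(task.result())

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
task = asyncio.ensure_future(produce())
task.add_done_callback(parse_demo)
loop.run_until_complete(task)
loop.close()
print(collected)  # ['<html>fake page source</html>']
```

The callback fires once the task finishes, before run_until_complete returns, which is why the document's version can print parsed results from inside parse.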

 

 Explanation:

  The launch method creates a new Browser object, which we assign to browser. Calling its newPage method is equivalent to opening a new browser tab, and it returns a new Page object. The Page object's goto method is equivalent to typing the URL into the browser and navigating to the corresponding page; once loading has finished, the content method returns the source code of the current page. We then parse that source (here with lxml) and obtain the JavaScript-rendered result. In this whole process we never configured a Chrome browser or a browser driver: the cumbersome setup steps are gone, we achieve the same effect as Selenium, and the scraping is asynchronous as well.

 

Detailed usage

  • Opening the browser
  • Call the launch method; its relevant parameters are:
    • ignoreHTTPSErrors (bool): whether to ignore HTTPS errors; defaults to False.
    • headless (bool): whether to enable headless (no-interface) mode. If the devtools parameter is True, headless is forced to False; otherwise it defaults to True, so headless mode is enabled by default.
    • executablePath (str): path to the browser executable. If you do not want the default Chromium, you can point this at an existing Chrome or Chromium installation.
    • args (List[str]): additional arguments passed to the browser process.
    • devtools (bool): whether to automatically open the debugging tools for each page; defaults to False. If set to True, the headless parameter is overridden and forced to False.
  • Turning off the infobar "Chrome is being controlled by automated test software": if this prompt annoys you, how do you turn it off? Use the args parameter; the infobar is disabled as follows:
    • browser = await launch(headless=False, args=['--disable-infobars']) 
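The options above can be combined in a single launch call. The sketch below only builds the keyword arguments as a plain dict (so it runs without pyppeteer installed); the commented lines show how they would be passed to launch inside a coroutine.

```python
# Assembling the launch options described above as a kwargs dict
launch_kwargs = {
    'headless': False,                # show the browser window
    'ignoreHTTPSErrors': True,        # tolerate certificate errors
    'args': ['--disable-infobars'],   # hide the automation infobar
}

print(sorted(launch_kwargs))  # ['args', 'headless', 'ignoreHTTPSErrors']

# from pyppeteer import launch
# browser = await launch(**launch_kwargs)  # inside an async def
```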

 


Origin www.cnblogs.com/bilx/p/11572838.html