Lecture 19: Pyppeteer crawling in action

In the last lesson, we learned the basic usage of Pyppeteer and saw that it offers quite a few conveniences compared with Selenium.

In this lesson, we will use Pyppeteer to rewrite the previous Selenium case, so that we can understand the differences between the two and, at the same time, reinforce our grasp of Pyppeteer.

1. Crawl the target

In this lesson, the crawl target is the same as in the Selenium case. The address is: https://dynamic2.scrape.cuiqingcai.com/ .
The URL of each detail page on this site contains encrypted parameters, and the Ajax API also has encrypted, time-limited parameters. For a detailed introduction, see the Selenium lesson.

2. Objectives of this section

The crawl target is the same as in that section:

  • Traverse the list page of each page, and then get the URL of the detail page of each movie.
  • Crawl the details page of each movie, and then extract its name, rating, category, cover, introduction and other information.
  • The crawled data is saved as a JSON file.

The requirements are the same as before, but our implementation here is all done with Pyppeteer.

3. Preparation

Before starting this lesson, we need to make the following preparations:

  • Install Python (at least Python 3.6) and be able to run Python programs successfully.
  • Install Pyppeteer and run the example successfully.

No separate browser or driver configuration is needed, which makes it more convenient than Selenium.

We will not repeat the page analysis here; the structure is still list pages plus detail pages. For details, please refer to the Selenium lesson.

4. Crawl the list page

First, let's do some preparatory work: define some basic configuration, including logging setup and variables, and import the necessary packages. The code is as follows:

import logging
logging.basicConfig(level=logging.INFO,
                   format='%(asctime)s - %(levelname)s: %(message)s')
INDEX_URL = 'https://dynamic2.scrape.cuiqingcai.com/page/{page}'
TIMEOUT = 10
TOTAL_PAGE = 10
WINDOW_WIDTH, WINDOW_HEIGHT = 1366, 768
HEADLESS = False

Most of the configuration here is the same as before, but we additionally define the width and height of the window as 1366 x 768; you can adjust these to suit your own screen. We also define a HEADLESS variable to specify whether to enable Pyppeteer's headless mode: if it is False, a Chromium browser window will pop up when Pyppeteer starts.

Then we define a method to initialize Pyppeteer, including starting Pyppeteer, creating a new page tab, setting the window size and other operations. The code implementation is as follows:

from pyppeteer import launch
browser, tab = None, None
async def init():
   global browser, tab
   browser = await launch(headless=HEADLESS,
                          args=['--disable-infobars',
                                f'--window-size={WINDOW_WIDTH},{WINDOW_HEIGHT}'])
   tab = await browser.newPage()
   await tab.setViewport({'width': WINDOW_WIDTH, 'height': WINDOW_HEIGHT})

Here we first declare a browser object, which represents the browser that Pyppeteer controls, and a tab object, which represents the newly created page tab. Both are made global variables so that other methods can access them.

We also define an init method that calls Pyppeteer's launch method, passing HEADLESS as the headless parameter (currently non-headless mode), and using args to hide the info bar and set the width and height of the window.

Next we define a general crawling method as before, the code is as follows:

from pyppeteer.errors import TimeoutError
async def scrape_page(url, selector):
   logging.info('scraping %s', url)
   try:
       await tab.goto(url)
       await tab.waitForSelector(selector, options={
           'timeout': TIMEOUT * 1000
       })
   except TimeoutError:
       logging.error('error occurred while scraping %s', url, exc_info=True)

Here we define a scrape_page method that receives two parameters: url, the link to be crawled, which is passed to the goto method; and selector, the CSS selector of the node we wait to be rendered. We then call the waitForSelector method, pass in the selector, and specify the maximum waiting time through options.

At runtime, the page first visits the URL and then waits up to 10 seconds for a node matching the selector to load. If it loads within 10 seconds, execution continues; otherwise a TimeoutError is thrown, which we catch and log as an error.
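
If you want to verify this step on its own, a minimal smoke test could look like the sketch below. The smoke_test coroutine is just illustrative and not part of the final crawler; it assumes init, scrape_page and INDEX_URL are defined as above.

import asyncio

async def smoke_test():
   await init()
   # open page 1 of the list and wait for the movie name nodes to render
   await scrape_page(INDEX_URL.format(page=1), '.item .name')
   await browser.close()

if __name__ == '__main__':
   asyncio.get_event_loop().run_until_complete(smoke_test())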

Next, we will implement the method of crawling the list page. The code is implemented as follows:

async def scrape_index(page):
   url = INDEX_URL.format(page=page)
   await scrape_page(url, '.item .name')

Here we define a scrape_index method to crawl the list page. It accepts a page parameter representing the page number to be crawled. We first construct the list page URL from INDEX_URL, then call the scrape_page method, passing in the url and the selector to wait for.

The selector here is .item .name, which corresponds to the name of each movie on the list page. Once it has loaded, the page has loaded successfully.
Okay, then we can define another method to parse the list page and extract the URL of the detail page of each movie, which is defined as follows:

async def parse_index():
   return await tab.querySelectorAllEval('.item .name', 'nodes => nodes.map(node => node.href)')

Here we call the querySelectorAllEval method, which receives two parameters. The first is selector, the CSS selector of the nodes to select; the second is pageFunction, a JavaScript function passed in as a string. The method selects all nodes matching the selector, runs pageFunction on them to extract the results, and returns them.

So here the selector matches the movie-name nodes, which are actually hyperlink a nodes. Since there are multiple results, the pageFunction takes a nodes parameter and calls the map method to read the href attribute of each node. The returned result is therefore a list of the detail page URLs of all movies on the current list page.
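
The same selector-plus-pageFunction pattern works for other properties too. As a purely hypothetical illustration (not part of the final crawler), extracting the visible movie names instead of the links could look like this, assuming the nodes' innerText holds the title text:

names = await tab.querySelectorAllEval('.item .name', 'nodes => nodes.map(node => node.innerText)')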

OK, now let's chain these calls together. The code is implemented as follows:

import asyncio
async def main():
   await init()
   try:
       for page in range(1, TOTAL_PAGE + 1):
           await scrape_index(page)
           detail_urls = await parse_index()
           logging.info('detail_urls %s', detail_urls)
   finally:
       await browser.close()
if __name__ == '__main__':
   asyncio.get_event_loop().run_until_complete(main())

Here we define a main method that calls the previously defined methods in sequence. It first calls the init method, then loops over the page numbers, calling the scrape_index method to crawl each list page and the parse_index method to extract the detail page URLs from it, and then logs the result.

The results are as follows:

2020-04-08 13:54:28,879 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/page/1
2020-04-08 13:54:31,411 - INFO: detail_urls ['https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx', ...,
'https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWI5', 'https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIxMA==']
2020-04-08 13:54:31,411 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/page/2

The output is long, so part of it is omitted here.

Here you can see that each return result will be a list of all the details page URLs extracted from the current list page. We can use these URLs to crawl in the next step.

5. Crawl the details page

After getting the URL of the detail page, the next step is to crawl each detail page and extract the information. First, we define a method for crawling the detail page. The code is as follows:

async def scrape_detail(url):
   await scrape_page(url, 'h2')

The code is very simple: it directly calls the scrape_page method and passes in the selector of the node to wait for. Here we use h2, which corresponds to the movie name on the detail page.
If it runs smoothly, then Pyppeteer has successfully loaded the details page, and the next step is to extract the information inside.

Next, we define a method to extract detailed information, the code is as follows:

async def parse_detail():
   url = tab.url
   name = await tab.querySelectorEval('h2', 'node => node.innerText')
   categories = await tab.querySelectorAllEval('.categories button span', 'nodes => nodes.map(node => node.innerText)')
   cover = await tab.querySelectorEval('.cover', 'node => node.src')
   score = await tab.querySelectorEval('.score', 'node => node.innerText')
   drama = await tab.querySelectorEval('.drama p', 'node => node.innerText')
   return {
       'url': url,
       'name': name,
       'categories': categories,
       'cover': cover,
       'score': score,
       'drama': drama
   }

Here we define a parse_detail method to extract the URL, name, category, cover, score, introduction, etc. The extraction method is as follows:

  • URL: directly call the url property of the tab object to get the URL of the current page.
  • Name: Since there is only one name node, we call the querySelectorEval method instead of querySelectorAllEval. The first parameter is h2, which selects the node containing the name; the second parameter is the pageFunction, which reads the node's innerText property to get the text value, i.e. the movie name.
  • Categories: There are multiple categories, so we call the querySelectorAllEval method. The CSS selector is .categories button span, which selects all category nodes. As when extracting the detail page URLs, the pageFunction takes a nodes parameter and calls map to read each node's innerText, giving all category values.
  • Cover: Similarly, the CSS selector .cover selects the cover node; since the cover URL is in its src attribute, we extract src here.
  • Score: The CSS selector for the score is .score; by the same principle, we just extract the node's innerText.
  • Introduction: You can also use the CSS selector .drama p to directly obtain the node corresponding to the introduction, and then call the innerText property to extract the text.

Finally, we summarize the extraction results into a dictionary and then return.

Next, in the main method, we add calls to the scrape_detail and parse_detail methods, and the main method is rewritten as follows:

async def main():
   await init()
   try:
       for page in range(1, TOTAL_PAGE + 1):
           await scrape_index(page)
           detail_urls = await parse_index()
           for detail_url in detail_urls:
               await scrape_detail(detail_url)
               detail_data = await parse_detail()
               logging.info('data %s', detail_data)
   finally:
       await browser.close()

Run it again and check the results, which are as follows:

2020-04-08 14:12:39,564 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/page/1
2020-04-08 14:12:42,935 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx
2020-04-08 14:12:45,781 - INFO: data {'url': 'https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIx', 'name': '霸王别姬 - Farewell My Concubine', 'categories': ['剧情', '爱情'], 'cover': 'https://p0.meituan.net/movie/ce4da3e03e655b5b88ed31b5cd7896cf62472.jpg@464w_644h_1e_1c', 'score': '9.5', 'drama': '影片借一出《霸王别姬》的京戏,牵扯出三个人之间一段随时代风云变幻的爱恨情仇。段小楼(张丰毅 饰)与程蝶衣(张国荣 饰)是一对打小一起长大的师兄弟,两人一个演生,一个饰旦,一向配合天衣无缝,尤其一出《霸王别姬》,更是誉满京城,为此,两人约定合演一辈子《霸王别姬》。但两人对戏剧与人生关系的理解有本质不同,段小楼深知戏非人生,程蝶衣则是人戏不分。段小楼在认为该成家立业之时迎娶了名妓菊仙(巩俐 饰),致使程蝶衣认定菊仙是可耻的第三者,使段小楼做了叛徒,自此,三人围绕一出《霸王别姬》生出的爱恨情仇战开始随着时代风云的变迁不断升级,终酿成悲剧。'}
2020-04-08 14:12:45,782 - INFO: scraping https://dynamic2.scrape.cuiqingcai.com/detail/ZWYzNCN0ZXVxMGJ0dWEjKC01N3cxcTVvNS0takA5OHh5Z2ltbHlmeHMqLSFpLTAtbWIy

Here you can see that a list page is crawled first, then its detail page URLs are extracted; each detail page is crawled in turn, the movie information we want is extracted from it, and then the next detail page is crawled.

In this way, all the detail pages will be crawled.

6. Data storage

Finally, we add a data storage method as before. For convenience, save it as a JSON text file here. The implementation is as follows:

import json
from os import makedirs
from os.path import exists
RESULTS_DIR = 'results'
exists(RESULTS_DIR) or makedirs(RESULTS_DIR)
async def save_data(data):
   name = data.get('name')
   data_path = f'{RESULTS_DIR}/{name}.json'
   json.dump(data, open(data_path, 'w', encoding='utf-8'), ensure_ascii=False, indent=2)

The principle here is exactly the same as before, but because our Pyppeteer code is asynchronous, we declare the save_data method with async so it can be awaited in the same call chain.

Finally, add a call to save_data in the main method, and you can see the complete crawling and saving process in action.
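
For reference, one way to wire it in is the following sketch, which is simply the main method from above with a single save_data call added at the end of the inner loop (the final code in the repository may differ slightly):

async def main():
   await init()
   try:
       for page in range(1, TOTAL_PAGE + 1):
           await scrape_index(page)
           detail_urls = await parse_index()
           for detail_url in detail_urls:
               await scrape_detail(detail_url)
               detail_data = await parse_detail()
               logging.info('data %s', detail_data)
               # the only new line: persist each movie as a JSON file
               await save_data(detail_data)
   finally:
       await browser.close()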

7. Troubleshooting

During operation, due to the implementation of Pyppeteer itself, the following error may appear in the console after running continuously for 20 seconds:

pyppeteer.errors.NetworkError: Protocol Error (Runtime.evaluate): Session closed. Most likely the page has been closed.

The reason is that Pyppeteer uses Websocket internally. If the Websocket client does not receive a pong response 20 seconds after sending a ping signal, the connection will be terminated.

For the solution and a detailed description of the problem, see https://github.com/miyakogi/pyppeteer/issues/178 . We can modify the Pyppeteer source code to solve it; the corresponding change can be seen at https://github.com/miyakogi/pyppeteer/pull/160/files , namely adding ping_interval=None and ping_timeout=None to the connect call.

Alternatively, you can patch the Connection behavior at runtime instead of editing the source. A patch_pyppeteer-style workaround is also discussed at https://github.com/miyakogi/pyppeteer/pull/160.
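
For reference, a commonly circulated form of that patch_pyppeteer workaround looks roughly like the sketch below. It monkey-patches the websockets connect call used by pyppeteer.connection so the ping/pong keepalive checks are disabled. This is reconstructed from the linked discussion rather than taken from this lesson's code, so treat it as an assumption and verify it against your installed Pyppeteer version.

import pyppeteer.connection

def patch_pyppeteer():
   # wrap the websockets connect function used by pyppeteer.connection
   original_connect = pyppeteer.connection.websockets.client.connect

   def patched_connect(*args, **kwargs):
       # disable the ping/pong keepalive so the session is not closed
       kwargs['ping_interval'] = None
       kwargs['ping_timeout'] = None
       return original_connect(*args, **kwargs)

   pyppeteer.connection.websockets.client.connect = patched_connect

# Call patch_pyppeteer() once before launch() to apply the workaround.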

8. Headless mode

Finally, if the code can run stably, we can change it to headless mode and change HEADLESS to True, so that the browser window will not pop up when running.

9. Summary

In this lesson, we walked through crawling a complete website with Pyppeteer, which should give you a firmer grasp of how to use it.
Code for this section: https://github.com/Python3WebSpider/ScrapeDynamic2 .
