Example: Using headless Puppeteer to crawl JavaScript-rendered web pages

Puppeteer

Puppeteer, produced by the Google Chrome team, is an automation and testing library that runs on Node.js and drives Chromium. Its biggest advantage is that it can handle dynamic content in web pages, such as content rendered by JavaScript, and can more faithfully simulate a real user.
The anti-crawling tactic of some websites is to hide part of the content behind JavaScript/AJAX requests, so simply fetching the page and reading its a tags gets you nothing. Some websites even plant hidden-element "traps" invisible to human users; any script that triggers them is flagged as a bot. In such cases the advantages of Puppeteer stand out.
It can do the following:

  1. Generate screenshots and PDFs of pages (a minimal sketch follows this list).
  2. Crawl a SPA and generate pre-rendered content (i.e. "SSR").
  3. Automate form submission, UI testing, keyboard input, and so on.
  4. Create an up-to-date automated testing environment: run tests directly in the latest version of Chrome, using the latest JavaScript and browser features.
  5. Capture a timeline trace of your site to help diagnose performance issues.
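
As a quick illustration of items 1 and 2, here is a minimal sketch (the URL is just a placeholder, and note that page.pdf() only works in headless mode):

const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch(); // headless by default
    const page = await browser.newPage();
    await page.goto('https://example.com'); // placeholder URL
    // page.content() returns the DOM *after* JavaScript has run,
    // exactly what a plain HTTP fetch of the raw HTML cannot give you
    const html = await page.content();
    console.log('rendered HTML length: ' + html.length);
    await page.screenshot({path: 'example.png'});
    await page.pdf({path: 'example.pdf', format: 'A4'}); // PDF generation requires headless mode
    await browser.close();
})();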

Open source address: https://github.com/GoogleChrome/puppeteer/

Install

npm i puppeteer

Note that Node.js must be installed first; execute the command in the root directory of the Node.js installation (the same level as the npm executable).
During installation, a bundled Chromium (about 120 MB) is downloaded.
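
To verify the installation, a minimal sketch (the file name check.js is arbitrary) that launches the bundled Chromium and prints its version:

// check.js: confirm puppeteer and its bundled Chromium work
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    console.log(await browser.version()); // prints something like "HeadlessChrome/..."
    await browser.close();
})();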

After two days (about 10 hours) of exploration, dodging plenty of async pitfalls, I now have a reasonable grasp of Puppeteer and Node.js.
A long screenshot grabbing the list of blog articles:

[screenshot]

Crawl blog posts

Taking the CSDN blog as an example: the article body is only loaded after tapping "read full text", which defeats scripts that can only read the static DOM.

/**
 * Load blog.csdn.net articles into local files.
 **/
const puppeteer = require('puppeteer');
const fs = require('fs');
// emulate an iPhone
const userAgent = 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1';
const workPath = './contents';
if (!fs.existsSync(workPath)) {
    fs.mkdirSync(workPath);
}
// base url
const rootUrl = 'https://blog.csdn.net/';
// max wait milliseconds
const maxWait = 100;
// max scroll iterations
const maxLoop = 10;

(async () => {
    let url;
    let countUrl = 0;
    const browser = await puppeteer.launch({headless: false}); // set headless: true to hide the Chromium UI
    const page = await browser.newPage();
    await page.setUserAgent(userAgent);
    await page.setViewport({width: 414, height: 736});
    await page.setRequestInterception(true);
    // block images to save bandwidth
    page.on('request', request => {
        if (request.resourceType() === 'image')
            request.abort();
        else
            request.continue();
    });
    await page.goto(rootUrl);

    // scroll to the bottom repeatedly so the infinite feed keeps loading;
    // scrolling never navigates, so waitForNavigation always times out and
    // the catch block simply serves as a maxWait-millisecond pause
    for (let i = 0; i < maxLoop; i++) {
        try {
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
            await page.waitForNavigation({timeout: maxWait, waitUntil: ['networkidle0']});
        } catch (err) {
            console.log('scroll to bottom and then wait ' + maxWait + 'ms.');
        }
    }
    // quality only applies to jpeg screenshots, so the extension must match the type
    await page.screenshot({path: workPath + '/screenshot.jpeg', fullPage: true, quality: 100, type: 'jpeg'});

    // collect the href of every blog entry in the feed list
    const sel = '#feedlist_id li[data-type="blog"] h2 a';
    const hrefs = await page.evaluate((sel) => {
        let elements = Array.from(document.querySelectorAll(sel));
        let links = elements.map(element => element.href);
        return links;
    }, sel);
    console.log('total links: ' + hrefs.length);
    processNext();

    // open each article in its own tab, one at a time, recursively
    async function processNext() {
        if (countUrl < hrefs.length) {
            url = hrefs[countUrl];
            countUrl++;
        } else {
            browser.close();
            return;
        }
        console.log('processing url: ' + url);
        try {
            const tab = await browser.newPage();
            await tab.setUserAgent(userAgent);
            await tab.setViewport({width: 414, height: 736});
            await tab.setRequestInterception(true);
            // block images on article pages too
            tab.on('request', request => {
                if (request.resourceType() === 'image')
                    request.abort();
                else
                    request.continue();
            });
            await tab.goto(url);
            // tap "read full text" so the rest of the article is loaded
            try {
                await tab.tap('.read_more_btn');
            } catch (err) {
                console.log('there\'s no read more button. No need to TAP');
            }
            let title = await tab.evaluate(() => document.querySelector('#article .article_title').innerText);
            let contents = await tab.evaluate(() => document.querySelector('#article .article_content').innerText);
            contents = 'TITLE: ' + title + '\nURL: ' + url + '\nCONTENTS: \n' + contents;
            // name the file after the last path segment of the article URL
            fs.writeFileSync(workPath + '/' + tab.url().substring(tab.url().lastIndexOf('/') + 1) + '.txt', contents);
            console.log(title + ' has been downloaded to local.');
            await tab.close();
        } catch (err) {
            // tab is out of scope here, so log the url variable instead
            console.log('url: ' + url + ' \n' + err.toString());
        } finally {
            processNext();
        }
    }
})();
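
One quirk worth noting: scrolling never triggers a navigation, so the waitForNavigation call above always times out and its catch block effectively serves as the wait. If your Puppeteer version provides page.waitFor (the 1.x releases of this era do), a plain delay expresses the same intent more directly; a sketch:

// scroll, then pause so lazily-loaded list items have time to arrive
for (let i = 0; i < maxLoop; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitFor(maxWait); // a simple delay; no navigation event expected
}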


Implementation process

The screen recording can be viewed on my official account, and the screenshot is below:

[screenshot]

Results

Article content list:

[screenshot]

Article content:

[screenshot]

Concluding remarks

I had assumed that since Node.js runs JavaScript, it must be able to process the JavaScript content of web pages, but I never found a suitable, efficient library for it. It was not until I discovered Puppeteer that I decided to test the waters.
That said, the asynchrony of Node.js is a real headache; I wrestled with these few hundred lines of code for 10 hours.
You can improve on the processNext() method in the code by using async.eachSeries; the recursive approach I use is not the optimal solution.
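
For reference, a sketch of that refactor, assuming the async package from npm and a hypothetical processUrl() helper that wraps the per-tab logic above and returns a Promise:

const async = require('async');

// hrefs and browser come from the main script above
async.eachSeries(hrefs, (url, callback) => {
    processUrl(url) // hypothetical helper: open a tab, save the article, close the tab
        .then(() => callback())
        .catch(err => {
            console.log('url: ' + url + ' \n' + err.toString());
            callback(); // swallow the error so the series continues
        });
}, () => browser.close()); // runs once every url has been processed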
In fact, processing the articles one by one is not efficient. I originally wrote an asynchronous browser-shutdown method:

// countDown (defined elsewhere) tracks how many tabs are still being processed
let tryCloseBrowser = setInterval(function () {
    console.log('check if any process running...');
    if (countDown <= 0) {
        clearInterval(tryCloseBrowser);
        console.log('no process running, close.');
        browser.close();
    }
}, 3000);

Following this idea, the initial version of my code opened multiple tabs at the same time, which is very efficient, but its fault tolerance is very low. You can try writing it yourself.
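
For the curious, a sketch of what that concurrent version might look like, pairing the interval above with a countDown counter (processUrl is the same hypothetical per-tab helper as before):

let countDown = hrefs.length; // number of tabs still in flight

hrefs.forEach(url => {
    // every article gets its own tab, all running at once
    processUrl(url)
        .catch(err => console.log('url: ' + url + ' \n' + err.toString()))
        .then(() => countDown--); // mark this tab as finished either way
});
// the tryCloseBrowser interval above closes the browser once countDown reaches 0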

Off topic

Anyone who has read my articles knows that I emphasize approaches and methods for tackling problems, and offer suggestions for how to think.
I was completely new to Node.js and Puppeteer (of course, I knew roughly what they are good for, but that's all). If you remember the concept of on-demand memory mentioned in "10x Speed Programmer", you will understand why I deliberately do not study new technologies systematically.
Here is the complete chain of thinking I followed, from first touching Puppeteer to finishing the functionality I needed:

  1. Understand Puppeteer's functions and features, and judge whether they meet the requirements of my purpose.
  2. Quickly run through all the demos in the Get Started guide.
  3. Based on Puppeteer's features, speculate about its architecture from the designer's point of view.
  4. Validate that conjectured architecture.
  5. Read through the API for the details of Puppeteer.
  6. Search for Puppeteer's prerequisite knowledge (and the prerequisites those prerequisites depend on), organize it into a learning tree, and go back to step 1.
  7. Design / analyze / debug / …

May 9, 2018 02:13
