[Weekly small project] Crawling dynamic websites with puppeteer

0. Introduction

Over the past two days I have become interested in web crawlers. It started with a legendary thread on the Tianya forum that runs to tens of thousands of replies, which I had been following for a long time. On the Tianya web page, the "view original poster only" feature requires a membership. The mobile version does offer it, but the experience is poor and it is hard to keep notes, so I decided to crawl the original poster's posts on their own. That way I can both save them and search them.

The initial idea was very simple: go through the elements on each page, match each poster's name against the original poster's, and copy out the content of the posts that match.

The first tool I found online was cheerio. Once it has read a page, it holds the page content, and you pick out what you need with element selectors. Adding recursion on top of that also solved the pagination problem.
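
The recursive, page-by-page idea can be sketched roughly as follows. This is only an illustration, not the script I actually used: `loadPage` is a hypothetical helper that would fetch one page and parse it (for example with cheerio) into a list of posts plus the URL of the next page, and the `author` and `content` field names are assumptions.

```javascript
// Recursively walk a thread page by page and keep only the original poster's posts.
// `loadPage(url)` is a hypothetical helper supplied by the caller; it is assumed to
// resolve to { posts: [{ author, content }], nextUrl }, with nextUrl null on the last page.
async function collectOwnerPosts(url, ownerName, loadPage, collected = []) {
  if (!url) return collected; // base case: no further page to visit

  const { posts, nextUrl } = await loadPage(url);
  for (const post of posts) {
    // Keep a post only when its author matches the original poster's name
    if (post.author === ownerName) collected.push(post.content);
  }
  // Recurse into the next page, carrying along what has been collected so far
  return collectOwnerPosts(nextUrl, ownerName, loadPage, collected);
}
```

Keeping the fetching and parsing inside `loadPage` means the recursion itself does not care which parsing library is used.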

Sure enough, a few simple steps were all it took to save the original poster's posts, and that got me interested in crawlers.

Problem

cheerio really is easy to use, and it has no trouble with simple static pages. Against a site with some kind of anti-crawling mechanism, though, it can do nothing. For example, cheerio handles pagination by changing the URL from code. But on some sites, such as my favorite, Jandan (jandan.net), the page links are obfuscated, so there is no way to turn pages automatically. As another example, some real-estate sites lazy-load their listings for the sake of user experience: more items load only after the page has been scrolled to the bottom.

cheerio is powerless when it comes to page interactions like these.

Solution

While searching online for ways to deal with lazy loading, I found puppeteer. In 2017 Google developed the Chrome Headless feature and released puppeteer alongside it. It is essentially a browser without a user interface, a bit like a computer terminal: every operation is performed through code.

This way, before scraping a site we can scroll a given element to the bottom to trigger loading of more content. Or, when we need to turn pages, we can click the page button from code and then process the pages one after another.
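
Scrolling to the bottom to trigger lazy loading might look something like the sketch below. It assumes `page` is a puppeteer page that has already been opened, and that the site loads more content whenever the window reaches the bottom. The function name and the `maxRounds` and `delayMs` options are my own, not part of puppeteer.

```javascript
// Keep scrolling to the bottom until the page height stops growing.
// Returns the number of scroll rounds in which new content appeared.
async function scrollUntilStable(page, { maxRounds = 20, delayMs = 500 } = {}) {
  let previousHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    // This callback runs inside the browser page, not in Node
    const height = await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
      return document.body.scrollHeight;
    });
    if (height === previousHeight) return round; // nothing new was loaded
    previousHeight = height;
    // Give the lazy loader some time to fetch and render the next batch
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return maxRounds; // gave up after maxRounds; the page may still be growing
}
```

It would be called as `await scrollUntilStable(page)` after the page has been opened and before the data is read. A fixed delay is a crude way to wait, so a slow site may need a larger `delayMs`.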

1. Install and import the package

// Install (run in a terminal)
npm i puppeteer

// Import
const puppeteer = require('puppeteer')

2. Usage steps

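// Wrap the whole thing in an async IIFE so that await can be used
(async () => {

    // 1. Launch a browser with puppeteer and open a new page
    const browser = await puppeteer.launch({
        args: ['--no-sandbox'],
        dumpio: false,
        headless: false, // defaults to true; set to false to show the browser window
    })
    const page = await browser.newPage() // open a new page

    // 2. Open the target page
    await page.goto('http://jandan.net/ooxx', {
        waitUntil: 'networkidle2' // the network going idle suggests loading has finished
    });

    // 3. Automate the dynamic site; this step is the essence of puppeteer

    // Because this is a dynamic page, the element we need may not exist yet right
    // after the page opens, so we wait for it, e.g. the "next page" button

    await page.waitForSelector('a.previous-comment-page'); // the argument is an element selector

    // Once the next-page button appears, simulate a click. The click navigates,
    // so wait for the new page to finish loading before reading it
    await Promise.all([
        page.waitForNavigation({ waitUntil: 'networkidle2' }),
        page.click('a.previous-comment-page'),
    ])

    // 4. Now we can scrape the data we need: inspect the page's DOM structure and loop over it.
        // page.evaluate() runs a function in the browser, like running it in the console, and returns a Promise
    const result = await page.evaluate(() => {
        // Grab the jQuery already on the page
        var $ = window.$;
        // Do the familiar DOM work here
        // Do something
    });

    // 5. Close the browser
    await browser.close();

    // 6. Process the results, here by printing them to the console
    console.log(result);
})();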

3. Pitfalls I ran into

Passing arguments to page.evaluate

Because the page opened here is a puppet driven from a script rather than a page you operate directly, working on it differs somewhat from working in an ordinary browser page.

The official documentation describes how arguments are passed to it. In practice, a string variable can be passed in, but anything slightly more complex, such as 'fs' or a custom function defined outside, cannot be read inside the callback.
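
As far as I understand it, the callback given to page.evaluate runs inside the browser page, so it only sees values that are passed in as extra arguments and can be serialized. A minimal sketch of passing a string and a plain object this way is shown below. The function name and the `.post`, `.author` and `.content` selectors are hypothetical and would have to match the real page.

```javascript
// Pass plain, serializable values into the page as extra arguments to page.evaluate.
// The selectors below are hypothetical; adjust them to the DOM of the site being crawled.
async function collectPostsByAuthor(page, authorName) {
  const selectors = { post: '.post', author: '.author', content: '.content' };

  // The callback runs in the browser page: only `name` and `sel` (copied in from Node)
  // are available there, not Node modules such as fs or functions defined outside it.
  return page.evaluate((name, sel) => {
    return Array.from(document.querySelectorAll(sel.post))
      .filter((post) => {
        const author = post.querySelector(sel.author);
        return author !== null && author.textContent.trim() === name;
      })
      .map((post) => {
        const content = post.querySelector(sel.content);
        return content ? content.textContent.trim() : '';
      });
  }, authorName, selectors);
}
```

puppeteer also provides `page.exposeFunction` for making a Node-side function callable from inside the page. I have not tried it here, so treat it as something to look into rather than a tested solution.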

That is why I suggest, as in step 6, processing the results in one place after the page operations are complete. (Mainly because I could not solve this problem, so I gave up and worked around it...)
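
Processing on the Node side could be as simple as writing the returned data to a file with fs once page.evaluate has finished. The sketch below assumes the result is an array of strings. `savePosts` and the output format are made up for illustration.

```javascript
const fs = require('fs');

// Write the scraped posts to a text file from Node, after page.evaluate has returned.
// `posts` is assumed to be an array of strings; each one gets a numbered header line.
function savePosts(posts, filePath) {
  const text = posts.map((post, i) => `#${i + 1}\n${post}\n`).join('\n');
  fs.writeFileSync(filePath, text, 'utf8');
  return filePath;
}
```

Because this runs in Node rather than in the page, fs and any other module or custom function can be used freely.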

Operating on elements

puppeteer's two most important capabilities, running functions in the page and selecting elements, both work somewhat differently from an ordinary browser. There are probably more pitfalls here that I have not discovered yet.

Origin www.cnblogs.com/half-bug/p/12060759.html