Puppeteer tutorial - basics (super detailed, super comprehensive)

Table of contents

Foreword

1. Introduction to Puppeteer

2. The relationship between Browser, BrowserContext and Page in Puppeteer

3. Puppeteer APIs

3.1 Browser

3.2 Page

3.2.1 page.on(request/response)

3.2.2 page.$()

3.2.3 page.$$()

3.2.4 page.$eval()

3.2.5 page.addScriptTag()

3.2.6 page.addStyleTag(options)

3.2.7 page.click

3.2.8 page.cookies()

3.2.9 page.evaluate(pageFunction[, ...args])

3.2.10 page.exposeFunction(name, puppeteerFunction)

3.2.11 page.focus(selector), page.hover(selector), page.mouse

3.2.12 page.pdf()

3.2.13 page.setCookie(...cookies)

3.2.14 page.setRequestInterception(value[Boolean])

3.2.15 Focus on the request: Request, Response

3.2.16 page.type(selector, text[, options])

3.2.17 page.waitForRequest

4. Summary


Foreword

        Puppeteer is currently one of the friendlier Node libraries for building crawlers, automated tests, page capture, and so on, but there are few related blog posts online, and none of them clearly explain Puppeteer's concepts, APIs, and examples. So I took this opportunity to write an article and share what I know; corrections are welcome if anything is wrong.

Puppeteer Chinese official website

Puppeteer English official website

Gitee address

1. Introduction to Puppeteer

        Puppeteer is a Node library that provides a high-level API to control Chromium or Chrome via the DevTools protocol. Puppeteer runs in headless mode by default, but it can also run in "headed" mode by changing the launch configuration. When creating a browser, you control headless mode by passing in launch options, as follows:

const puppeteer = require('puppeteer');

const browser = await puppeteer.launch({ headless: false })
// headless: false disables headless mode, so when the program runs, a Puppeteer-controlled
// browser window opens and you can watch the simulated page operations
// With headless mode off the screen may occasionally flicker; at least my machine does that

2. The relationship between Browser, BrowserContext and Page in Puppeteer

 The official documentation includes a diagram of Puppeteer's browser structure: the browser created by Puppeteer is controlled through the DevTools protocol.

// Create a Browser
const browser = await puppeteer.launch()

// browser.newPage() creates a Page in the Browser's default BrowserContext
const page = await browser.newPage()

// Navigate the Page
await page.goto("URL")
  • Puppeteer communicates with the browser using the DevTools protocol;
  • A Browser instance can own multiple browser contexts;
  • A BrowserContext instance defines a browsing session and can own multiple pages;
  • A Page is what we usually think of as a tab.
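As a minimal sketch of these three levels (assuming a Puppeteer version that still provides createIncognitoBrowserContext; newer releases rename it to createBrowserContext), a context can also be created explicitly:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();

  // An isolated BrowserContext: cookies and cache are separate from other contexts
  const context = await browser.createIncognitoBrowserContext();

  // A Page (tab) created inside that context
  const page = await context.newPage();
  await page.goto("https://example.com");

  await browser.close();
})();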

3. Puppeteer APIs

        The official website already covers the rest quite clearly, so let's start learning the APIs Puppeteer provides.

3.1 Browser

        When Puppeteer connects to a Chromium instance, a Browser object is created through puppeteer.launch or puppeteer.connect. The created browser can be controlled by passing in options such as headless, defaultViewport, timeout, and so on.
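As a minimal sketch of those launch options (the values here are only examples):

const browser = await puppeteer.launch({
  headless: true,                                // run without a visible window
  defaultViewport: { width: 1280, height: 800 }, // viewport applied to every new page
  timeout: 30000,                                // ms to wait for the browser to start
});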

        Browser disconnection and reconnection:

const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
  // Store the endpoint so we can reconnect to Chromium later
  const browserWSEndpoint = browser.wsEndpoint();
  // Disconnect puppeteer from Chromium
  browser.disconnect();

  // Use the endpoint to re-establish the connection
  const browser2 = await puppeteer.connect({browserWSEndpoint});
  // Close Chromium
  await browser2.close();
});

3.2 Page

        Page provides methods for manipulating a tab or an extension's background page. A Browser instance can have multiple Page instances; a Page is created by calling newPage() on the Browser, as follows:

const browser = await puppeteer.launch();
const page = await browser.newPage()

        page is the central object in Puppeteer: it gives us things to operate on, whether that is reading DOM content, taking screenshots, saving a PDF, and so on. Page also emits Node-style events, and listeners can be added and removed with on, once and removeListener; the full list of events is in the official documentation.

         We can monitor page behaviour through these events, in particular request and response, which let us observe the page's requests and intercept data. That is quite useful.

3.2.1 page.on(request/response)

const xhr = new XMLHttpRequest();

xhr.open("GET", "https://dog.ceo/api/breeds/image/random");

xhr.send();

xhr.onload = () => {
  console.log(xhr.response);
};

The code above issues a request from the page. How can we observe this request in Puppeteer and read the response data?

const browser = await puppeteer.launch({ headless: false });

const page = await browser.newPage();

// Note: for the code below I had the project's demo.html served locally
await page.goto("http://127.0.0.1:5500/demo.html");

page.on("request", (req) => {
  console.log("req.headers()", req.headers());
  console.log("req.method()", req.method());
  console.log("postData()", req.postData());
});

page.on("response", async (res) => {
  console.log(await res.text());
});

Running this logs the parameters of each request made by the page, along with the corresponding response bodies.

        We can also modify request parameters before a request is sent, intercept responses, and so on. For more details, see the examples on the Puppeteer official website: Puppeteer API | Puppeteer Chinese Documentation | Puppeteer Chinese Website

3.2.2 page.$()

        This method runs document.querySelector inside the page. If no element matches the selector it returns null; otherwise it returns a handle to the matching DOM element, which can then be used to operate on the page.

const btn = await page.$("button");

await btn.click();

3.2.3 page.$$()

        This method runs document.querySelectorAll; the usage is otherwise the same as above.
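A small sketch (the selector is just an example): collect all link elements and click the first one.

const links = await page.$$("a");
console.log("links found:", links.length);
if (links.length > 0) {
  await links[0].click();
}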

3.2.4 page.$eval()

        page.$eval(selector, pageFunction[, ...args]) passes the first element matching the selector as the first argument to pageFunction. The first parameter is the selector and the second is the callback: page.$eval('selector', callback). Additional values can be forwarded from the Node context into the page through the third parameter, ...args, which makes dynamic parameters possible. This is one of the most commonly used methods, so let's look at it in detail:

  // page.$eval method: page.$eval('selector', callback, [...args])
  const data = await page.$eval("h1", (h1) => {
    // This runs inside the Puppeteer-controlled browser, not in the outer Node environment,
    // so look for the log output in the opened browser's console.
    // The returned value becomes the result of the method.
    console.log("inside the Puppeteer browser", h1);
    return "whatever is returned becomes the method's return value";
  });

  console.log("page.$eval result:", data);

         Inside page.$eval(), the callback behaves exactly as if you were operating on the page in a real browser console, which matches our usual habits; I prefer this method for reading and operating on page data.

        Arguments can also be passed from the outer Node environment into the page. For example, if a variable created in your Node code is needed when the function runs on the page, you can pass it in via ...args:

 const params = {
    p1: "p1",
    p2: 30,
    p3: {
      data: [1, 2, 3, 4],
    },
  };
// Important: ...args must be passed AND declared in pageFunction's parameter list!
  const data = await page.$eval(
    "h1",
    (h1, params) => {
      // This runs inside the Puppeteer browser, not in the outer Node environment,
      // so look for the log output in the opened browser's console.
      // The returned value becomes the result of the method.
      console.log("inside the Puppeteer browser", h1);
      console.log("external parameters", params);
      return "whatever is returned becomes the method's return value";
    },
    params
  );

 A function, however, cannot be passed into the page this way: the extra arguments are serialized, so a function passed as a parameter is not usable inside the browser. We will introduce another API for that later.

const saveImg = (path) => {
  console.log("someone called saveImg", path);
};

const data = await page.$eval(
  "h1",
  (h1, saveImg) => {
    // This runs inside the Puppeteer browser, not in the outer Node environment,
    // so look for the log output in the opened browser's console.
    // Functions do not survive the argument serialization, so saveImg is not callable here.
    console.log("inside the Puppeteer browser", h1);
    console.log("external parameter", saveImg);
  },
  saveImg
);

 

 The workaround is to return data from the page function and then call the external function back in the Node environment. But once we introduce another API below, that API is the recommended approach; returning data like this still does not match our usual programming habits.

3.2.5 page.addScriptTag()

        Puppeteer lets us insert script tags, or script code snippets, into the page:

  // Inject a script snippet
  await page.addScriptTag({
    content: "const a='aaaaa'",
  });

 To insert a JS file instead, use the url option:
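A sketch (the path below is hypothetical):

// Inject an external JS file by url
await page.addScriptTag({ url: "/js/helper.js" });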

 But the url is resolved relative to the page's URL, so the file may not actually be reachable.

 For that reason it is often simpler to inject the code directly through content:

  await page.addScriptTag({
    content:
      "const saveImg = (path) => console.log('saveImg was called, path=', path); ",
  });

However, injecting code this way can clash with the page's own scripts. If you only need to call a function from inside the page, the method described later (page.exposeFunction) is recommended.

3.2.6 page.addStyleTag(options)

        This method works the same way, but injects styles instead of scripts.
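A sketch of the same idea for styles (the CSS and path are just examples):

// Inject a CSS snippet
await page.addStyleTag({ content: "body { background: #f5f5f5; }" });

// Or reference a stylesheet by url, resolved relative to the page as with addScriptTag
// await page.addStyleTag({ url: "/css/theme.css" });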

3.2.7 page.click

        Simulates a click on an element. In the example above we used page.$() and then called click(); the same thing can be done directly with page.click(selector).
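A sketch (the selector is just an example):

// Click directly by selector instead of page.$() + click()
await page.click("button");

// Options are supported as well, e.g. a double click
// await page.click("button", { clickCount: 2 });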

3.2.8 page.cookies()

        Returns the page's cookies:

await page.goto("https://www.baidu.com");
const cookies = await page.cookies("https://www.baidu.com");
console.log(cookies);

        A later section describes how to set cookies. For pages that require login, cookies must be set in advance before requests can be sent and data fetched, so cookies are very important.

3.2.9 page.evaluate(pageFunction[, ...args])

        Runs a function in the page's context. The usage is similar to page.$eval(), except there is no selector; argument passing works the same way. This method lets us manipulate in-page data more freely.
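A minimal sketch (the extra argument is just an example):

const info = await page.evaluate((extra) => {
  // Runs in the page context: document and window are available, as in the console
  return { title: document.title, url: location.href, extra };
}, "passed in from Node");

console.log(info);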

3.2.10 page.exposeFunction(name, puppeteerFunction)

        With this method we can finally call Node functions from within the page context. name is the function name attached to the page's window, and puppeteerFunction is the Node function that actually runs when that name is called. Let's use it to implement the saveImg call:

const saveImg = (path) => {
  console.log("someone called saveImg", path);
};

// Register the function on the page's window
await page.exposeFunction("saveImg", (path) => saveImg(path));

// Run directly in the page context, no selector needed; roughly speaking, the context
// is the console of the Puppeteer-controlled browser
await page.evaluate(() => {
  console.log("inside the Puppeteer browser");
  saveImg("/img/test");
});

         This is the most convenient approach. Imagine that the images and videos we crawl need to be stored and processed by other Node modules; being able to call those functions directly from the in-page context fits our programming mindset much better.

3.2.11 page.focus(selector), page.hover(selector), page.mouse

        These methods are relatively simple.
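A quick sketch (selectors and coordinates are hypothetical):

await page.focus("input");        // focus an element
await page.hover("a");            // move the mouse over an element
await page.mouse.move(100, 200);  // page.mouse is a Mouse object for low-level control
await page.mouse.click(100, 200); // click at page coordinates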

3.2.12 page.pdf()

        Saves the page as a PDF; see the official documentation for the full list of options.
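A minimal sketch (the file name is just an example; PDF generation only works in headless mode):

await page.pdf({
  path: "page.pdf",      // where to save the file
  format: "A4",
  printBackground: true, // include CSS backgrounds
});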

3.2.13 page.setCookie(...cookies)

        This method is very important! It is used constantly when crawling URLs that require cookies: the page's cookies need to be obtained and set in advance. Usage is as follows:

await page.setCookie({
  name: "BD_UPN",
  value: "12314753",
});

How do we set a real site's cookies on our page?

 The page's real cookies can be obtained as a string through document.cookie:

 Split that string with split(';'); each item has the form name=value, so split each item on '=' and set the resulting objects with await page.setCookie(cookieObject1, cookieObject2, ...):

const cookies ="BIDUPSID=;BDSFRCVID=Os-....省略....B64_BOT=1;channel=bai47c2-a2fb-4aae994ac343";
 
 const arr = cookies.split(";");
  arr.forEach(async (i) => {
    const [name, value] = i.split("=");
    let obj = { name, value };
    await page.setCookie(obj);
  });

Note: the name and value must not contain spaces, otherwise an "Invalid cookie fields" error is thrown. You can run replaceAll(' ', '') first to strip all spaces and then do the split.

 In this way Baidu's cookies are set on our page. Some page requests must carry cookies, and this is how they can be supplied.

3.2.14 page.setRequestInterception(value[Boolean])

        Enables request interception, which activates the request.abort, request.continue and request.respond methods. This provides the ability to modify network requests made by the page. Once interception is enabled, every request stalls unless it is explicitly continued, responded to, or aborted.

const puppeteer = require('puppeteer');

puppeteer.launch().then(async browser => {
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', interceptedRequest => {
    if (interceptedRequest.url().endsWith('.png') || interceptedRequest.url().endsWith('.jpg'))
      interceptedRequest.abort();
    else
      interceptedRequest.continue();
  });
  await page.goto('https://example.com');
  await browser.close();
});

To work with the interceptor you first need to be familiar with the request object. request.abort cancels the request, request.continue lets it proceed, and request.respond ends the request and returns a response of your own (status code, headers, body):

request.respond({
  status: 404,
  contentType: 'text/plain',
  body: 'Not Found!',
});

Use request.postData() for the request body, request.url() for the request URL (query-string parameters can be read from the URL), and request.method() for the HTTP method.

3.2.15 Focus on the request: Request, Response

        When crawling, the data we want comes not only from the rendered page but also from API interfaces. So it is just as important to master request interception and the attributes of the Request and Response objects.

Request:

request.abort([errorCode]):

        To abort a request, request interception must first be enabled with page.setRequestInterception(true). If interception is not enabled, an exception is thrown immediately: (node:17688) UnhandledPromiseRejectionWarning: Error: Request Interception is not enabled!

Once interception is enabled, we can choose to abort only specific requests; aborting everything would prevent even the page itself from loading.

request.continue:

        Continues the request, with an optional overrides object. Request interception must be enabled with page.setRequestInterception first; if it is not, an exception is thrown immediately.

 Through the overrides we can rewrite the request's URL, headers, post data, and so on, for example:
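A sketch (the header name is hypothetical; interception must be enabled first):

await page.setRequestInterception(true);

page.on("request", (req) => {
  // Add a custom header to every outgoing request, then let it proceed
  const headers = Object.assign({}, req.headers(), { "x-custom-header": "1" });
  req.continue({ headers });
});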

request.headers(): Get request headers

request.method(): Get the request method

request.postData(): Gets the request body (GET query parameters live in the URL returned by url(); postData refers specifically to POST data)

request.url(): URL of the request

Response:

response.buffer(): Promise which resolves to a Buffer containing the response body.

response.headers(): response headers

response.json(): parses the response body as JSON; if the body cannot be parsed with JSON.parse, this method throws an error.

response.remoteAddress(): Gets the IP address and port of the remote server.

response.text(): Promise which resolves to the text of the response body.

In the captured traffic of our example, the first request fetches the page itself and the second hits the API interface.

3.2.16 page.type(selector, text[, options])

        This is the main method for simulating keyboard input, character by character, into a page element.

page.type('#mytextarea', 'Hello'); // types immediately
page.type('#mytextarea', 'World', {delay: 100}); // types more slowly, like a real user

3.2.17 page.waitForRequest

        As mentioned above, much of our data can be fetched directly from API requests, so we often need to wait for a particular request; page.waitForRequest(urlOrPredicate[, options]) exists for exactly that purpose.

        page.waitForResponse(urlOrPredicate[, options]) can likewise wait for the response to a particular request.
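A sketch (the URL fragment and selector are hypothetical): trigger the request with a click and wait for the matching response in parallel.

const [response] = await Promise.all([
  page.waitForResponse((res) => res.url().includes("/api/") && res.status() === 200),
  page.click("button"),
]);
console.log(await response.json());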

4. Summary

        The above is just a brief walk-through of some commonly used APIs; there are also Worker and Frame APIs, and if needed I can publish another article covering them. For crawling page data, requests, responses, and DOM selectors are used the most; for automated testing, page.type() character input and page.$() element selection are the most common. If you master the APIs above, crawling data should pose no problem. I will also publish a data-crawling example to show how these APIs are used in practice.


Origin blog.csdn.net/weixin_47746452/article/details/131751015