How to Implement Dynamic Web Scraping on a Node.js Server Using Puppeteer

Yiniu Cloud Proxy

Introduction

Dynamic web scraping refers to obtaining data that is generated dynamically on a web page by simulating browser behavior, such as content rendered by JavaScript or data fetched via Ajax. The difficulty of dynamic scraping lies in handling asynchronous events on the page, such as clicks, scrolling, and waiting. Puppeteer is a Node.js library that provides a high-level API for controlling Chrome or Chromium browsers, which makes it well suited to dynamic web scraping. This article introduces how to use Puppeteer to scrape dynamic web pages on a Node.js server and closes with a simple example.

Overview

At the core of Puppeteer is the Browser class: launching a Chrome or Chromium instance returns a Browser object. A Browser can create multiple Page objects, each corresponding to a browser tab, which are used to load and operate on web pages. The Page object provides a series of methods that simulate user behavior, such as typing, clicking, scrolling, taking screenshots, and generating PDFs. A Page can also listen to events on the web page, such as requests, responses, errors, and loads. Together, these methods and events make it possible to scrape dynamic web pages.
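
For orientation, here is a minimal sketch of this structure, showing a Browser being launched, a Page being created, and a few page events being observed (https://www.example.com is used purely as a placeholder URL):

// Import the puppeteer library
const puppeteer = require('puppeteer');

(async () => {
  // Launch a Chromium instance and get a Browser object
  const browser = await puppeteer.launch();
  // Create a Page object (one Page corresponds to one browser tab)
  const page = await browser.newPage();

  // Listen to events emitted by the page
  page.on('request', req => console.log('request:', req.url()));
  page.on('response', res => console.log('response:', res.status(), res.url()));
  page.on('pageerror', err => console.error('page error:', err.message));

  // Load a page, then close the browser
  await page.goto('https://www.example.com');
  await browser.close();
})();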

Walkthrough

To use Puppeteer for dynamic web scraping, you first need to install the Puppeteer library. It can be installed via npm or yarn:

// Install with npm
npm i puppeteer

// Install with yarn
yarn add puppeteer

Once installed, you can import the Puppeteer library into your Node.js code and use it to launch a browser and create pages:

// Import the puppeteer library
const puppeteer = require('puppeteer');

// Launch the browser and create a page
(async () => {
  // Launch the browser; options such as headless mode and a proxy can be passed in
  const browser = await puppeteer.launch({
    headless: false, // whether to run in headless mode; defaults to true
    args: ['--proxy-server=http://username:password@domain:port'] // proxy server: use the domain, port, username and password of your Yiniu Cloud crawler proxy
  });

  // Create a page
  const page = await browser.newPage();
})();

Once the page is created, you can use the methods of the page object to load and manipulate the web page. For example, you can use the page.goto(url) method to visit a URL and wait for the page to load:

// Visit a URL and wait until the network is idle (i.e. no network connections for at least 500 ms)
await page.goto('https://www.example.com', {
  waitUntil: 'networkidle0'
});

Then, you can use the page.evaluate(pageFunction, ...args) method to execute some JavaScript code in the browser and return the result. For example, you can get the text content of an element on a web page:

// Get the text content of the h1 element on the page
const h1Text = await page.evaluate(() => {
  return document.querySelector('h1').textContent;
});
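
Arguments can also be passed from Node into the page function through the ...args parameter of page.evaluate. A minimal sketch (the 'h1' selector is used here only for illustration):

// Pass a selector from Node into the browser context
const text = await page.evaluate(sel => document.querySelector(sel).textContent, 'h1');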

In addition to the evaluate method, the page object provides other methods for getting and manipulating elements on the web page, such as page.$(selector), page.$$(selector), page.click(selector), page.type(selector, text), etc. For example, you can simulate a user typing a keyword into the search box and clicking the search button:

// Type the keyword into the search box
await page.type('#search-input', 'puppeteer');

// Click the search button
await page.click('#search-button');
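
As a further illustration, page.$$eval can run a function over every element matching a selector and return the result to Node. A minimal sketch (the .result-link selector is hypothetical and used only for illustration):

// Collect the text of every element matching a selector
const linkTexts = await page.$$eval('.result-link', els => els.map(el => el.textContent));
console.log(linkTexts);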

Sometimes, we need to wait for some asynchronous event to occur before proceeding to the next step, such as waiting for an element to appear or for a request to complete. For this, Puppeteer provides methods such as page.waitForSelector(selector, options) and page.waitForFunction(pageFunction, options, ...args) to set the waiting condition (older Puppeteer versions exposed a single page.waitFor method, which has since been split into these dedicated methods). For example, you can wait for the list of search results to appear before fetching its contents:

// Wait for the list of search results to appear
await page.waitForSelector('#search-results');

// Get the text content of the search results list
const resultsText = await page.evaluate(() => {
  return document.querySelector('#search-results').textContent;
});
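
If you instead need to wait for a specific request to finish, page.waitForResponse accepts a URL or a predicate over the response. A minimal sketch (the /api/search URL fragment is a hypothetical example):

// Wait for a matching Ajax response and read its JSON body
const response = await page.waitForResponse(res => res.url().includes('/api/search') && res.status() === 200);
const data = await response.json();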

Finally, when we have finished crawling the webpage, we can use the page.screenshot(options) or page.pdf(options) method to save a screenshot or a PDF of the webpage. For example, you can save the webpage as a PNG image:

// Save the page as a PNG image
await page.screenshot({ path: 'example.png' });
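
Similarly, page.pdf can save the page as a PDF; note that PDF generation generally requires the browser to be running in headless mode. A minimal sketch:

// Save the page as an A4 PDF (requires headless mode)
await page.pdf({ path: 'example.pdf', format: 'A4' });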

When we no longer need the browser and the page, we can use the browser.close() method to close the browser:

// Close the browser
await browser.close();

Example

A simple example follows, using Puppeteer to scrape a dynamic web page on a Node.js server. The goal is to visit the Baidu homepage, type the keyword "puppeteer" into the search box, click the search button, wait for the search results to appear, and save the title and URL of the first result link to a file.

// Import the puppeteer library and the fs library (for file operations)
const puppeteer = require('puppeteer');
const fs = require('fs');

// Define an async function that performs the dynamic web scraping
(async () => {
  // Launch the browser, pointing the proxy server at the domain, port, username and password of the Yiniu Cloud crawler proxy
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://username:password@domain:port'] // replace with your own proxy credentials
  });

  // Create a page
  const page = await browser.newPage();

  // Visit the Baidu homepage and wait until the network is idle
  await page.goto('https://www.baidu.com', {
    waitUntil: 'networkidle0'
  });

  // Type the keyword "puppeteer" into the search box
  await page.type('#kw', 'puppeteer');

  // Click the search button
  await page.click('#su');

  // Wait for the list of search results to appear
  await page.waitForSelector('#content_left');

  // Get the title and URL of the first result link
  const firstResult = await page.evaluate(() => {
    // Get the element of the first result link
    const firstLink = document.querySelector('#content_left .result.c-container a');
    // Return its title and URL
    return {
      title: firstLink.innerText,
      url: firstLink.href
    };
  });

  // Save the title and URL to a file
  fs.writeFileSync('result.txt', `${firstResult.title}\n${firstResult.url}`);

  // Close the browser
  await browser.close();
})();
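
To run the example, save the code to a file, for example scrape-baidu.js (a file name chosen here purely for illustration), and execute it with node scrape-baidu.js. The title and URL of the first search result are then written to result.txt in the current directory.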

Conclusion

This article has introduced how to use Puppeteer to scrape dynamic web pages on a Node.js server and walked through a simple example. Puppeteer is a powerful and flexible library that can handle a wide range of complex dynamic scraping scenarios. When using Puppeteer for dynamic scraping, keep the following points in mind:

  • Set up a suitable proxy server to avoid being blocked or rate-limited by the target website. High-quality proxy IPs, such as those provided by the Yiniu Cloud crawler proxy, can improve the success rate of crawling.
  • Set appropriate waiting conditions to ensure that the asynchronous events on the page have completed before proceeding to the next step. Methods such as page.waitForSelector, page.waitForFunction, and page.waitForResponse can be used to wait for elements, functions, or requests.
  • Set up appropriate exception handling to deal with errors that may occur. Errors can be caught and handled with a try...catch statement, as sketched below.
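
As an illustration of the last point, here is a minimal sketch of wrapping the scraping flow in try...catch...finally so that the browser is always closed, even when an error occurs (https://www.example.com is again only a placeholder URL):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://www.example.com', { waitUntil: 'networkidle0' });
    // ... scraping steps go here ...
  } catch (err) {
    // Handle errors thrown by any of the steps above
    console.error('Scraping failed:', err);
  } finally {
    // Always close the browser, even after an error
    await browser.close();
  }
})();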

I hope this article was helpful to you, and if you have any questions or suggestions, please leave a comment below. Thanks!

Origin blog.csdn.net/ip16yun/article/details/132475781