Introduction to methods of parsing HTML in Node.js

Parsing HTML is a common task in web development, especially when we need to extract data from a web page or manipulate the DOM. Mastering the various ways of parsing HTML in Node.js can greatly improve our efficiency in extracting and processing web page data. This article explains how to parse HTML in Node.js.

Basic concepts

HTML parsing means converting HTML text into a data structure that code can work with, usually a DOM (Document Object Model). The DOM is a tree that represents the structure and content of a web page, which lets us read and modify the page with JavaScript.
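To make "text in, tree out" concrete, here is a minimal sketch of a toy parser. It is not how any real library works: the function name parseSimpleHtml and the node shape ({ tag, children } and { text }) are made up for illustration, and it only handles simple, well-formed markup without attributes, comments, or void elements.

```javascript
// Toy illustration only: real parsers implement the full HTML specification.
function parseSimpleHtml(html) {
  const root = { tag: '#root', children: [] };
  const stack = [root]; // the last entry is the element currently being filled
  const tokenPattern = /<\/?([a-zA-Z][a-zA-Z0-9]*)>|([^<]+)/g;
  let match;

  while ((match = tokenPattern.exec(html)) !== null) {
    const [token, tagName, text] = match;
    const parent = stack[stack.length - 1];

    if (text !== undefined) {
      // Text between tags becomes a text node (whitespace-only text is skipped).
      if (text.trim() !== '') parent.children.push({ text: text.trim() });
    } else if (token.startsWith('</')) {
      // A closing tag finishes the current element.
      if (stack.length > 1) stack.pop();
    } else {
      // An opening tag creates a child element and makes it the current one.
      const node = { tag: tagName.toLowerCase(), children: [] };
      parent.children.push(node);
      stack.push(node);
    }
  }
  return root;
}

const tree = parseSimpleHtml('<html><body><h1>Hello</h1><p>World</p></body></html>');
console.log(JSON.stringify(tree, null, 2));
```

The nested children arrays are the tree: html contains body, which contains h1 and p, each holding one text node. The libraries introduced in this article build a far more complete version of this structure.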

Commonly used HTML parsing methods

The following are several HTML parsing methods commonly used in Node.js:

1. Cheerio: Cheerio is a jQuery-like library that parses HTML and manipulates the DOM using CSS selectors on the server side. It is suitable for parsing static HTML pages.

2. jsdom: jsdom is a library that simulates a DOM environment in Node.js. It can parse and manipulate HTML while also supporting many features that emulate a browser environment, such as event handling and asynchronous requests.

3. htmlparser2: htmlparser2 is a fast, event-driven HTML parser. It reports tags and text through callbacks as it reads the input, so it does not have to build a full DOM tree first. It is typically used for processing large HTML documents or streaming data.

Practical example: Parsing HTML with Cheerio

The following is a practical example of using Cheerio to parse HTML, including basic routing and request processing. Make sure Node.js and npm are installed in your development environment.

1. First, create a new folder and run the following command in the folder to initialize the project:

npm init -y

2. Install the required dependencies:

npm install express cheerio axios

3. Create a file named index.js and write the following code:

const express = require('express');
const axios = require('axios');
const cheerio = require('cheerio');  // Import cheerio, which is used to parse HTML

const app = express();
const PORT = 3000;

app.get('/', async (req, res) => {
  try {
    // Use Axios to send a GET request and fetch the page's HTML
    const response = await axios.get('https://apifox.com/blog/mock-manual/'); // Replace with the URL of the page you want to parse
    const html = response.data;  // The HTML content of the response

    const $ = cheerio.load(html);  // Pass the HTML text to cheerio to create a jQuery-like object

    // Use a selector on the cheerio object to find the page title and extract its text
    const title = $('title').text();

    res.send(`Title: ${title}`);  // Send the title to the client as the response
  } catch (error) {
    console.error(error);
    res.status(500).send('An error occurred');  // Send an error response if something goes wrong
  }
});

// Start the server and listen on the specified port
app.listen(PORT, () => {
  console.log(`Server is running on port ${PORT}`);
});

In the above code, the comments explain what each key step does:

  • Initiate a GET request through axios.get() to obtain the HTML content of the web page.
  • Call cheerio.load(html) to create a Cheerio object, $, that can be used to select DOM elements.
  • Get the text content of the <title> element via $() using a jQuery-like selector.
  • Finally, send the extracted title to the client as the response.

In this case, we use Express to create a simple server. When the root route is accessed, we use Axios to fetch the HTML content of the web page, then use Cheerio to parse it and extract the page title. Visit http://localhost:3000/ in a browser or API tool and you will see the response.

Tips, Tricks & Considerations

  • When using Cheerio, jsdom, or htmlparser2, it's important to understand their documentation and usage to take full advantage of their capabilities.
  • When parsing complex dynamic pages whose content is generated by JavaScript, consider using a tool that drives a real browser, such as Puppeteer, because a static parser only sees the HTML as it was delivered.

Use API tools to debug backend interfaces

Take Apifox as an example. Apifox = Postman + Swagger + Mock + JMeter. It supports debugging interfaces over protocols such as http(s), WebSocket, Socket, gRPC, and Dubbo, and it integrates with IDEA through a plug-in. Once back-end developers finish writing a service interface, Apifox can be used during the testing phase to verify that the interface behaves correctly. Its graphical interface helps speed up project delivery.

The example in this article can be tested with Apifox. After creating a new project, select "Debug Mode" in the project, fill in the request address, and send the request to get the response. The result for the practical example above is shown in the figure:

Summary

The Node.js ecosystem offers several libraries for parsing HTML, including Cheerio, jsdom, and htmlparser2. Choose the one that suits your needs: Cheerio for static pages, jsdom when you need a browser-like environment, and htmlparser2 for large documents or streaming data.

Origin blog.csdn.net/LiamHong_/article/details/134202208