Parsing HTML pages with Jsoup in Java and Spring Boot

Use Jsoup to parse HTML pages


What is Jsoup?

Jsoup is a Java library for working with HTML. It provides a simple API for fetching, parsing, and extracting data from pages, whether you need the content of one specific tag or want to traverse every element in a document.

How to parse HTML pages using Jsoup?

First, make sure the Jsoup dependency is on your project's classpath. With Maven, add the following to pom.xml (Gradle users can declare the same coordinates, org.jsoup:jsoup:1.15.3):

  <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.15.3</version>
  </dependency>

Then follow the steps below to parse an HTML page with Jsoup:

Step 1: Import the Jsoup classes

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Step 2: Get the page content and parse it into a Document object

String url = "https://example.com"; // replace with the URL of the page you want to parse
Document document = Jsoup.connect(url).get();
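
Step 2 can be sketched a little more fully as follows. This is only an illustration: the class name FetchExample, the user-agent string, and the 10-second timeout are assumptions, not values from the original article. Note that get() throws an IOException on network or HTTP errors, and that Jsoup.parse can be used instead when the HTML is already available as a string.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchExample {

    // Fetch a page over HTTP and parse it.
    // get() throws IOException on network failures or HTTP error statuses.
    static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup example)") // illustrative value
                .timeout(10_000)                          // milliseconds, illustrative value
                .get();
    }

    // Parse HTML that is already in memory.
    // baseUri is used later to resolve relative links.
    static Document parse(String html, String baseUri) {
        return Jsoup.parse(html, baseUri);
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>",
                "https://example.com/");
        System.out.println(doc.title()); // Demo
    }
}
```

Parsing from a string is also convenient in unit tests, because it needs no network access.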

Step 3: Use a selector to get a specific element

Jsoup uses a syntax similar to CSS selectors to select and target page elements. Here are some examples of commonly used selectors:

  • Select elements of a specific tag:
Elements links = document.select("a"); // get all <a> tags
  • Select elements with a specific class attribute:
Elements articles = document.select(".article"); // get all elements with class="article"
  • Select elements with a specific id attribute:
Element header = document.selectFirst("#header"); // get the element with id="header" (select() would return Elements, not Element)
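
The selectors above can be tried out on a small in-memory page. The markup and the class name SelectorExample below are made up for illustration. The last selector shows that tag, class, and attribute conditions can be combined in one expression.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorExample {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
            "<div id=\"header\"><h1>Site title</h1></div>"
          + "<div class=\"article\"><a href=\"/a\">First</a></div>"
          + "<div class=\"article\"><a href=\"/b\">Second</a></div>"
          + "<a class=\"external\" href=\"https://example.org\">Other</a>";

    static Document sample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = sample();

        Elements links = document.select("a");             // by tag
        Elements articles = document.select(".article");   // by class
        Element header = document.selectFirst("#header");  // by id, null if absent
        Elements articleLinks = document.select("div.article > a[href]"); // combined

        System.out.println(links.size());        // 3
        System.out.println(articles.size());     // 2
        System.out.println(header.text());       // Site title
        System.out.println(articleLinks.size()); // 2
    }
}
```

Because selectFirst returns null when nothing matches, it is worth checking the result before calling methods on it.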

Step 4: Traverse elements and extract content

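for (Element link : links) { // links comes from the select("a") call in Step 3
    String text = link.text();       // get the link text
    String href = link.attr("href"); // get the link address
    String value = link.val();       // get the value of the element's value attribute
}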
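
Putting Step 4 together, a minimal sketch of a link extractor might look like this. The class and method names (LinkExtractor, extractLinks) are hypothetical. It uses absUrl("href") rather than attr("href") so that relative links are resolved against the base URI passed to the parser.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkExtractor {

    // Returns one "text -> absolute URL" entry for every <a> that has an href.
    static List<String> extractLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            // absUrl resolves a relative href against the document's base URI
            result.add(link.text() + " -> " + link.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs\">Docs</a><a href=\"https://example.org/x\">X</a>";
        extractLinks(html, "https://example.com/").forEach(System.out::println);
        // Docs -> https://example.com/docs
        // X -> https://example.org/x
    }
}
```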

Common Jsoup methods and their usage

The previous steps briefly introduced a few Jsoup methods. The most commonly used ones are described in more detail below:

  • Jsoup.connect(url).get(): This method is used to connect to the specified URL and parse the page content into a Document object.

  • document.select(selector): Select all elements that match a CSS-style selector. Selectors can match on tag name, class, id, attributes, and combinations of these.

  • element.text(): Get the text content of the element.

  • element.attr(attributeKey): Get the value of the specified attribute of the element, which is often used to get attributes such as link address and image path.

  • element.html(): Get the HTML code inside the element.

  • element.val(): Get the value of a form element such as input, select, or textarea.

  • element.getElementById(id): Find an element by ID, searching this element and everything under it.

  • element.getElementsByClass(className): Find elements that have this class, including this element and everything under it. The match is case-insensitive.

  • element.getElementsByAttribute(key): Find elements that have the named attribute set. The match is case-insensitive.

  • element.getElementsByAttributeStarting(keyPrefix): Find elements that have an attribute whose name starts with the given prefix. Use data- to find elements that carry HTML5 dataset attributes.

  • element.getElementsContainingOwnText(searchText): Find elements that directly contain the specified string. The search is case-insensitive, and the text must appear directly in the element, not in any of its descendants.

  • element.hasText(): Determines whether this element has any text content (not just whitespace).
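
Several of the methods listed above can be exercised on one small fragment. The markup and the class name MethodTour are invented for this sketch. The expected values in the comments follow from that sample markup only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class MethodTour {

    // Hypothetical sample markup, used only for this illustration.
    static final String HTML =
            "<div id=\"box\" class=\"Card\" data-id=\"7\">"
          + "<p>Hello <b>World</b></p>"
          + "<input name=\"q\" value=\"jsoup\">"
          + "<span> </span>"
          + "</div>";

    static Document sample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element box = doc.getElementById("box");
        Element paragraph = box.selectFirst("p");

        System.out.println(paragraph.text());               // Hello World
        System.out.println(paragraph.html());               // inner HTML, including the <b> tag
        System.out.println(box.attr("data-id"));            // 7
        System.out.println(box.selectFirst("input").val()); // jsoup

        // Attribute-based lookups
        System.out.println(doc.getElementsByAttribute("value").size());         // 1
        System.out.println(doc.getElementsByAttributeStarting("data-").size()); // 1

        // Only <b> contains "World" directly; <p> contains it via a descendant
        System.out.println(doc.getElementsContainingOwnText("world").size());   // 1

        // A span holding only whitespace has no text content
        System.out.println(box.selectFirst("span").hasText()); // false
    }
}
```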

Origin blog.csdn.net/weixin_45626288/article/details/132297905