Java crawler technology: Jsoup study notes

Section 1 Introduction to Jsoup

1. jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string of HTML text. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
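As a rough illustration of parsing HTML text and reading data back out with a CSS selector, here is a minimal sketch. It assumes the jsoup jar is already on the classpath; the class name JsoupIntroSketch and the sample markup are made up for this example and are not from the original notes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupIntroSketch {
    // Hypothetical sample markup, used only for illustration.
    static final String SAMPLE_HTML =
        "<html><head><title>Demo page</title></head>"
        + "<body><div id=\"main\"><a href=\"/post/1\">First post</a></div></body></html>";

    // Parse the HTML text and return the text of the first link inside #main.
    static String firstLinkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("#main a").first(); // CSS selector, jQuery-style
        return link == null ? "" : link.text();
    }

    // Parse the HTML text and return the href attribute of the first link inside #main.
    static String firstLinkHref(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("#main a").first();
        return link == null ? "" : link.attr("href");
    }

    public static void main(String[] args) {
        System.out.println(firstLinkText(SAMPLE_HTML));
        System.out.println(firstLinkHref(SAMPLE_HTML));
    }
}
```

Both helpers return an empty string when nothing matches, since first() returns null for an empty selection.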

2. Create a Maven project

Add the dependencies to pom.xml

<!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>

Take the blog garden (cnblogs) as an example:
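A minimal sketch of this step, using the two dependencies above: HttpClient downloads the page and jsoup parses it. The URL https://www.cnblogs.com/ is an assumption about which page is meant, and the class and method names are hypothetical.

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;

class FetchAndParseSketch {
    // Assumed address of the blog garden home page.
    static final String PAGE_URL = "https://www.cnblogs.com/";

    // Download the page body as a string with Apache HttpClient.
    static String fetch(String url) throws IOException {
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            return EntityUtils.toString(response.getEntity(), "UTF-8");
        }
    }

    // Parse the HTML with jsoup and return the content of its <title> element.
    static String titleOf(String html) {
        return Jsoup.parse(html).title();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Page title: " + titleOf(fetch(PAGE_URL)));
    }
}
```

Keeping the download and the parsing in separate methods means the parsing part can be tried on any HTML string without network access.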

result

Error:

Description: Missing artifact commons-codec:commons-codec:jar:1.10
Resource: pom.xml
Path: /Jsoup
Location: line 1
Type: Maven Dependency Problem

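Solution: wrap the dependencies in a <dependencyManagement> tag

<dependencyManagement>
    <dependencies>
        <!-- the dependency entries go here -->
    </dependencies>
</dependencyManagement>

Note that <dependencyManagement> only declares versions and does not put the jars on the classpath, so the error goes away because the dependencies are no longer resolved. If the jars are actually needed, the more usual fix is to delete the commons-codec folder from the local repository (~/.m2/repository) and update the project so that Maven downloads it again.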

Section 2 Finding DOM elements with Jsoup

getElementById(String id): find an element by its id

getElementsByTag(String tagName): find elements by tag name

getElementsByClass(String className): find elements by CSS class name

getElementsByAttribute(String key): find elements that have the given attribute

getElementsByAttributeValue(String key, String value): find elements whose attribute has the given value
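The lookups by id, tag name, attribute name, and attribute value might be used roughly as follows. This is only a sketch on made-up markup; the class name DomLookupSketch and the sample HTML are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DomLookupSketch {
    // Hypothetical sample markup: a navigation block with two links.
    static final String SAMPLE_HTML =
        "<div id=\"nav\">"
        + "<a href=\"/home\" target=\"_blank\">Home</a>"
        + "<a href=\"/about\">About</a>"
        + "</div>";

    static Document sampleDoc() {
        return Jsoup.parse(SAMPLE_HTML);
    }

    public static void main(String[] args) {
        Document doc = sampleDoc();

        // By id: returns a single Element, or null if no element has that id.
        Element nav = doc.getElementById("nav");
        System.out.println(nav.tagName()); // div

        // By tag name: returns every <a> element in the document.
        Elements links = doc.getElementsByTag("a");
        System.out.println(links.size()); // 2

        // By attribute name: only the first link has a target attribute.
        Elements withTarget = doc.getElementsByAttribute("target");
        System.out.println(withTarget.text()); // Home

        // By attribute name and value: the link whose href is exactly "/about".
        Elements about = doc.getElementsByAttributeValue("href", "/about");
        System.out.println(about.text()); // About
    }
}
```

Only getElementById returns a single Element. The other lookups return an Elements collection, which is empty rather than null when nothing matches.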

Again, take the blog garden as an example

1. getElementsByClass(String className): find elements by CSS class name
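A sketch of a class-name lookup on a list of posts. The class name post_item and the markup are hypothetical stand-ins, not the real structure of the blog garden page, which would need to be checked in the browser first.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ClassLookupSketch {
    // Hypothetical sample markup: two post entries and one unrelated block.
    static final String SAMPLE_HTML =
        "<div class=\"post_item\"><a href=\"/p/1\">First post</a></div>"
        + "<div class=\"post_item featured\"><a href=\"/p/2\">Second post</a></div>"
        + "<div class=\"sidebar\">Ads</div>";

    // Collect the text of every element that carries the post_item class.
    static List<String> postTitles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> titles = new ArrayList<>();
        for (Element item : doc.getElementsByClass("post_item")) {
            titles.add(item.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        for (String title : postTitles(SAMPLE_HTML)) {
            System.out.println(title);
        }
    }
}
```

The second entry carries two classes and still matches, because getElementsByClass checks whether the given name is among an element's classes.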