Jsoup introduction

What is jsoup

jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string of HTML text. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.

The role of jsoup

The main functions of jsoup are as follows:

  1. Parse HTML from a URL, file or string;
  2. Use DOM or CSS selectors to find and retrieve data;
  3. Manipulate HTML elements, attributes, and text.
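
The three functions above can be sketched in a few lines. This is a minimal illustration, not taken from the original article: the class name JsoupOverviewSketch and the HTML string are made up, and the input is parsed from a string so that no network access is needed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the HTML below is invented for illustration.
final class JsoupOverviewSketch {

    static final String HTML =
        "<html><head><title>Demo page</title></head>"
      + "<body><p id=\"intro\" class=\"lead\">Hello, <b>jsoup</b>!</p></body></html>";

    // 1. Parse HTML from a string (Jsoup.parse also has overloads for a File or a URL).
    static Document parse() {
        return Jsoup.parse(HTML);
    }

    // 2. Find data with a CSS selector and read the combined text.
    static String introText() {
        return parse().select("p#intro").text();
    }

    // 3. Manipulate an element: overwrite an attribute and replace its text.
    static Element rewrittenIntro() {
        Element intro = parse().getElementById("intro");
        intro.attr("class", "updated");
        intro.text("Replaced text");
        return intro;
    }

    public static void main(String[] args) {
        System.out.println(introText());
        System.out.println(rewrittenIntro().outerHtml());
    }
}
```

Parsing from a string keeps the sketch repeatable; the same select and attr calls work on a Document fetched from a URL.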

How to use jsoup

Add the dependency

<!--Jsoup-->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>

Use DOM methods to traverse the document

Element acquisition

  1. Find an element by id: getElementById
  2. Find elements by tag name: getElementsByTag
  3. Find elements by class name: getElementsByClass
  4. Find elements by attribute: getElementsByAttribute
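import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Parse a URL with a timeout of 1000 ms (throws IOException on failure)
Document document = Jsoup.parse(new URL("http://www.baidu.com/"), 1000);

// 1. Find an element by id with getElementById (returns null if nothing matches)
Element element = document.getElementById("city_bj");

// 2. Find elements by tag name with getElementsByTag
element = document.getElementsByTag("title").first();

// 3. Find elements by class name with getElementsByClass
element = document.getElementsByClass("s_name").last();

// 4. Find elements by attribute with getElementsByAttribute,
//    or by attribute name and value with getElementsByAttributeValue
element = document.getElementsByAttribute("abc").first();
element = document.getElementsByAttributeValue("class", "city_con").first();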
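
The snippet above depends on a live page, so the lookups may return null if the page has no matching ids or classes. As a rough, self-contained sketch, the same four lookups can be tried against an inline HTML string. The markup and the class name DomLookupSketch are assumptions made for illustration; only the ids and class names (city_bj, s_name, abc, city_con) are borrowed from the article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the page structure is assumed, not taken from a real site.
final class DomLookupSketch {

    static final String HTML =
        "<html><head><title>City list</title></head><body>"
      + "<div class=\"city_con\">"
      + "<h3 id=\"city_bj\">Beijing</h3>"
      + "<ul>"
      + "<li class=\"class_a\"><span abc=\"1\" class=\"s_name\">Haidian</span></li>"
      + "<li><span class=\"s_name\">Chaoyang</span></li>"
      + "</ul>"
      + "</div></body></html>";

    static Document document() {
        return Jsoup.parse(HTML);
    }

    // 1. By id: returns a single Element, or null when the id is absent.
    static Element byId(String id) {
        return document().getElementById(id);
    }

    // 2. By tag name: first() picks the first match of the returned Elements.
    static String titleText() {
        return document().getElementsByTag("title").first().text();
    }

    // 3. By class name: last() picks the last match.
    static String lastNameText() {
        return document().getElementsByClass("s_name").last().text();
    }

    // 4. By attribute name, and by attribute name plus value.
    static String firstWithAbc() {
        return document().getElementsByAttribute("abc").first().text();
    }

    static String containerTag() {
        return document().getElementsByAttributeValue("class", "city_con").first().tagName();
    }

    public static void main(String[] args) {
        System.out.println(byId("city_bj").text());
        System.out.println(titleText());
        System.out.println(lastNameText());
        System.out.println(firstWithAbc());
        System.out.println(containerTag());
        System.out.println(byId("no_such_id")); // null: check before dereferencing
    }
}
```

Because getElementById returns null and first() or last() return null on an empty result, real code should check the result before calling methods on it.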

Get data from the element

  1. Get the id of the element: id
  2. Get the class name of the element: className
  3. Get the value of one attribute of the element: attr
  4. Get all attributes of the element: attributes
  5. Get the text content of the element: text
// Get the element
Element element = document.getElementById("test");

// 1. Get the id of the element
String str = element.id();

// 2. Get the class name of the element
str = element.className();

// 3. Get the value of an attribute with attr
str = element.attr("id");

// 4. Get all attributes of the element with attributes
str = element.attributes().toString();

// 5. Get the text content of the element with text
str = element.text();
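
To see what these accessors return, here is a small sketch run against an inline element. The HTML, the data-role attribute, and the class name ElementDataSketch are assumptions made for illustration. It also iterates over attributes() instead of calling toString(), which is usually easier to work with.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the element below is invented for illustration.
final class ElementDataSketch {

    static final String HTML =
        "<div id=\"test\" class=\"box highlighted\" data-role=\"panel\">"
      + "Some <em>text</em> here</div>";

    static Element testElement() {
        return Jsoup.parse(HTML).getElementById("test");
    }

    // Collect attribute names by iterating over the Attributes object.
    static List<String> attributeNames() {
        List<String> names = new ArrayList<>();
        for (Attribute attribute : testElement().attributes()) {
            names.add(attribute.getKey());
        }
        return names;
    }

    public static void main(String[] args) {
        Element element = testElement();
        System.out.println(element.id());              // the id attribute
        System.out.println(element.className());       // the raw class attribute value
        System.out.println(element.classNames());      // the class names as a set
        System.out.println(element.attr("data-role")); // one attribute value
        System.out.println(element.attr("missing"));   // empty string, not null
        System.out.println(attributeNames());          // all attribute names
        System.out.println(element.text());            // text of the element and its children
    }
}
```

Note that attr returns an empty string rather than null for a missing attribute, so hasAttr is the way to tell "absent" from "empty".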

Find elements using selector syntax

jsoup supports a selector syntax similar to CSS (or jQuery), which makes searching powerful and flexible. The select method is available on Document, Element, and Elements objects. It is context-sensitive: a selector only matches inside the object it is called on, so you can filter within a specific element or chain select calls.

The select method returns an Elements collection, which provides methods for extracting and processing the results.
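
A short sketch of what "context-sensitive" means in practice: the same selector matches different elements depending on the object select is called on. The HTML and the class name SelectContextSketch are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical demo class; the markup is invented for illustration.
final class SelectContextSketch {

    static final String HTML =
        "<div class=\"city_con\"><ul>"
      + "<li><a href=\"/bj\">Beijing</a></li>"
      + "<li><a href=\"/sh\">Shanghai</a></li>"
      + "</ul></div>"
      + "<a href=\"/other\">Other</a>";

    static Document document() {
        return Jsoup.parse(HTML);
    }

    // select on the Document searches the whole page.
    static int linksInDocument() {
        return document().select("a").size();
    }

    // select on a single Element searches only inside that element.
    static int linksInContainer() {
        Element container = document().select(".city_con").first();
        return container.select("a").size();
    }

    // select on Elements can be chained; each call narrows the previous result.
    static Elements chainedLinks() {
        return document().select(".city_con").select("li").select("a");
    }

    public static void main(String[] args) {
        System.out.println(linksInDocument());
        System.out.println(linksInContainer());
        // Elements.attr reads the attribute from the first match that has it.
        System.out.println(chainedLinks().attr("href"));
        for (Element link : chainedLinks()) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}
```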

Selector overview

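tagname: find elements by tag name, e.g. span

#id: find an element by ID, e.g. #city_bj

.class: find elements by class name, e.g. .class_a

[attribute]: find elements that have a given attribute, e.g. [abc]

[attr=value]: find elements by attribute value, e.g. [class=s_name]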
// tagname: find elements by tag name, e.g. span
Elements span = document.select("span");
for (Element element : span) {
    System.out.println(element.text());
}

// #id: find an element by ID, e.g. #city_bj
String str = document.select("#city_bj").text();

// .class: find elements by class name, e.g. .class_a
str = document.select(".class_a").text();

// [attribute]: find elements that have a given attribute, e.g. [abc]
str = document.select("[abc]").text();

// [attr=value]: find elements by attribute value, e.g. [class=s_name]
str = document.select("[class=s_name]").text();
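
Two details of the snippet above are easy to miss: text() on an Elements result joins the text of every match, and [class=s_name] compares the whole attribute value, while .s_name matches any element that carries that class among others. The sketch below illustrates both points on made-up markup; the class name BasicSelectorSketch and the extra class "big" are assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the markup is invented for illustration.
final class BasicSelectorSketch {

    static final String HTML =
        "<span class=\"s_name\">Haidian</span>"
      + "<span class=\"s_name big\">Chaoyang</span>"
      + "<span abc=\"x\">Other</span>";

    // Returns the combined text of every element matching the selector.
    static String textOf(String selector) {
        Document document = Jsoup.parse(HTML);
        return document.select(selector).text();
    }

    public static void main(String[] args) {
        System.out.println(textOf("span"));            // all three spans, space-separated
        System.out.println(textOf(".s_name"));         // both spans that carry the class
        System.out.println(textOf("[class=s_name]"));  // only the exact attribute value
        System.out.println(textOf("[abc]"));           // the span that has the attribute
        System.out.println(textOf("#missing").isEmpty()); // no match yields an empty string
    }
}
```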

Combining selectors

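el#id: element plus ID, e.g. h3#city_bj
el.class: element plus class, e.g. li.class_a
el.class.class: elements that have several classes
el[attr]: element plus attribute name, e.g. span[abc]
Any combination: e.g. span[abc].s_name
ancestor child: descendants of an element at any depth, e.g. .city_con li finds all li elements under "city_con"
parent > child: direct children of a parent element, e.g.
.city_con > ul > li finds the ul elements that are direct children of city_con, then the li elements that are direct children of those ul elements
parent > *: all direct children of a parent element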


// el#id: element plus ID, e.g. h3#city_bj
String str = document.select("h3#city_bj").text();

// el.class: element plus class, e.g. li.class_a
str = document.select("li.class_a").text();

// el[attr]: element plus attribute name, e.g. span[abc]
str = document.select("span[abc]").text();

// Any combination, e.g. span[abc].s_name
str = document.select("span[abc].s_name").text();

// ancestor child: descendants at any depth, e.g. .city_con li finds all li under "city_con"
str = document.select(".city_con li").text();

// parent > child: direct children of a parent element,
// e.g. .city_con > ul > li finds the ul directly under city_con, then the li directly under each ul
str = document.select(".city_con > ul > li").text();

// parent > *: all direct children of a parent element, e.g. .city_con > *
str = document.select(".city_con > *").text();
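
The difference between the descendant combinator (a space) and the child combinator (>) only shows up with nested markup. The following sketch counts matches on a small nested list. The HTML structure and the class name CombinatorSketch are assumptions made for illustration; only the selector strings come from the article.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; the nested list is invented to show the difference.
final class CombinatorSketch {

    static final String HTML =
        "<div class=\"city_con\">"
      + "<h3 id=\"city_bj\">Beijing</h3>"
      + "<ul>"
      + "<li class=\"class_a\"><span abc=\"1\" class=\"s_name\">Haidian</span></li>"
      + "<li>Chaoyang<ul><li>Wangjing</li></ul></li>"
      + "</ul>"
      + "</div>";

    static final Document DOCUMENT = Jsoup.parse(HTML);

    // Number of elements matched by a selector.
    static int count(String selector) {
        return DOCUMENT.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count(".city_con li"));           // every li, including the nested one
        System.out.println(count(".city_con > ul > li"));    // only the li of the outer ul
        System.out.println(count(".city_con > *"));          // direct children: the h3 and the ul
        System.out.println(count("li.class_a"));             // tag plus class
        System.out.println(DOCUMENT.select("h3#city_bj").text());       // tag plus id
        System.out.println(DOCUMENT.select("span[abc].s_name").text()); // tag, attribute, and class
    }
}
```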

Origin blog.csdn.net/kaihuishang666/article/details/105032884