Official website
Chinese manual
jsoup.jar official website download
jsoup.jar Baidu network disk download Extraction code: g6ek
jsoup is a Java HTML parser that can parse HTML directly from a URL or from HTML text. It provides a convenient API for retrieving and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
Parse HTML from a URL, file, or string;
Find and extract data using DOM traversal or CSS selectors;
Manipulate HTML elements, attributes, and text.
The main classes of jsoup
The org.jsoup.Jsoup class
The Jsoup class is the entry point of any jsoup program. It provides methods to load and parse HTML documents from various sources. The methods used in this article are Jsoup.connect(String url), Jsoup.parse() for files and strings, and Jsoup.parseBodyFragment(String html).
The org.jsoup.nodes.Document class
This class represents an HTML document loaded through the jsoup library. Use it to perform operations that apply to the entire HTML document, such as title(), body(), and select(). For the important methods of the Document class, see: http://jsoup.org/apidocs/org/jsoup/nodes/Document.html
The org.jsoup.nodes.Element class
An HTML element consists of a tag name, attributes, and child nodes. Use the Element class to extract data, traverse nodes, and manipulate the HTML. For the important methods of the Element class, see: http://jsoup.org/apidocs/org/jsoup/nodes/Element.html
Preparation
Create a jsoup project and add the jsoup jar package to it. The copy shared on the Baidu network disk is version jsoup-1.11.3.
Crawling content case
Crawl web content
To load a document from a URL, use the Jsoup.connect() method.
The page to be crawled is http://www.ygdy8.net/html/gndy/index.html. The code is:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
/**
 * Fetches the "domestic movie download ranking" from http://www.ygdy8.net/html/gndy/index.html
 */
public class JsoupDemo {
    public static void main(String[] args) {
        // URLLoader is a static method, so it can be called directly through the class name
        JsoupDemo.URLLoader("http://www.ygdy8.net/html/gndy/index.html");
    }

    public static void URLLoader(String url) {
        Document d1;
        try {
            // The static connect method of the Jsoup class, followed by get(), returns a
            // Document object that holds the content of the whole HTML page.
            d1 = Jsoup.connect(url).get();
            System.out.println("Page title: " + d1.title() + "\n");
            /*
             * 1. Inspecting the page structure shows that the ranking we want sits in an element with class co_content2.
             * 2. An attribute selector combined with descendant selectors picks the links: "div[class='co_content2'] ul a".
             * 3. The text() method of the Element class then returns the text content.
             */
            Elements es = d1.select("div[class='co_content2'] ul a");
            // Iterate over the results and print their content
            for (Element e : es) {
                System.out.println(e.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Crawl document content
To load a document from a file, use the Jsoup.parse() method.
The page to be crawled is an HTML document stored at a local path on my machine.
The code is:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;
/**
 * Loads a document from a local file and lists Taobao's main service areas.
 * File path: C:\Users\vsue\Desktop\taobao.html
 */
public class JsoupDocDemo {
    public static void main(String[] args) {
        JsoupDocDemo.LocLoader("C:\\Users\\vsue\\Desktop\\taobao.html");
    }

    public static void LocLoader(String address) {
        Document d2;
        try {
            // Load the document from a file: Jsoup.parse() reads and parses the HTML file
            d2 = Jsoup.parse(new File(address), "utf-8");
            System.out.println(d2.title());
            Elements es = d2.select("ul[class='service-bd'] li a");
            for (Element e : es) {
                System.out.println(e.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The result is:
Crawl the content of a String
To load a document from a String, use the Jsoup.parse() method.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupStrDemo {
    public static void main(String[] args) {
        String html = "<html>"
                + "<head>"
                + "<title>First parse</title>"
                + "</head>"
                + "<body>"
                + "<p>Parsed HTML into a doc.</p>"
                + "<a href='http://www.baidu.com'>百度一下</a>"
                + "</body>"
                + "</html>";
        JsoupStrDemo.StringLoader(html);
    }

    public static void StringLoader(String html) {
        // Load the document from a String: Jsoup.parse() parses the HTML held in the string
        Document d3 = Jsoup.parse(html);
        String url = d3.select("a").attr("href");
        System.out.println(d3.title() + " " + url);
    }
}
Result:
First parse http://www.baidu.com
Get all links on the page
A page often contains a large number of URLs that jump to different pages. The method defined here obtains all the URLs on the Jingdong (JD.com) home page.
Select all the <a> tags in the page, and then traverse them:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
/**
 * Gets all the links in an HTML page
 */
public class JsoupAllUrlDemo {
    public static void main(String[] args) {
        JsoupAllUrlDemo.allUrlLoader("https://www.jd.com/");
    }

    public static void allUrlLoader(String address) {
        Document d4;
        try {
            d4 = Jsoup.connect(address).get();
            // links contains every <a> element in the page that has an href attribute
            Elements links = d4.select("a[href]");
            for (Element link : links) {
                System.out.println("text : " + link.text() + "---》link : " + link.attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Parse a body fragment
Method
Use the Jsoup.parseBodyFragment(String html) method.
String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
Description
The parseBodyFragment method creates an empty shell document and inserts the parsed HTML into the body element. The normal Jsoup.parse(String html) method generally gives the same result, but explicitly treating the input as a body fragment ensures that any bad HTML provided by the user is parsed into the body element.
The Document.body() method returns the document's body element, through which its child elements can be reached; it corresponds to doc.getElementsByTag("body").
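A minimal runnable sketch of the fragment parsing described above. The class name BodyFragmentSketch and the helper parse are illustrative names, not part of jsoup; the input is the same unclosed div fragment as in the snippet.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BodyFragmentSketch {
    /** Parses a body fragment; jsoup wraps it in an html/head/body shell. */
    public static Document parse(String fragment) {
        return Jsoup.parseBodyFragment(fragment);
    }

    public static void main(String[] args) {
        // The <div> is never closed in the input; the parser closes it.
        Document doc = parse("<div><p>Lorem ipsum.</p>");
        Element body = doc.body();
        System.out.println(body.children().size());   // number of top-level elements in the body
        System.out.println(body.child(0).tagName());  // tag of the first top-level element
        System.out.println(doc.select("p").text());   // text of the paragraph
    }
}
```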
Data extraction
DOM method to traverse the document
After parsing HTML into a Document, you can work with it using DOM-like methods.
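A small sketch of DOM-style traversal, assuming a made-up HTML snippet with an element whose id is content. The class name DomTraversalSketch, the helper describeLinks, and the markup are hypothetical and only serve the illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class DomTraversalSketch {
    // Hypothetical sample markup, not taken from any real page.
    public static final String HTML = "<div id='content'>"
            + "<a href='/one'>First</a>"
            + "<a href='/two'>Second</a>"
            + "</div>";

    /** Collects "text -> href" for every <a> inside the element with id "content". */
    public static List<String> describeLinks(String html) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById("content");  // look up one element by id
        Elements links = content.getElementsByTag("a");   // then search below it by tag name
        List<String> out = new ArrayList<>();
        for (Element link : links) {
            out.add(link.text() + " -> " + link.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeLinks(HTML)) {
            System.out.println(line);
        }
    }
}
```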
Explanation
Element objects provide a series of DOM-like methods to find elements and to extract and process their data. Details are as follows:
Find element
- getElementById(String id)
- getElementsByTag(String tag)
- getElementsByClass(String className)
- getElementsByAttribute(String key) (and related methods)
- Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
- Graph: parent(), children(), child(int index)
Element data
- attr(String key) gets an attribute; attr(String key, String value) sets an attribute
- attributes() gets all attributes
- id(), className() and classNames()
- text() gets the text content; text(String value) sets the text content
- html() gets the HTML inside the element; html(String value) sets the HTML inside the element
- outerHtml() gets the HTML of the element, including the element's own tags
- data() gets data content (for example, the content of script and style tags)
- tag() and tagName()
Manipulate HTML and text
- append(String html), prepend(String html)
- appendText(String text), prependText(String text)
- appendElement(String tagName), prependElement(String tagName)
- html(String value)
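The text and element manipulation methods listed above can be sketched as follows. ManipulateSketch and decorate are hypothetical names, and the starting markup is invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulateSketch {
    /** Starts from <p>world</p> and decorates it with text nodes and a child element. */
    public static Element decorate() {
        Document doc = Jsoup.parseBodyFragment("<p>world</p>");
        Element p = doc.select("p").first();
        p.prependText("Hello, ");            // adds a text node before the existing content
        p.appendText("!");                   // adds a text node after the existing content
        p.appendElement("em").text("done");  // creates <em>, appends it, then sets its text
        return p;
    }

    public static void main(String[] args) {
        Element p = decorate();
        System.out.println(p.ownText());        // text held directly by <p>
        System.out.println(p.child(0).text());  // text of the appended <em>
    }
}
```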
Use selector syntax to find elements
Method
Use the Element.select(String selector) and Elements.select(String selector) methods to find and operate on elements.
jsoup elements support a selector syntax similar to CSS (or jQuery), which gives a very powerful and flexible search facility.
The select method is available on Document, Element, and Elements objects. It is context-sensitive, so you can filter starting from a specific element or chain select calls.
The select method returns an Elements collection, which provides a set of methods to extract and process the results.
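To illustrate the context sensitivity of select, here is a sketch that first narrows the search to a scope and then selects links only inside that scope. The class SelectChainSketch, the helper hrefsWithin, and the markup are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class SelectChainSketch {
    // Hypothetical markup: two divs, one link without an href.
    public static final String HTML = "<div class='news'><a href='/a'>A</a><a>no href</a></div>"
            + "<div class='ads'><a href='/b'>B</a></div>";

    /** Returns the href of every link that has one, restricted to the elements matched by scopeSelector. */
    public static List<String> hrefsWithin(String html, String scopeSelector) {
        Document doc = Jsoup.parse(html);
        Elements scope = doc.select(scopeSelector);  // select on a Document
        Elements links = scope.select("a[href]");    // select on Elements: searches only inside the scope
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        System.out.println(hrefsWithin(HTML, "div.news"));
        System.out.println(hrefsWithin(HTML, "div"));
    }
}
```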
Selector overview
- tagname: find elements by tag, such as: a
- ns|tag: find elements by tag in a namespace, for example: fb|name finds <fb:name> elements
- #id: find elements by ID, such as: #logo
- .class: find elements by class name, such as: .masthead
- [attribute]: find elements that have the attribute, such as: [href]
- [^attr]: find elements by attribute name prefix, for example: [^data-] finds elements with HTML5 dataset attributes
- [attr=value]: find elements by attribute value, such as: [width=500]
- [attr^=value], [attr$=value], [attr*=value]: find elements whose attribute value starts with, ends with, or contains the value, such as: [href*=/path/]
- [attr~=regex]: find elements whose attribute value matches the regular expression, such as: img[src~=(?i)\.(png|jpe?g)]
- *: matches all elements
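The attribute selectors above can be tried against a small document. This is only a sketch: the class AttributeSelectorSketch, the helper count, and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;

public class AttributeSelectorSketch {
    // Hypothetical markup with three images and two links.
    public static final String HTML =
            "<img src='logo.PNG' width='500'>"
          + "<img src='photo.jpeg'>"
          + "<img src='manual.pdf'>"
          + "<a href='/path/page' data-id='7'>relative</a>"
          + "<a href='http://example.com/'>absolute</a>";

    /** Counts the elements in the sample document that match the selector. */
    public static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "img", "[width=500]", "[^data-]", "[href*=/path/]",
            "[href^=http]", "[src$=.pdf]", "img[src~=(?i)\\.(png|jpe?g)]"
        };
        for (String selector : selectors) {
            System.out.println(selector + " matches " + count(selector) + " element(s)");
        }
    }
}
```

Note that the backslash in the regular expression has to be doubled inside a Java string literal.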
Selector combinations
- el#id: element with ID, such as: div#logo
- el.class: element with class, for example: div.masthead
- el[attr]: element with attribute, for example: a[href]
- Any combination, such as: a[href].highlight
- ancestor child: find descendant elements of an element, for example: body p finds all p elements anywhere under the body element
- parent > child: find the direct child elements of a parent element, for example: div.content > p finds p elements that are direct children of div.content, and body > * finds all the direct children of the body tag
- siblingA + siblingB: find the sibling B element immediately preceded by sibling A, for example: div.head + div
- siblingA ~ siblingX: find sibling X elements preceded by sibling A, for example: h1 ~ p
- el, el, el: group multiple selectors and find the unique elements that match any of them, for example: div.masthead, div.logo
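A short sketch contrasting the descendant, child, and sibling combinators. The class CombinatorSketch, the helper texts, and the markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class CombinatorSketch {
    // Hypothetical markup: one direct and one nested paragraph, then a heading with two sibling paragraphs.
    public static final String HTML =
            "<div class='content'><p>direct</p><div><p>nested</p></div></div>"
          + "<h1>Title</h1><p>after one</p><p>after two</p>";

    /** Returns the text of each element matching the selector, in document order. */
    public static List<String> texts(String selector) {
        List<String> out = new ArrayList<>();
        for (Element e : Jsoup.parse(HTML).select(selector)) {
            out.add(e.text());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(texts("div.content p"));    // descendants at any depth
        System.out.println(texts("div.content > p"));  // direct children only
        System.out.println(texts("h1 + p"));           // the p right after the h1
        System.out.println(texts("h1 ~ p"));           // every sibling p after the h1
    }
}
```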
Pseudo selectors
- :lt(n): find elements whose sibling index (their position in the DOM tree relative to their parent node) is less than n, for example: td:lt(3) finds the cells in the first three columns
- :gt(n): find elements whose sibling index is greater than n, for example: div p:gt(2) finds p elements inside a div whose sibling index is greater than 2
- :eq(n): find elements whose sibling index is equal to n, for example: form input:eq(1) finds input elements inside a form whose sibling index is 1
- :has(selector): find elements that contain elements matching the selector, for example: div:has(p) finds the divs that contain a p element
- :not(selector): find elements that do not match the selector, for example: div:not(.logo) finds all divs that do not have class=logo
- :contains(text): find elements that contain the given text; the search is case-insensitive, for example: p:contains(jsoup)
- :containsOwn(text): find elements that directly contain the given text
- :matches(regex): find elements whose text matches the specified regular expression, such as: div:matches((?i)login)
- :matchesOwn(regex): find elements whose own text matches the specified regular expression
Note: The pseudo-selector indexes above start from 0, which means that the index of the first element is 0, the index of the second element is 1, and so on.
See the Selector API reference for more details.
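The pseudo selectors can be checked the same way. The sketch below assumes an invented document with one table row and two divs; PseudoSelectorSketch and texts are illustrative names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class PseudoSelectorSketch {
    // Hypothetical markup: a four-cell row, a div wrapping a paragraph, and a div with its own text.
    public static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
          + "<div><p>jsoup rocks</p></div>"
          + "<div class='logo'>Login here</div>";

    /** Returns the text of each element matching the selector, in document order. */
    public static List<String> texts(String selector) {
        List<String> out = new ArrayList<>();
        for (Element e : Jsoup.parse(HTML).select(selector)) {
            out.add(e.text());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(3)"));                 // cells with sibling index 0, 1, 2
        System.out.println(texts("td:eq(1)"));                 // the second cell (index 1)
        System.out.println(texts("div:has(p)"));               // divs containing a p
        System.out.println(texts("div:not(.logo)"));           // divs without class logo
        System.out.println(texts("p:contains(JSOUP)"));        // case-insensitive text search
        System.out.println(texts("div:containsOwn(jsoup)"));   // the text sits in the p, not directly in the div
        System.out.println(texts("div:matches((?i)login)"));   // regex match on the element text
    }
}
```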
Extract attributes, text and HTML from elements
After parsing a document and finding some elements, you will want to get the data inside those elements.
Method
- To get the value of an attribute, use the Node.attr(String key) method
- For the text in an element, use the Element.text() method
- For the HTML of an element, use the Element.html() or Node.outerHtml() method
The methods above are the core of element data access. Some additional methods are:
- Element.id()
- Element.tagName()
- Element.className() and Element.hasClass(String className)
These accessor methods have corresponding setter methods to change the data.
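A sketch of the accessor methods applied to a single link. The markup, the class ExtractSketch, and the helper firstLink are assumptions for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ExtractSketch {
    // Hypothetical markup: a paragraph that contains one link with an id and two classes.
    public static final String HTML =
            "<p>An <a id='lnk' class='ext big' href='http://example.com/'><b>example</b></a> link.</p>";

    /** Parses the HTML and returns its first <a> element. */
    public static Element firstLink(String html) {
        return Jsoup.parse(html).select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink(HTML);
        System.out.println(link.attr("href"));      // attribute value
        System.out.println(link.text());            // text without markup
        System.out.println(link.html());            // inner HTML
        System.out.println(link.outerHtml());       // HTML including the <a> tag itself
        System.out.println(link.id() + " " + link.tagName() + " " + link.className());
        System.out.println(link.hasClass("big"));
    }
}
```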
Change the data
Problem
After parsing a Document, you may want to modify some of its attribute values and then save it to disk or output it to a front-end page.
Method
Use the attribute setter methods Element.attr(String key, String value) and Elements.attr(String key, String value).
To modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods.
Elements provides methods to manipulate element attributes and classes in batches. For example, to add rel="nofollow" to every a element inside a div, you can use:
doc.select("div.comments a").attr("rel", "nofollow");
Explanation
Like the other methods in Element, the attr methods return the current Element (or the Elements collection when working on the result of a select). This allows convenient method chaining, such as:
doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
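A self-contained sketch of batch updates with chaining, assuming an invented comments block. BatchAttrSketch, sample, and markLinks are hypothetical names; the class names external and old are made up as well.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class BatchAttrSketch {
    /** Builds a hypothetical document: two links inside div.comments and one outside it. */
    public static Document sample() {
        return Jsoup.parseBodyFragment(
                "<div class='comments'><a href='/u1'>u1</a><a href='/u2' class='old'>u2</a></div>"
              + "<a href='/home'>home</a>");
    }

    /** Marks every link inside div.comments; each call returns the same Elements, so the calls chain. */
    public static Elements markLinks(Document doc) {
        return doc.select("div.comments a")
                .attr("rel", "nofollow")
                .addClass("external")
                .removeClass("old");
    }

    public static void main(String[] args) {
        Document doc = sample();
        markLinks(doc);
        System.out.println(doc.body().html());
    }
}
```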
Set the HTML content of an element
Description
- Element.html(String html) clears any existing inner HTML in an element and replaces it with the parsed HTML.
- Element.prepend(String first) and Element.append(String last) add HTML to the start or end of an element's inner HTML, respectively.
- Element.wrap(String around) wraps HTML around the outer HTML of an element.
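The HTML setters can be sketched as follows, starting from an empty div and a span. The class SetHtmlSketch, the helper build, and the markup are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SetHtmlSketch {
    /** Applies html(), prepend(), append() and wrap() to a hypothetical fragment. */
    public static Document build() {
        Document doc = Jsoup.parseBodyFragment("<div></div><span>One</span>");

        Element div = doc.select("div").first();
        div.html("<p>lorem ipsum</p>");  // replaces the inner HTML of the div
        div.prepend("<p>First</p>");     // inserted before the existing children
        div.append("<p>Last</p>");       // inserted after the existing children

        Element span = doc.select("span").first();
        span.wrap("<li><a href='http://example.com/'></a></li>");  // span ends up inside the <a>
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(build().body().html());
    }
}
```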
Set the text content of the element
Method
Use the text setter methods of Element to modify the text content of an HTML document.
Note
The text setter methods mirror the HTML setter methods:
- Element.text(String text) clears any existing inner HTML in an element and replaces it with the supplied text.
- Element.prepend(String first) and Element.append(String last) add text nodes to the start or end of an element's inner HTML, respectively.
If the supplied text contains characters such as <, >, etc., they are treated as literal text instead of HTML.
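A sketch of the text setter, assuming a hypothetical div that already holds a paragraph. It shows that markup passed to text() stays literal text; SetTextSketch and build are illustrative names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SetTextSketch {
    /** Replaces the content of a div with literal text, then adds text before and after it. */
    public static Element build() {
        Document doc = Jsoup.parseBodyFragment("<div><p>old content</p></div>");
        Element div = doc.select("div").first();
        div.text("<b>bold</b>");  // stored as text: no <b> element is created
        div.prepend("First ");    // plain text added at the start
        div.append(" Last");      // plain text added at the end
        return div;
    }

    public static void main(String[] args) {
        Element div = build();
        System.out.println(div.text());             // the literal text, angle brackets included
        System.out.println(div.html());             // the same text with < and > escaped
        System.out.println(div.children().size());  // child elements of the div
    }
}
```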