jsoup is often described as the jQuery of the Java world: an HTML parser with a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. With it you can:
- Parse HTML from a URL, file, or string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text.
jsoup is also released under the MIT license, so it can be used in commercial projects with confidence.
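As a rough illustration of the manipulation point in the list above, the sketch below rewrites a link's attribute, text, and class. The class name ManipulateExample, the helper updateLink, and the sample markup are made up for this example; attr, text, and addClass are methods of jsoup's Element class.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulateExample {
    // Hypothetical helper: parses the markup, then edits its first <a> element in place.
    static Element updateLink(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();
        link.attr("href", "/new");   // replace the attribute value
        link.text("New page");       // replace the text content
        link.addClass("updated");    // add a CSS class
        return link;
    }

    public static void main(String[] args) {
        Element link = updateLink("<a href=\"/old\">Old page</a>");
        System.out.println(link.outerHtml());
    }
}
```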
Simple to use:
1. Convert HTML into a Document object
Example: Document document = Jsoup.parse(html);
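A minimal, self-contained version of this step might look like the following. The class name ParseExample, the helper titleOf, and the sample HTML are assumptions made for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseExample {
    // Parses an HTML string into a Document and returns the text of its <title>.
    static String titleOf(String html) {
        Document document = Jsoup.parse(html);
        return document.title();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                    + "<body><p>Hello</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```

For the other two input sources, jsoup also offers Jsoup.connect(url).get() for a URL and Jsoup.parse(file, charsetName) for a file; both return a Document as well.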
2. Use selectors
Example: document.select(".comProfile p").html(); // Get the inner HTML of the p elements inside the element whose class is comProfile.
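Put together with the parsing step, the selector call can be tried out like this. It is only a sketch: the class SelectExample, the helper profileHtml, and the sample markup are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectExample {
    // Returns the inner HTML of every <p> found under an element with class "comProfile".
    static String profileHtml(String html) {
        Document document = Jsoup.parse(html);
        return document.select(".comProfile p").html();
    }

    public static void main(String[] args) {
        String html = "<div class=\"comProfile\"><p>Founded in 2001</p></div>"
                    + "<p>Unrelated paragraph</p>";
        System.out.println(profileHtml(html)); // only the paragraph inside .comProfile
    }
}
```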
3. Introduction to selectors
Once elements have been selected, you can operate on them. The key to selecting the right elements is jsoup's selector syntax, summarized below:
Basic selectors:
tagname: find elements by tag name, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: elements that have the attribute, e.g. [href]
[^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: elements whose attribute has the given value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: elements whose attribute value starts with, ends with, or contains value, e.g. [href*=/path/]
[attr~=regex]: elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: all elements, e.g. *
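To get a feel for the basic selectors, the sketch below counts how many elements each one matches in a small hand-written fragment. The class BasicSelectorExample, the helper count, and the markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BasicSelectorExample {
    // Sample markup, made up for this example.
    static final String HTML =
        "<div id=\"logo\" class=\"masthead\" data-role=\"banner\">"
      + "<a href=\"/path/index.html\">Home</a>"
      + "<a name=\"top\">Top</a>"
      + "<img src=\"photo.JPG\" width=\"500\">"
      + "<img src=\"icon.gif\">"
      + "</div>";

    // Hypothetical helper: number of elements matched by a CSS query.
    static int count(String css) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(css).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a", "#logo", ".masthead", "[href]", "[^data-]",
            "[width=500]", "[href*=/path/]", "img[src~=(?i)\\.(png|jpe?g)]"
        };
        for (String q : queries) {
            System.out.println(q + " -> " + count(q));
        }
    }
}
```

Note that the backslash in the regular expression has to be doubled inside a Java string literal.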
Combination selectors:
el#id: an element with the given ID, e.g. div#logo
el.class: elements with the given class, e.g. div.masthead
el[attr]: elements with the given attribute, e.g. a[href]
Any combination of the above, e.g. a[href].highlight
ancestor child: elements that descend from the ancestor at any depth, e.g. .body p finds p elements anywhere under an element with class "body"
parent > child: elements that are direct children of the parent, e.g. div.content > p finds p elements that are direct children of a div with class content, and body > * finds the direct children of the body element
siblingA + siblingB: find sibling B elements immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: find sibling X elements preceded (not necessarily immediately) by sibling A, e.g. h1 ~ p
el, el, el: group multiple selectors to find the unique elements that match any of them, e.g. div.masthead, div.logo
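The difference between the descendant, child, and sibling combinators is easier to see on a concrete fragment. The following is a sketch under assumed names: CombinatorExample, the helper texts, and the markup are invented here; eachText is a method of jsoup's Elements class in reasonably recent versions.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinatorExample {
    // Sample markup, made up for this example.
    static final String HTML =
        "<div class=\"content\"><p>direct</p><section><p>nested</p></section></div>"
      + "<h1>Title</h1><p>first</p><span>x</span><p>second</p>";

    // Hypothetical helper: text of each element matched by a CSS query, in document order.
    static List<String> texts(String css) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(css).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("div.content p"));   // descendants at any depth
        System.out.println(texts("div.content > p")); // direct children only
        System.out.println(texts("h1 + p"));          // p immediately after h1
        System.out.println(texts("h1 ~ p"));          // every later p sibling of h1
        System.out.println(texts("h1, span"));        // union of two selectors
    }
}
```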
Pseudo selectors:
:lt(n): find elements whose sibling index (that is, their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3)
:gt(n): find elements whose sibling index is greater than n, e.g. div p:gt(2)
:eq(n): find elements whose sibling index is equal to n, e.g. form input:eq(1)
:has(selector): find elements that contain at least one element matching the selector, e.g. div:has(p)
:not(selector): find elements that do not match the selector, e.g. div:not(.logo)
:contains(text): find elements that contain the given text anywhere in themselves or their descendants; the search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text): find elements that directly contain the given text
:matches(regex): find elements whose text matches the specified regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): find elements whose own text matches the specified regular expression
Note: the index-based pseudo selectors above (:lt, :gt, :eq) use indices that start from 0.
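The pseudo selectors can be checked in the same way. Again this is only a sketch: the class PseudoSelectorExample, the helper texts, and the markup are assumptions made for the example.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PseudoSelectorExample {
    // Sample markup, made up for this example.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
      + "<div class=\"logo\"><p>jsoup rocks</p></div>"
      + "<div><span>Login here</span></div>";

    // Hypothetical helper: text of each element matched by a CSS query, in document order.
    static List<String> texts(String css) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(css).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(2)"));                // cells at index 0 and 1
        System.out.println(texts("td:gt(2)"));                // cell at index 3
        System.out.println(texts("td:eq(1)"));                // cell at index 1
        System.out.println(texts("div:has(p)"));              // div containing a p
        System.out.println(texts("div:not(.logo)"));          // div without class logo
        System.out.println(texts("p:contains(JSOUP)"));       // case-insensitive text search
        System.out.println(texts("div:matches((?i)login)"));  // regex over the element's text
        System.out.println(texts("div:containsOwn(login)"));  // text sits in the span, not the div
        System.out.println(texts("span:containsOwn(login)")); // the span holds the text directly
    }
}
```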