Jsoup parsing XML text
jsoup is a Java HTML parser that can parse HTML directly from a URL, a file, or a string. It provides a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. This article introduces the most common HTML parsing tasks with jsoup.
Basic operations of Jsoup:
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The code below belongs inside a method body, for example main.
try {
    String url = "C:\\Users\\admin\\Desktop\\Files\\input.txt";
    File file = new File(url);
    // Get the Document object using the specified character set
    Document document = Jsoup.parse(file, "UTF-8");
    // Get the text of the title tag
    String title = document.title();
    System.out.println("title:" + title);
    // Get the location of the data source
    String s = document.baseUri();
    System.out.println("baseUri:" + s);
    // Get the elements that have an attribute whose name starts with "src"
    Elements elementsBySrc = document.getElementsByAttributeStarting("src");
    for (Element element : elementsBySrc) {
        // Extract the value of the src attribute
        String src = element.attr("src");
        System.out.println("src:" + src);
    }
    // Get the elements that have an href attribute
    Elements elementsByHref = document.getElementsByAttribute("href");
    for (Element element : elementsByHref) {
        // Extract the value of the href attribute
        String href = element.attr("href");
        System.out.println("href:" + href);
        // Get the text owned directly by this element
        String ownText = element.ownText();
        System.out.println("ownText:" + ownText);
        // Get the combined text of this element and its descendants
        String text = element.text();
        System.out.println("text:" + text);
        // Get all attributes of the element
        Attributes attributes = element.attributes();
        // Convert them to a list
        List<Attribute> attributes1 = attributes.asList();
        for (Attribute attribute : attributes1) {
            // Extract the attribute name
            String key = attribute.getKey();
            // Extract the attribute value
            String value = attribute.getValue();
            System.out.println("gaosi:::" + key + ":" + value);
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
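The file-based example above depends on a local path, so here is a minimal sketch of the same basic operations on an in-memory string. The markup, the class name ParseStringDemo, and the base URI https://example.com/ are assumptions made for illustration only; they do not come from the original input file.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseStringDemo {
    // Hypothetical sample markup; any HTML string works here.
    static final String HTML =
        "<html><head><title>Demo page</title></head>"
        + "<body><a href=\"/docs/page.html\">Read <b>more</b></a></body></html>";

    static Document parse() {
        // The second argument is the base URI used to resolve relative links.
        return Jsoup.parse(HTML, "https://example.com/");
    }

    public static void main(String[] args) {
        Document document = parse();
        // first() returns the first matched element (here the only link).
        Element link = document.getElementsByAttribute("href").first();
        System.out.println("title:   " + document.title());
        System.out.println("baseUri: " + document.baseUri());
        System.out.println("href:    " + link.attr("href"));
        // absUrl resolves the relative href against the base URI.
        System.out.println("absUrl:  " + link.absUrl("href"));
        System.out.println("ownText: " + link.ownText());
        System.out.println("text:    " + link.text());
    }
}
```

Passing a base URI is optional, but without one absUrl has nothing to resolve relative links against.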
The power of jsoup lies in how it retrieves document elements. The select method returns an Elements collection, which provides a set of methods for extracting and processing the results. The query passed to select is written in jsoup's selector syntax.
1. Basic selector syntax
tagname: Find elements by tags, such as: a
ns|tag: Find elements in the namespace by tags, for example, you can use the fb|name syntax to find <fb:name> elements
#id: Find element by ID, for example: #logo
.class: Find elements by class name, e.g. .masthead
[attribute]: Use attributes to find elements, such as: [href]
[^attr]: Find elements by attribute name prefix, for example: [^data-] finds elements that have HTML5 data-* attributes
[attr=value]: Use attribute value to find elements, for example: [width=500]
[attr^=value], [attr$=value], [attr*=value]: Find elements whose attribute value starts with, ends with, or contains the given value, for example: [href*=/path/]
[attr~=regex]: Use attribute values to match regular expressions to find elements, for example: img[src~=(?i)\.(png|jpe?g)]
*: this symbol will match all elements
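The basic selectors listed above can be tried against a small document. The following is only a sketch: the markup and the class name BasicSelectorDemo are made up for illustration, and the helper simply counts how many elements each query matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicSelectorDemo {
    // Hypothetical markup chosen so that each selector has something to match.
    static final String HTML =
        "<div id=\"logo\" class=\"masthead\"><img src=\"logo.PNG\" width=\"500\"></div>"
        + "<a href=\"/path/a.html\" data-id=\"7\">A</a>"
        + "<a name=\"anchor\">B</a>"
        + "<img src=\"photo.jpeg\">"
        + "<img src=\"notes.txt\">";

    static final Document DOC = Jsoup.parse(HTML);

    // Number of elements matched by a selector query.
    static int count(String cssQuery) {
        return DOC.select(cssQuery).size();
    }

    public static void main(String[] args) {
        String[] queries = {
            "a",                             // by tag name
            "#logo",                         // by id
            ".masthead",                     // by class name
            "[href]",                        // has the attribute
            "[^data-]",                      // attribute name prefix
            "[width=500]",                   // attribute value
            "[href*=/path/]",                // attribute value contains
            "img[src~=(?i)\\.(png|jpe?g)]"   // attribute value matches a regex
        };
        for (String query : queries) {
            System.out.println(query + " -> " + count(query));
        }
    }
}
```

Note that the backslash in the regular expression has to be doubled inside a Java string literal.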
2. Selector combination syntax
el#id: element + ID, for example: div#logo
el.class: element + class, for example: div.masthead
el[attr]: element + attribute, for example: a[href]
Any combination, for example: a[href].highlight
ancestor child: Find descendant elements at any depth under an ancestor, for example: .body p finds all p elements anywhere under an element with class "body"
parent > child: Find the direct child elements of a parent element, for example: div.content > p finds p elements that are direct children of div.content, and body > * finds all direct children of the body tag
siblingA + siblingB: Find the sibling element B that immediately follows element A, for example: div.head + div
siblingA ~ siblingX: Find every sibling element X that comes after element A, for example: h1 ~ p
el, el, el: Group multiple selectors and find the unique elements that match any of them, for example: div.masthead, div.logo
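The difference between the combinators is easiest to see side by side. Here is a minimal sketch using invented markup; the class name CombinatorDemo and the texts helper are illustrative assumptions, not part of jsoup.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CombinatorDemo {
    // Hypothetical markup: one p is nested inside a section, the others are direct children.
    static final String HTML =
        "<div class=\"content\"><h1>Title</h1><p>one</p>"
        + "<section><p>two</p></section><p>three</p></div>"
        + "<div class=\"head\">head</div><div class=\"next\">next</div>"
        + "<div class=\"logo\">logo</div>";

    static final Document DOC = Jsoup.parse(HTML);

    // Text of every element matched by the query, in document order.
    static List<String> texts(String cssQuery) {
        return DOC.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("div.content p"));      // descendants at any depth
        System.out.println(texts("div.content > p"));    // direct children only
        System.out.println(texts("div.head + div"));     // sibling immediately after div.head
        System.out.println(texts("h1 ~ p"));             // p siblings that follow the h1
        System.out.println(texts("div.head, div.logo")); // matches either selector
    }
}
```

In this document the descendant query returns all three paragraphs, while the child query skips the paragraph nested inside the section.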
3. Pseudo-selector syntax
:lt(n): Find elements whose sibling index (their zero-based position among the children of their parent) is less than n, for example: td:lt(3) matches the td elements at positions 0, 1 and 2 of each row
:gt(n): Find elements whose sibling index is greater than n, for example: div p:gt(2) matches p elements inside a div whose sibling index is greater than 2
:eq(n): Find elements whose sibling index is equal to n, for example: form input:eq(1) matches input elements inside a form whose sibling index is 1
:has(selector): Find elements that contain at least one element matching the selector, for example: div:has(p) matches divs that contain a p element
:not(selector): Find elements that do not match the selector, for example: div:not(.logo) matches all divs that do not have the class logo
:contains(text): Find elements that contain the given text; the search is case-insensitive, for example: p:contains(jsoup)
:containsOwn(text): Find elements that directly contain the given text
:matches(regex): Find elements whose text matches the specified regular expression, for example: div:matches((?i)login)
:matchesOwn(regex): Finds elements whose own text matches the specified regular expression
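The pseudo selectors above are index-based or content-based filters, which a short experiment makes concrete. This is a sketch under assumed markup; PseudoSelectorDemo and its texts helper are hypothetical names introduced only for this example.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PseudoSelectorDemo {
    // Hypothetical markup: a five-cell table row and two divs with different content.
    static final String HTML =
        "<table><tr><td>c0</td><td>c1</td><td>c2</td><td>c3</td><td>c4</td></tr></table>"
        + "<div class=\"logo\"><p>Hello jsoup</p></div>"
        + "<div><span>Please <b>Login</b> here</span></div>";

    static final Document DOC = Jsoup.parse(HTML);

    // Text of every element matched by the query, in document order.
    static List<String> texts(String cssQuery) {
        return DOC.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(3)"));               // sibling index 0, 1, 2
        System.out.println(texts("td:gt(2)"));               // sibling index 3, 4
        System.out.println(texts("td:eq(1)"));               // sibling index 1
        System.out.println(texts("div:has(p)"));             // divs containing a p
        System.out.println(texts("div:not(.logo)"));         // divs without class logo
        System.out.println(texts("p:contains(jsoup)"));      // case-insensitive text search
        System.out.println(texts("span:containsOwn(login)")); // empty: "Login" belongs to the b
        System.out.println(texts("b:containsOwn(login)"));   // the b owns the text directly
        System.out.println(texts("div:matches((?i)login)")); // regex against the element text
    }
}
```

The two containsOwn queries show the distinction between an element's own text and the text of its descendants: the span contains "Login" only through its child b element.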
Sample code:
try {
    String url = "C:\\Users\\admin\\Desktop\\Files\\input.txt";
    File file = new File(url);
    // Get the Document object using the specified character set
    Document document = Jsoup.parse(file, "UTF-8");
    // Select all a tags
    // Elements elements = document.select("a");
    // Select the element whose id is gaosi
    // Elements elements = document.select("#gaosi");
    // Select by class attribute value
    // Elements elements = document.select(".mainbody");
    // Select by attribute
    // Elements elements = document.select("[href]");
    // Select by attribute value
    Elements elements = document.select("[leftmargin=0]");
    Iterator<Element> iterator = elements.iterator();
    while (iterator.hasNext()) {
        Element element = iterator.next();
        // Print the inner HTML of each matched element
        String html = element.html();
        System.out.println("html:" + html);
    }
} catch (Exception e) {
    e.printStackTrace();
}
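Since the sample above needs the local input file, a self-contained variant may be easier to experiment with. The sketch below runs the same queries on an in-memory string; the markup is an assumption that merely mimics what such a file might contain (an id of gaosi, a mainbody class, a leftmargin attribute), and ElementsDemo is a hypothetical class name.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ElementsDemo {
    // Hypothetical stand-in for the contents of the input file.
    static final String HTML =
        "<body leftmargin=\"0\"><div class=\"mainbody\">"
        + "<a id=\"gaosi\" href=\"a.html\">first</a> "
        + "<a href=\"b.html\">second</a></div></body>";

    static final Document DOC = Jsoup.parse(HTML);

    // Collect the href value of every element that has one.
    static List<String> hrefs() {
        return DOC.select("[href]").eachAttr("href");
    }

    public static void main(String[] args) {
        // Elements is iterable, so a for-each loop replaces the explicit Iterator.
        Elements links = DOC.select("a");
        for (Element link : links) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
        System.out.println("hrefs: " + hrefs());
        System.out.println("by id: " + DOC.select("#gaosi").first().text());
        System.out.println("by class: " + DOC.select(".mainbody").size());
        System.out.println("by attribute value: " + DOC.select("[leftmargin=0]").first().tagName());
    }
}
```

Elements also offers bulk accessors such as eachAttr and eachText, which avoid writing a loop when only the values are needed.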