Tag text parsing: Jsoup parsing

Parsing XML text with Jsoup

jsoup is a Java HTML parser that can parse HTML directly from a URL, a file, or a string. It provides a convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods. This article shows how to use jsoup for common HTML parsing tasks.
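As a minimal sketch of parsing HTML text held in a string, the example below assumes the jsoup library is on the classpath. The class name ParseStringSketch and the sample markup are hypothetical and only serve as an illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseStringSketch {
    // Hypothetical sample markup, not taken from the article.
    static final String HTML =
        "<html><head><title>Demo page</title></head>"
        + "<body><p id=\"intro\">Hello, <b>jsoup</b>!</p></body></html>";

    // Parse the HTML text into a Document (no network or file access needed).
    static Document parse() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parse();
        // Text of the <title> element
        System.out.println(doc.title());
        // Look up one element by id and print its combined text
        Element intro = doc.getElementById("intro");
        System.out.println(intro.text());
    }
}
```

Parsing from a file works the same way once the Document has been created, as the file-based code in this article shows.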

Basic operations of Jsoup:

        import java.io.File;
        import java.io.IOException;
        import java.util.List;
        import java.util.ListIterator;

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Attribute;
        import org.jsoup.nodes.Attributes;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;
        import org.jsoup.select.Elements;

        // The statements below belong inside a method body.
        try {
            // Path of the local file that contains the markup
            String filePath = "C:\\Users\\admin\\Desktop\\Files\\input.txt";
            File file = new File(filePath);
            // Get the Document object using the specified character set
            Document document = Jsoup.parse(file, "UTF-8");
            // Get the text of the title tag
            String title = document.title();
            System.out.println("title:" + title);
            // Get the base URI of the data source
            String s = document.baseUri();
            System.out.println("baseUri:" + s);
            // Get the elements that have an attribute whose name starts with "src"
            Elements elementsBySrc = document.getElementsByAttributeStarting("src");
            ListIterator<Element> elementListIterator = elementsBySrc.listIterator();
            while (elementListIterator.hasNext()) {
                Element element = elementListIterator.next();
                // Extract the value of the src attribute
                String src = element.attr("src");
                System.out.println("src:" + src);
            }
            // Get the elements that have an href attribute
            Elements elementsByHref = document.getElementsByAttribute("href");
            ListIterator<Element> listIteratorHref = elementsByHref.listIterator();
            while (listIteratorHref.hasNext()) {
                Element element = listIteratorHref.next();
                // Extract the value of the href attribute
                String href = element.attr("href");
                System.out.println("href:" + href);
                // Get the element's own text (excluding descendants)
                String ownText = element.ownText();
                System.out.println("ownText:" + ownText);
                // Get the combined text of the element and its descendants
                String text = element.text();
                System.out.println("text:" + text);
                // Get all attributes of the element
                Attributes attributes = element.attributes();
                // Convert to a list
                List<Attribute> attributes1 = attributes.asList();
                for (Attribute attribute : attributes1) {
                    // Extract the attribute name
                    String key = attribute.getKey();
                    // Extract the attribute value
                    String value = attribute.getValue();
                    System.out.println("gaosi:::" + key + ":" + value);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }

The power of jsoup lies in how it retrieves document elements. The select method returns an Elements collection, which provides a set of methods for extracting and processing the results. The query string passed to select is written in jsoup's selector syntax.
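To illustrate how the result of select can be processed, here is a short sketch, assuming jsoup is on the classpath. The class name SelectResultsSketch and the sample markup are hypothetical.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectResultsSketch {
    // Hypothetical sample markup, not taken from the article.
    static final String HTML =
        "<a href=\"/one\">One</a><a href=\"/two\">Two</a><span>no link</span>";

    // Run a selector query and return the matching elements.
    static Elements links() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("a");
    }

    public static void main(String[] args) {
        Elements links = links();
        // Number of matched elements
        System.out.println(links.size());
        // Collect one attribute value per matched element
        List<String> hrefs = links.eachAttr("href");
        System.out.println(hrefs);
        // Collect the text of each matched element
        List<String> texts = links.eachText();
        System.out.println(texts);
        // Access a single element from the collection
        System.out.println(links.first().attr("href"));
    }
}
```

Because Elements is also a list of Element objects, it can be traversed with an iterator or a for-each loop as well.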

1. Basic selector syntax

tagname: Find elements by tags, such as: a

ns|tag: Find elements in the namespace by tags, for example, you can use the fb|name syntax to find <fb:name> elements

#id: Find element by ID, for example: #logo

.class: Find elements by class name, e.g. .masthead

[attribute]: Use attributes to find elements, such as: [href]

[^attr]: Find elements by attribute name prefix, for example: [^data-] finds elements that have HTML5 data-* attributes

[attr=value]: Use attribute value to find elements, for example: [width=500]

[attr^=value], [attr$=value], [attr*=value]: Find elements whose attribute value starts with, ends with, or contains the given value, for example: [href*=/path/]

[attr~=regex]: Find elements whose attribute value matches the regular expression, for example: img[src~=(?i)\.(png|jpe?g)]

*: this symbol will match all elements
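The basic selectors above can be tried out with a small sketch like the following, assuming jsoup is on the classpath. The class name BasicSelectorSketch, the count helper, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicSelectorSketch {
    // Hypothetical sample markup, not taken from the article.
    static final String HTML =
        "<div id=\"logo\" class=\"masthead\">"
        + "<img src=\"logo.PNG\" width=\"500\" data-kind=\"brand\"></div>"
        + "<a href=\"/path/docs\">Docs</a>"
        + "<a href=\"http://example.com/about\">About</a>"
        + "<img src=\"photo.jpeg\"><img src=\"notes.txt\">";

    static final Document DOC = Jsoup.parse(HTML);

    // Return how many elements match the given selector.
    static int count(String selector) {
        return DOC.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a",              // by tag name
            "#logo",          // by id
            ".masthead",      // by class name
            "[href]",         // has the attribute
            "[^data-]",       // attribute name starts with data-
            "[width=500]",    // attribute value equals 500
            "[href*=/path/]", // attribute value contains /path/
            "img[src~=(?i)\\.(png|jpe?g)]", // attribute value matches the regex
            "*"               // every element
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

Note that the backslash in the regular expression has to be doubled inside a Java string literal.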

2. Selector combination syntax

el#id: element + ID, for example: div#logo

el.class: element + class, for example: div.masthead

el[attr]: element + attribute, for example: a[href]

Any combination, for example: a[href].highlight

ancestor child: Find the elements that descend from a certain element at any depth, for example: .body p finds all p elements anywhere under an element with the class "body"

parent > child: Find the direct child elements of a parent element, for example: div.content > p finds the p elements that are direct children of div.content, and body > * finds all direct child elements of the body tag

siblingA + siblingB: Find the sibling element B that immediately follows element A, for example: div.head + div

siblingA ~ siblingX: Find the sibling elements X that come after element A, for example: h1 ~ p

el, el, el: Group multiple selectors to find the unique elements that match any of the selectors, for example: div.masthead, div.logo
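The difference between the combinators is easiest to see side by side. The following is a minimal sketch, assuming jsoup is on the classpath. The class name CombinatorSketch, the select helper, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class CombinatorSketch {
    // Hypothetical sample markup, not taken from the article.
    static final String HTML =
        "<div class=\"content\"><p>direct</p><section><p>nested</p></section></div>"
        + "<div class=\"head\">head</div><div>after head</div>"
        + "<h1>Title</h1><p>first</p><span>x</span><p>second</p>";

    static final Document DOC = Jsoup.parse(HTML);

    // Run a selector query against the sample document.
    static Elements select(String selector) {
        return DOC.select(selector);
    }

    public static void main(String[] args) {
        // Descendant: both p elements inside div.content, at any depth
        System.out.println(select("div.content p").eachText());
        // Child: only the p that is a direct child of div.content
        System.out.println(select("div.content > p").eachText());
        // Adjacent sibling: the div immediately after div.head
        System.out.println(select("div.head + div").eachText());
        // General sibling: every p that follows the h1 under the same parent
        System.out.println(select("h1 ~ p").eachText());
        // Grouping: elements that match either selector
        System.out.println(select("div.head, h1").eachText());
    }
}
```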

3. Pseudo-selector syntax

:lt(n): Find elements whose sibling index (their zero-based position among the children of their parent node) is less than n, for example: td:lt(3) finds the first three cells of each row

:gt(n): Find elements whose sibling index is greater than n, for example: div p:gt(2) finds the p elements inside a div whose sibling index is greater than 2

:eq(n): Find elements whose sibling index is equal to n, for example: form input:eq(1) finds the input elements inside a form whose sibling index is 1

:has(selector): Find elements that contain an element matching the selector, for example: div:has(p) finds the divs that contain a p element

:not(selector): Find elements that do not match the selector, for example: div:not(.logo) finds all divs that do not have the class logo

:contains(text): Find elements that contain the given text, the search is case-insensitive, for example: p:contains(jsoup)

:containsOwn(text): Find elements that directly contain the given text

:matches(regex): Find elements whose text matches the specified regular expression, for example: div:matches((?i)login)

:matchesOwn(regex): Finds elements whose own text matches the specified regular expression
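The pseudo selectors can be checked against a small document. The following is a sketch under the assumption that jsoup is on the classpath. The class name PseudoSelectorSketch, the select helper, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class PseudoSelectorSketch {
    // Hypothetical sample markup, not taken from the article.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
        + "<div class=\"logo\"><p>Login here</p></div>"
        + "<div><span>no paragraph</span></div>"
        + "<div><p>About jsoup</p></div>";

    static final Document DOC = Jsoup.parse(HTML);

    // Run a selector query against the sample document.
    static Elements select(String selector) {
        return DOC.select(selector);
    }

    public static void main(String[] args) {
        // Sibling indexes are zero-based: cells a, b, c
        System.out.println(select("td:lt(3)").eachText());
        // Cell d only
        System.out.println(select("td:gt(2)").eachText());
        // Cell b only
        System.out.println(select("td:eq(1)").eachText());
        // Divs that contain a p element
        System.out.println(select("div:has(p)").size());
        // Divs without the class logo
        System.out.println(select("div:not(.logo)").size());
        // Case-insensitive text search, including descendant text
        System.out.println(select("div:contains(login)").size());
        // Only the element's own text is searched, so the div does not match
        System.out.println(select("div:containsOwn(login)").size());
        // Regular expression match on the element text
        System.out.println(select("p:matches((?i)login)").eachText());
    }
}
```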

 Sample code:

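        import java.io.File;
        import java.util.Iterator;

        import org.jsoup.Jsoup;
        import org.jsoup.nodes.Document;
        import org.jsoup.nodes.Element;
        import org.jsoup.select.Elements;

        // The statements below belong inside a method body.
        try {
            String filePath = "C:\\Users\\admin\\Desktop\\Files\\input.txt";
            File file = new File(filePath);
            // Get the Document object using the specified character set
            Document document = Jsoup.parse(file, "UTF-8");
            // Select all a elements
//            Elements elements = document.select("a");
            // Select the elements whose id is gaosi
//            Elements elements = document.select("#gaosi");
            // Find elements by class name
//            Elements elements = document.select(".mainbody");
            // Find elements by attribute
//            Elements elements = document.select("[href]");
            // Find elements by attribute value
            Elements elements = document.select("[leftmargin=0]");
            Iterator<Element> iterator = elements.iterator();
            while (iterator.hasNext()) {
                Element element = iterator.next();
                // Print the inner HTML of each matched element
                String html = element.html();
                System.out.println("html:" + html);
            }

        } catch (Exception e) {
            e.printStackTrace();
        }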

 
 
