Parsing HTML with jsoup

    jsoup is a Java library for parsing HTML, often described as jQuery for the Java world. It provides a convenient API for fetching and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods. With it you can:

  • Parse HTML from a URL, file, or string;
  • Find and extract data using DOM traversal or CSS selectors;
  • Manipulate HTML elements, attributes, and text.
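As a minimal sketch of the manipulation point above, the following parses a small string and then changes an attribute and the text of a link. The class name `ManipulateDemo` and the sample markup are made up for illustration; the calls used (`Jsoup.parse`, `selectFirst`, `attr`, `text`) are part of the jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ManipulateDemo {
    // Parses a small HTML string, then edits an attribute and the text of its link.
    static Element rewriteLink() {
        Document doc = Jsoup.parse("<p><a href=\"/old\">Old</a></p>"); // hypothetical markup
        Element link = doc.selectFirst("a"); // first matching element
        link.attr("href", "/new");           // set an attribute value
        link.text("New");                    // replace the text content
        return link;
    }

    public static void main(String[] args) {
        Element link = rewriteLink();
        System.out.println(link.attr("href") + " " + link.text());
    }
}
```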

jsoup is released under the MIT license, so it can be used in commercial projects with confidence.

Simple to use:

1. Convert HTML to a Document object

      Example: Document document = Jsoup.parse(html);
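A self-contained version of this step might look like the sketch below. The class name `ParseDemo` and the sample page are assumptions for illustration. Only string input is exercised here; jsoup also offers `Jsoup.connect(url).get()` for URLs and `Jsoup.parse(file, charsetName)` for files, both of which can throw `IOException`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseDemo {
    // Parses an HTML string into a Document and returns the page title.
    static String titleOf(String html) {
        Document document = Jsoup.parse(html);
        return document.title();
    }

    public static void main(String[] args) {
        // Hypothetical sample page, not taken from the article.
        String html = "<html><head><title>Hello</title></head>"
                    + "<body><p>Hi</p></body></html>";
        System.out.println(titleOf(html));
    }
}
```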

2. Use selectors

     Example: document.select(".comProfile p").html(); // Gets the inner HTML of the p elements nested inside an element whose class is comProfile.
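To make the selector call concrete, here is a small sketch that runs it against made-up markup. The class name `SelectDemo` and the sample HTML are assumptions. `select` returns an `Elements` collection, on which `html()` gives the combined inner HTML and `text()` gives the combined text of the matches.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SelectDemo {
    // Hypothetical markup: one paragraph inside .comProfile, one outside it.
    static final String HTML =
        "<div class=\"comProfile\"><p>About <b>us</b></p></div><p>Other</p>";

    // Selects only the p elements nested inside an element with class comProfile.
    static Elements profileParagraphs() {
        Document document = Jsoup.parse(HTML);
        return document.select(".comProfile p");
    }

    public static void main(String[] args) {
        Elements found = profileParagraphs();
        System.out.println(found.html()); // inner HTML, markup included
        System.out.println(found.text()); // text only
    }
}
```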

3. Introduction to selectors

    Once elements are selected, you can operate on them. Selecting the right elements is therefore the key step, and it depends on jsoup's selector syntax:

        Basic selectors:

            tagname: finds elements by tag name, e.g. a

            ns|tag: finds elements by tag in a namespace, e.g. fb|name finds <fb:name> elements

            #id: finds elements by ID, e.g. #logo

            .class: finds elements by class name, e.g. .masthead

            [attribute]: elements that have the attribute, e.g. [href]

            [^attr]: elements with an attribute name starting with the prefix, e.g. [^data-] finds elements with HTML5 dataset attributes

            [attr=value]: elements whose attribute has the given value, e.g. [width=500]

            [attr^=value], [attr$=value], [attr*=value]: elements whose attribute value starts with, ends with, or contains value, e.g. [href*=/path/]

            [attr~=regex]: elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]

            *: all elements, e.g. *
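The basic selectors can be tried out with a sketch like the one below, which counts matches against a small made-up fragment. The class name `BasicSelectorDemo`, the helper `count`, and the markup are assumptions for illustration. The namespace form and `*` are left out to keep the sample short.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BasicSelectorDemo {
    // Hypothetical fragment covering tags, an ID, a class, and several attributes.
    static final String HTML =
        "<div id=\"logo\" class=\"masthead\" data-role=\"banner\">"
      + "<a href=\"/path/home\">Home</a><a>Plain</a>"
      + "<img src=\"a.PNG\" width=\"500\"><img src=\"b.gif\">"
      + "</div>";

    // Returns how many elements in the fragment match the given selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a", "#logo", ".masthead", "[href]", "[^data-]",
            "[width=500]", "[href*=/path/]",
            "img[src~=(?i)\\.(png|jpe?g)]" // backslash doubled for the Java string
        };
        for (String s : selectors) {
            System.out.println(s + " -> " + count(s));
        }
    }
}
```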

        Combination selectors:

                el#id: elements with the given ID, e.g. div#logo

                el.class: elements with the given class, e.g. div.masthead

                el[attr]: elements with the given attribute, e.g. a[href]

                Any combination: e.g. a[href].highlight

                ancestor child: child elements that descend from ancestor at any depth, e.g. .body p finds p elements anywhere under an element with class "body"

                parent > child: child elements that are direct children of parent, e.g. div.content > p finds p elements that are direct children of a div with class content, and body > * finds the direct children of the body element

                siblingA + siblingB: finds sibling B elements immediately preceded by sibling A, e.g. div.head + div

                siblingA ~ siblingX: finds sibling X elements preceded (not necessarily immediately) by sibling A, e.g. h1 ~ p

                el, el, el: groups multiple selectors and finds the unique elements that match any of them, e.g. div.masthead, div.logo
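The difference between the descendant, child, and sibling combinators is easiest to see side by side. The sketch below returns the IDs of the matched elements for a made-up fragment. The class name `CombinatorDemo`, the helper `ids`, and the markup are assumptions for illustration.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinatorDemo {
    // Hypothetical fragment: p1 is a direct child of div.content, p2 is nested deeper,
    // and p3 and p4 are later siblings of the h1.
    static final String HTML =
        "<div class=\"content\"><p id=\"p1\">One</p>"
      + "<section><p id=\"p2\">Two</p></section></div>"
      + "<h1 id=\"t\">Title</h1><p id=\"p3\">Three</p>"
      + "<span id=\"s\">x</span><p id=\"p4\">Four</p>";

    // Returns the id attribute of every element matching the selector.
    static List<String> ids(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).eachAttr("id");
    }

    public static void main(String[] args) {
        System.out.println(ids("div.content p"));   // descendants at any depth
        System.out.println(ids("div.content > p")); // direct children only
        System.out.println(ids("h1 + p"));          // the p right after the h1
        System.out.println(ids("h1 ~ p"));          // every later p sibling of the h1
        System.out.println(ids("h1, span"));        // elements matching either selector
    }
}
```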

        Pseudo selectors:

                :lt(n): finds elements whose sibling index (their position relative to their parent) is less than n, e.g. td:lt(3)

                :gt(n): finds elements whose sibling index is greater than n, e.g. div p:gt(2)

                :eq(n): finds elements whose sibling index is equal to n, e.g. form input:eq(1)

                :has(selector): finds elements that contain elements matching the selector, e.g. div:has(p)

                :not(selector): finds elements that do not match the selector, e.g. div:not(.logo)

                :contains(text): finds elements that contain the given text, either directly or in a descendant; the search is case-insensitive, e.g. p:contains(jsoup)

                :containsOwn(text): finds elements that directly contain the given text (case-insensitive)

                :matches(regex): finds elements whose text matches the regular expression, e.g. div:matches((?i)login)

                :matchesOwn(regex): finds elements whose own text matches the regular expression

        Note: the index-based pseudo selectors above (:lt, :gt, :eq) use indexes that start from 0
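The pseudo selectors, including the zero-based indexing, can be checked with a sketch along these lines. The class name `PseudoSelectorDemo`, the helper `find`, and the markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class PseudoSelectorDemo {
    // Hypothetical fragment: a four-cell table row and two divs.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
      + "<div class=\"logo\"><p>Jsoup rocks</p></div>"
      + "<div><span>Login here</span></div>";

    // Returns the elements in the fragment that match the selector.
    static Elements find(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector);
    }

    public static void main(String[] args) {
        System.out.println(find("td:lt(3)").text());  // cells at index 0, 1, 2
        System.out.println(find("td:gt(2)").text());  // cells at index above 2
        System.out.println(find("td:eq(1)").text());  // the cell at index 1
        System.out.println(find("div:has(p)").size());
        System.out.println(find("div:not(.logo)").size());
        System.out.println(find("p:contains(jsoup)").size());     // case-insensitive
        System.out.println(find("div:containsOwn(login)").size()); // text sits in the span
        System.out.println(find("div:matches((?i)login)").size());
    }
}
```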

