HTML & XML parser --Jsoup

I. Jsoup Overview

  1. Jsoup Introduction

    jsoup is a Java HTML parser that can directly parse a URL, a file, or a string of HTML. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

    Besides HTML, Jsoup can also parse XML.
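
A minimal sketch of the XML side, assuming jsoup is on the classpath. By default Jsoup.parse uses the HTML parser, which wraps the input in html/head/body elements; passing Parser.xmlParser() keeps the XML structure as written. The class name XmlParseSketch, the helper firstAge, and the inline XML string are illustrative only and do not come from the original post.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class XmlParseSketch {
    // Parses an XML string with jsoup's XML parser and returns
    // the text of the first <age> element (assumes one exists).
    static String firstAge(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.getElementsByTag("age").first().text();
    }

    public static void main(String[] args) {
        String xml = "<students><student number=\"0001\"><age>18</age></student></students>";
        System.out.println(firstAge(xml));
    }
}
```

The second argument is the base URI used to resolve relative links; an empty string is enough when the document contains none.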

  2. Main features of Jsoup

    • Parse HTML (or XML) from a URL, a file, or a string
    • Find and extract data using DOM traversal or CSS selectors
    • Manipulate HTML / XML elements, attributes, and text
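
The three features above can be sketched in a few lines. This is only an illustration under assumptions: the class FeatureSketch, the helper relabelFirstLink, and the sample markup are made up for this example, and the input is assumed to contain at least one link.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FeatureSketch {
    // Parses an HTML string, finds the first link with a CSS selector,
    // then rewrites its href attribute and its text.
    static Element relabelFirstLink(String html, String href, String label) {
        Document doc = Jsoup.parse(html);         // 1. parse from a string
        Element link = doc.select("a").first();   // 2. find with a CSS selector
        link.attr("href", href);                  // 3. change an attribute
        link.text(label);                         //    and the text content
        return link;
    }

    public static void main(String[] args) {
        Element link = relabelFirstLink("<p><a href=\"#\">old</a></p>",
                "https://jsoup.org/", "jsoup");
        System.out.println(link.attr("href") + " " + link.text());
    }
}
```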

  3. Main Jsoup classes

    • Jsoup: utility class that parses an HTML or XML document and returns a Document
    • Document: the document object, representing the in-memory DOM tree
    • Element: an element object
    • Elements: a collection of Element objects; it can be used like an ArrayList<Element>
    • Node: a node object, the parent class of Document and Element
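
The relationships between these classes can be checked with plain assignments. The sketch below is illustrative only; ClassHierarchySketch and countParagraphs are hypothetical names, and the sample HTML is an assumption.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;
import java.util.List;

public class ClassHierarchySketch {
    // Counts the <p> elements, going through plain List<Element> methods.
    static int countParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        Element asElement = doc;     // a Document is also an Element
        Node asNode = asElement;     // and every Element is a Node
        Elements paragraphs = asElement.select("p");
        List<Element> asList = paragraphs;   // Elements can be handled as a List<Element>
        return asList.size();
    }

    public static void main(String[] args) {
        System.out.println(countParagraphs("<p>one</p><p>two</p>"));
    }
}
```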

II. Using Jsoup

  1. Quickstart steps

    • Import the jsoup jar
    • Obtain the Document object
    • Get the Element object for the desired tag
    • Retrieve the data

    XML file student.xml

    <students>
        <student number="0001">
            <name id="itcast">
                <xing>张</xing>
                <ming>三</ming>
            </name>
            <age>18</age>
            <sex>male</sex>
        </student>
        <student number="0002">
            <name>jack</name>
            <age>18</age>
            <sex>female</sex>
        </student>
    
    </students>
    
    package com.zzy.www.JsoupTest;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import java.io.File;
    import java.io.IOException;
    
    public class JsoupDemo1 {
        public static void main(String[] args) throws IOException {
            String xmlPath = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
            // Obtain the Document object
            Document document = Jsoup.parse(new File(xmlPath), "utf-8");
    
            // Get the Element objects for the <name> tag
            Elements ele = document.getElementsByTag("name");
    
    //        System.out.println(ele);
    
            // Get the first <name> Element
            Element e1 = ele.get(0);
            System.out.println(e1);
        }
    }
    
  2. Using the main API objects

    • Jsoup

      Jsoup is the entry-point class of any jsoup program. It provides methods for loading and parsing HTML / XML documents from various sources.

      The main methods are as follows:

      • static Connection connect(String url): Creates and returns a Connection to the URL
      • static Document parse(File in, String charsetName): Parses the file into a Document using the specified character set
      • static Document parse(String html): Parses the given HTML string into a Document
      • static String clean(String bodyHtml, Whitelist whitelist): Parses the input HTML and returns safe HTML, keeping only the tags and attributes that the whitelist allows (newer jsoup releases rename Whitelist to Safelist)
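
A rough sketch of the loading methods listed above, assuming jsoup is on the classpath. EntryPointSketch, its helper methods, and the temporary file are invented for illustration. titleFromUrl needs network access, so main does not call it, and clean is left out because its second parameter type depends on the jsoup version.

```java
import org.jsoup.Jsoup;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class EntryPointSketch {
    // parse(String): build a Document from markup already in memory.
    static String titleFromString(String html) {
        return Jsoup.parse(html).title();
    }

    // parse(File, charsetName): build a Document from a file on disk.
    static String titleFromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8").title();
    }

    // connect(url).get(): fetch and parse a live page (requires network access).
    static String titleFromUrl(String url) throws IOException {
        return Jsoup.connect(url).get().title();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>";

        // Write the same markup to a temporary file so both paths can be compared.
        File tmp = File.createTempFile("jsoup-demo", ".html");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), html.getBytes(StandardCharsets.UTF_8));

        System.out.println(titleFromString(html));
        System.out.println(titleFromFile(tmp));
    }
}
```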
    • Document

      Document represents the in-memory DOM tree.

      The main methods are:

      • Element getElementById(String id): Gets the single element whose id attribute has the given value
      • Elements getElementsByTag(String tagName): Gets the collection of elements with the given tag name
      • Elements getElementsByAttribute(String key): Gets the collection of elements that have the given attribute
      • Elements getElementsByAttributeValue(String key, String value): Gets the collection of elements whose attribute has the given value

      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.nodes.Element;
      import org.jsoup.select.Elements;
      import java.io.File;
      import java.io.IOException;

      public class JsoupDemo2 {
          public static void main(String[] args) throws IOException {
              String xmlPath = JsoupDemo2.class.getClassLoader().getResource("student.xml").getPath();
      
              // Obtain the Document object
              Document doc = Jsoup.parse(new File(xmlPath), "utf-8");
      
              // Get an Element by its id value
              Element eleId = doc.getElementById("itcast");
              System.out.println(eleId);
              System.out.println("=================");
      
              // Get all <student> elements
              // by tag name
              Elements eleStudents = doc.getElementsByTag("student");
              System.out.println(eleStudents);
              System.out.println("---------------------");
      
              // Get the elements that have an attribute named id
              Elements eleAttrId = doc.getElementsByAttribute("id");
              System.out.println(eleAttrId);
              System.out.println("+++++++++++++++++++++");
      
              // Get the elements whose number attribute has the value 0002
              Elements ele = doc.getElementsByAttributeValue("number", "0002");
              System.out.println(ele);
          }
      }
      
    • Elements

      A collection of Element objects; it can be handled like an ArrayList<Element>.
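
Treating Elements as a list might look like the sketch below. The names ElementsSketch and studentNumbers and the shortened XML string are assumptions made for illustration; the XML parser is used so the input is not wrapped in HTML structure.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;
import java.util.ArrayList;
import java.util.List;

public class ElementsSketch {
    // Collects the "number" attribute of every <student> element, in document order.
    static List<String> studentNumbers(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Elements students = doc.getElementsByTag("student");
        List<String> numbers = new ArrayList<>();
        // size() and get(i) behave as they do on an ArrayList
        for (int i = 0; i < students.size(); i++) {
            numbers.add(students.get(i).attr("number"));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String xml = "<students>"
                + "<student number=\"0001\"></student>"
                + "<student number=\"0002\"></student>"
                + "</students>";
        System.out.println(studentNumbers(xml));
    }
}
```

A for-each loop over the Elements object works as well, since it is an ordinary Java collection.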

    • Element

      Element represents an element object.

      Its main methods fall into the following categories:

      1. Getting child elements
        • Element getElementById(String id): Gets the single element whose id attribute has the given value
        • Elements getElementsByTag(String tagName): Gets the collection of elements with the given tag name
        • Elements getElementsByAttribute(String key): Gets the collection of elements that have the given attribute
        • Elements getElementsByAttributeValue(String key, String value): Gets the collection of elements whose attribute has the given value
      2. Getting attribute values
        • String attr(String key): Gets the value of the attribute with the given name
      3. Getting text content
        • String text(): Gets the text content
        • String html(): Gets the full content of the tag body, including the markup of child tags as a string

      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.nodes.Element;
      import org.jsoup.select.Elements;
      import java.io.File;
      import java.io.IOException;

      public class JsoupDemo3 {
          public static void main(String[] args) throws IOException {
              String xmlPath = JsoupDemo3.class.getClassLoader().getResource("student.xml").getPath();
      
              // Obtain the Document object
              Document doc = Jsoup.parse(new File(xmlPath), "utf-8");
      
      
              // First get the Element object (the first <student>)
              Elements eles = doc.getElementsByTag("student");
              Element ele = eles.get(0);
      
              // Get an attribute value of the Element
              String attr = ele.attr("number");
              System.out.println(attr);   // prints: 0001
              System.out.println("--------------");
      
              // Get the child elements of the Element
              Elements ele1 = ele.getElementsByTag("name");
              System.out.println(ele1);
      
              // Get the text content
              String txt = ele1.text();
              System.out.println(txt); // prints: 张 三
              System.out.println("===============");
      
              String html = ele1.html();
              System.out.println(html);
              // Output:
      //            <xing>
      //                张 
      //                </xing> 
      //            <ming>
      //                三 
      //                </ming>
          }
      }
      
    • Selector: a quicker and more convenient way to find elements

      • Elements select(String cssQuery): Finds the elements that match a CSS query; it is called on a Document or an Element. For the detailed syntax, see the official Jsoup documentation: https://jsoup.org/apidocs/org/jsoup/select/Selector.html

      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.select.Elements;
      import java.io.File;
      import java.io.IOException;

      public class JsoupDemo4 {
          public static void main(String[] args) throws IOException {
              String xmlPath = JsoupDemo4.class.getClassLoader().getResource("student.xml").getPath();
      
              Document doc = Jsoup.parse(new File(xmlPath), "utf-8");
      
              // Find by tag name
              Elements eles = doc.select("name");
              System.out.println(eles);
              System.out.println("----------------");
      
              // Find by id value
              Elements eles1 = doc.select("#itcast");
              System.out.println(eles1);
              System.out.println("================");
      
              // Find by attribute value: number=0002
              Elements eles2 = doc.select("[number='0002']");
              System.out.println(eles2);
          }
      }
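
Selectors can also be combined. The sketch below is an illustration rather than part of the original demo: it joins a tag name, an attribute filter, and the child combinator (>). SelectorSketch, selectText, and the inline XML (including the age values) are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class SelectorSketch {
    // Shortened, hypothetical variant of student.xml.
    static final String XML =
            "<students>"
            + "<student number=\"0001\"><name>tom</name><age>18</age></student>"
            + "<student number=\"0002\"><name>jack</name><age>20</age></student>"
            + "</students>";

    // Runs a CSS query against XML and returns the combined text of all matches.
    static String selectText(String cssQuery) {
        Document doc = Jsoup.parse(XML, "", Parser.xmlParser());
        return doc.select(cssQuery).text();
    }

    public static void main(String[] args) {
        // <age> elements that are direct children of the student numbered 0002
        System.out.println(selectText("student[number='0002'] > age"));
        // <name> elements that are direct children of any <student>
        System.out.println(selectText("student > name"));
    }
}
```

When several elements match, text() on the resulting Elements joins their text with a space.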
      
  3. Learning Links

    https://blog.csdn.net/weixin_34129696/article/details/91885803

Origin www.cnblogs.com/LucasBlog/p/12577218.html