HTML & XML parser
1. Jsoup Overview
-
About Jsoup
jsoup is a Java HTML parser that can parse a URL, a file, or a string of HTML directly. It offers a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.
Jsoup can parse XML as well as HTML.
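As a minimal sketch of the HTML/XML difference, assuming a reasonably recent jsoup on the classpath: Jsoup.parse(String) uses the HTML parser, which wraps the input in html/head/body, while passing Parser.xmlParser() keeps the markup as written. The class name, helper names, and the inline sample are illustrative and not part of the original examples.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class XmlVersusHtmlSketch {
    // Hypothetical inline sample, trimmed down from student.xml
    static final String XML =
            "<students><student number=\"0001\"><name>jack</name></student></students>";

    // Count <body> elements after parsing with the default HTML parser
    static int bodyCountAsHtml(String markup) {
        Document doc = Jsoup.parse(markup);
        return doc.select("body").size();
    }

    // Count <body> elements after parsing with the XML parser
    static int bodyCountAsXml(String markup) {
        Document doc = Jsoup.parse(markup, "", Parser.xmlParser());
        return doc.select("body").size();
    }

    // Read the number attribute of the first <student> using the XML parser
    static String firstStudentNumber(String markup) {
        Document doc = Jsoup.parse(markup, "", Parser.xmlParser());
        return doc.select("student").attr("number");
    }

    public static void main(String[] args) {
        System.out.println(bodyCountAsHtml(XML));    // the HTML parser adds a body
        System.out.println(bodyCountAsXml(XML));     // the XML parser adds nothing
        System.out.println(firstStudentNumber(XML));
    }
}
```

For small documents like student.xml both parsers find the same tags, so the choice mostly matters when you print the tree or rely on its exact structure.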
-
Main features of Jsoup
- Parses HTML (or XML) from a URL, a file, or a string
- Finds and extracts data using DOM traversal or CSS selectors
- Manipulates HTML/XML elements, attributes, and text
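The examples later in this post only read data, so here is a small sketch of the third feature, manipulation. It assumes jsoup is on the classpath; the class name, the method name, and the inline student fragment are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class ManipulateSketch {
    // Parse a small hypothetical fragment, edit it, and return the edited element
    static Element editedStudent() {
        Document doc = Jsoup.parse(
                "<student number=\"0002\"><name>jack</name><age>18</age></student>",
                "", Parser.xmlParser());
        Element student = doc.select("student").first();

        student.attr("number", "0003");             // change an attribute value
        student.select("age").first().text("19");   // replace an element's text
        student.appendElement("sex").text("male");  // add a new child element
        return student;
    }

    public static void main(String[] args) {
        // Print the edited markup
        System.out.println(editedStudent().outerHtml());
    }
}
```

The changes are made to the in-memory tree only; writing them back to a file is a separate step.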
-
Main classes in Jsoup
- Jsoup: utility class that parses an HTML or XML document and returns a Document
- Document: the document object, representing the DOM tree in memory
- Element: an element object
- Elements: a collection of Element objects that can be used like an ArrayList<Element>
- Node: a node object, the parent class of Document and Element
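The relationships in this list can be checked with plain assignments, which only compile if the hierarchy is as described. This is a sketch assuming jsoup is on the classpath; the class and method names are illustrative.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

public class ClassHierarchySketch {
    // Count <p> elements while walking through the class hierarchy
    static int paragraphCount(String html) {
        Document doc = Jsoup.parse(html);   // Jsoup returns a Document
        Element root = doc;                 // a Document is an Element
        Node node = root;                   // an Element is a Node
        Elements paragraphs = root.getElementsByTag("p");
        List<Element> asList = paragraphs;  // Elements is a List<Element>
        return asList.size();
    }

    // Use ordinary list indexing on an Elements collection
    static String lastParagraph(String html) {
        Elements paragraphs = Jsoup.parse(html).getElementsByTag("p");
        return paragraphs.get(paragraphs.size() - 1).text();
    }

    public static void main(String[] args) {
        String html = "<p>one</p><p>two</p>";
        System.out.println(paragraphCount(html));
        System.out.println(lastParagraph(html));
    }
}
```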
2. Using Jsoup
-
Quickstart steps
- Import the jsoup jar
- Get a Document object
- Get the Element object for the tag you want
- Read the data
The XML file, student.xml:
<students>
    <student number="0001">
        <name id="itcast">
            <xing>张</xing>
            <ming>三</ming>
        </name>
        <age>18</age>
        <sex>male</sex>
    </student>
    <student number="0002">
        <name>jack</name>
        <age>18</age>
        <sex>female</sex>
    </student>
</students>
package com.zzy.www.JsoupTest;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class JsoupDemo1 {
    public static void main(String[] args) throws IOException {
        String xmlPath = JsoupDemo1.class.getClassLoader().getResource("student.xml").getPath();
        // Get the Document object
        Document document = Jsoup.parse(new File(xmlPath), "utf-8");
        // Get the Element objects for the <name> tag
        Elements ele = document.getElementsByTag("name");
        // System.out.println(ele);
        // Get the first <name> Element
        Element e1 = ele.get(0);
        System.out.println(e1);
    }
}
-
Using the main API objects
-
Jsoup
Jsoup is the entry-point class of any jsoup program. It provides methods for loading and parsing HTML/XML documents from various sources.
The main methods are as follows:
static Connection connect(String url)
: Creates and returns a Connection to the URL
static Document parse(File in, String charsetName)
: Parses the file into a Document using the specified character set
static Document parse(String html)
: Parses the given HTML string into a Document
static String clean(String bodyHtml, Whitelist whitelist)
: Returns safe HTML from the input HTML by parsing it and keeping only the tags and attributes that the whitelist allows (newer jsoup releases name this class Safelist)
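A short sketch of the two parse overloads listed above, assuming jsoup is on the classpath. It writes the HTML to a temporary file so it does not depend on any resource being present; the class name, helper names, and sample HTML are illustrative. connect is left out because it needs network access.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;

public class JsoupEntryPointSketch {
    // parse(String html): parse markup that is already in memory
    static String titleFromString(String html) {
        return Jsoup.parse(html).title();
    }

    // parse(File in, String charsetName): parse markup stored in a file
    static String titleFromFile(String html) {
        try {
            Path tmp = Files.createTempFile("jsoup-sketch", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return Jsoup.parse(tmp.toFile(), "UTF-8").title();
            } finally {
                Files.deleteIfExists(tmp);  // clean up the temporary file
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>";
        System.out.println(titleFromString(html));
        System.out.println(titleFromFile(html));
    }
}
```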
-
Document
Document represents the DOM tree in memory.
The main methods are:
Element getElementById(String id)
: Gets the single element with the given id attribute value
Elements getElementsByTag(String tagName)
: Gets the collection of elements with the given tag name
Elements getElementsByAttribute(String key)
: Gets the collection of elements that have the given attribute name
Elements getElementsByAttributeValue(String key, String value)
: Gets the collection of elements whose attribute has the given name and value
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class JsoupDemo2 {
    public static void main(String[] args) throws IOException {
        String xmlPath = JsoupDemo2.class.getClassLoader().getResource("student.xml").getPath();
        // Get the Document object
        Document doc = Jsoup.parse(new File(xmlPath), "utf-8");

        // Get an Element by its id value
        Element eleId = doc.getElementById("itcast");
        System.out.println(eleId);
        System.out.println("=================");

        // Get all student elements by tag name
        Elements eleStudents = doc.getElementsByTag("student");
        System.out.println(eleStudents);
        System.out.println("---------------------");

        // Get the elements that have an attribute named id
        Elements eleAttrId = doc.getElementsByAttribute("id");
        System.out.println(eleAttrId);
        System.out.println("+++++++++++++++++++++");

        // Get the elements whose number attribute has the value 0002
        Elements ele = doc.getElementsByAttributeValue("number", "0002");
        System.out.println(ele);
    }
}
-
Elements
A collection of Element objects; it can be handled like an ArrayList<Element>.
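Because Elements behaves like a list, it works with an ordinary for-each loop, and it also offers bulk helpers such as eachText(). The sketch below assumes jsoup is on the classpath; the class name, helper names, and the two-student inline sample are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class ElementsSketch {
    // Hypothetical inline sample with two students
    static final String XML = "<students>"
            + "<student number=\"0001\"><name>tom</name></student>"
            + "<student number=\"0002\"><name>jack</name></student>"
            + "</students>";

    // Iterate over Elements like any other list and collect an attribute
    static List<String> studentNumbers(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Elements students = doc.getElementsByTag("student");
        List<String> numbers = new ArrayList<>();
        for (Element student : students) {
            numbers.add(student.attr("number"));
        }
        return numbers;
    }

    // eachText() returns the text of every element in the collection
    static List<String> names(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.getElementsByTag("name").eachText();
    }

    public static void main(String[] args) {
        System.out.println(studentNumbers(XML));
        System.out.println(names(XML));
    }
}
```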
-
Element
Element represents an element object.
Its main methods fall into the following categories:
- Getting child elements
Element getElementById(String id)
: Gets the single element with the given id attribute value
Elements getElementsByTag(String tagName)
: Gets the collection of elements with the given tag name
Elements getElementsByAttribute(String key)
: Gets the collection of elements that have the given attribute name
Elements getElementsByAttributeValue(String key, String value)
: Gets the collection of elements whose attribute has the given name and value
- Getting an attribute value
String attr(String key)
: Gets the attribute value for the given attribute name
- Getting the text content
String text()
: Gets the text content
String html()
: Gets the full content of the tag body, including child tags as markup
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class JsoupDemo3 {
    public static void main(String[] args) throws IOException {
        String xmlPath = JsoupDemo3.class.getClassLoader().getResource("student.xml").getPath();
        // Get the Document object
        Document doc = Jsoup.parse(new File(xmlPath), "utf-8");

        // First get an Element object
        Elements eles = doc.getElementsByTag("student");
        Element ele = eles.get(0);

        // Get an attribute value of the Element
        String attr = ele.attr("number");
        System.out.println(attr); // prints: 0001
        System.out.println("--------------");

        // Get the child elements of the Element
        Elements ele1 = ele.getElementsByTag("name");
        System.out.println(ele1);

        // Get the text content
        String txt = ele1.text();
        System.out.println(txt); // prints: 张 三
        System.out.println("===============");

        String html = ele1.html();
        System.out.println(html);
        // prints:
        // <xing>
        //  张
        // </xing>
        // <ming>
        //  三
        // </ming>
    }
}
-
A quicker and easier way is to query with a Selector.
Elements select(String cssQuery)
: For the detailed syntax, see the official jsoup documentation: https://jsoup.org/apidocs/org/jsoup/select/Selector.html
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;

public class JsoupDemo4 {
    public static void main(String[] args) throws IOException {
        String xmlPath = JsoupDemo4.class.getClassLoader().getResource("student.xml").getPath();
        Document doc = Jsoup.parse(new File(xmlPath), "utf-8");

        // Find by tag name
        Elements eles = doc.select("name");
        System.out.println(eles);
        System.out.println("----------------");

        // Find by id value
        Elements eles1 = doc.select("#itcast");
        System.out.println(eles1);
        System.out.println("================");

        // Find by attribute value: number=0002
        Elements eles2 = doc.select("[number='0002']");
        System.out.println(eles2);
    }
}
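Selectors can also be combined, which often replaces several chained getElementsBy... calls. The following is a sketch under stated assumptions: jsoup is on the classpath, and the inline sample is a trimmed copy of student.xml with ASCII placeholder values (Zhang, San) instead of the original text. The class and helper names are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class SelectorSketch {
    // Trimmed inline copy of student.xml with placeholder values
    static final String XML = "<students>"
            + "<student number=\"0001\"><name id=\"itcast\">"
            + "<xing>Zhang</xing><ming>San</ming></name><age>18</age></student>"
            + "<student number=\"0002\"><name>jack</name><age>18</age></student>"
            + "</students>";

    // Text of the elements matched by a CSS query
    static String textOf(String cssQuery) {
        Document doc = Jsoup.parse(XML, "", Parser.xmlParser());
        return doc.select(cssQuery).text();
    }

    // Number of elements matched by a CSS query
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(XML, "", Parser.xmlParser());
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Tag plus attribute value, then a direct child
        System.out.println(textOf("student[number='0002'] > name"));
        // Id, then a direct child
        System.out.println(textOf("#itcast > xing"));
        // Any <name> that carries an id attribute
        System.out.println(count("name[id]"));
        // Every <age> that is a direct child of a <student>
        System.out.println(count("student > age"));
    }
}
```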
-
-
Learning Links
https://blog.csdn.net/weixin_34129696/article/details/91885803