jsoup implements the WHATWG HTML5 specification and parses HTML into the same DOM that modern browsers produce. With it you can:
1. Fetch and parse HTML from a URL, file, or string
2. Find and extract data using DOM traversal or CSS selectors
3. Manipulate HTML elements, attributes, and text
4. Clean user-submitted content against a safe whitelist to prevent XSS attacks
5. Output tidy HTML
Download the jar package:
Download and install jsoup from the [official jsoup.jar download page](https://jsoup.org/download).
Maven dependency:
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.12.1</version>
</dependency>
Documentation: [official documentation](https://jsoup.org/cookbook/introduction/parsing-a-document)
Parse a document from a string
String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Parse a body fragment
Use the Jsoup.parseBodyFragment(String html) method.
String html = "<div><p>Lorem Ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
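As a rough illustration of what parseBodyFragment does with incomplete markup, the sketch below parses an unclosed snippet and inspects the resulting body. The class name BodyFragmentSketch and the helper parseFragment are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BodyFragmentSketch {
    // Parses a partial snippet and returns the <body> element jsoup builds around it.
    static Element parseFragment(String fragmentHtml) {
        Document doc = Jsoup.parseBodyFragment(fragmentHtml);
        return doc.body();
    }

    public static void main(String[] args) {
        // The <div> is deliberately left unclosed; the parser closes it.
        Element body = parseFragment("<div><p>Lorem Ipsum.</p>");
        System.out.println(body.html());              // normalized fragment markup
        System.out.println(body.children().size());   // number of top-level elements
        System.out.println(body.select("p").text());  // text of the paragraph
    }
}
```

Calling body.html() on the result is a convenient way to get back a balanced version of user-supplied markup.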
Load a document from a URL
You need to fetch an HTML document from the web, parse it, and then find data within it (screen scraping).
Use the Jsoup.connect(String url) method:
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();
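In practice a fetch usually needs a timeout and some error handling. The following is a minimal sketch, assuming an illustrative user agent string and a 10 second timeout; the class FetchSketch and its methods prepare and fetchTitle are hypothetical names, not part of jsoup.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchSketch {
    // Builds the request without sending it; user agent and timeout are illustrative values.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; example-fetcher)")
                .timeout(10_000); // milliseconds
    }

    // Sends the GET request, parses the response, and returns the page title.
    static String fetchTitle(String url) throws IOException {
        Document doc = prepare(url).get();
        return doc.title();
    }

    public static void main(String[] args) {
        try {
            System.out.println(fetchTitle("http://example.com/"));
        } catch (IOException e) {
            // get() throws IOException on network failures and HTTP error statuses
            System.err.println("Fetch failed: " + e.getMessage());
        }
    }
}
```

Separating prepare from fetchTitle keeps the request settings in one place, so they can be checked or reused without touching the network.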
Load a document from a file
Use the static Jsoup.parse(File in, String charsetName, String baseUri) method:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
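The baseUri argument matters because a local file has no URL of its own, so jsoup uses baseUri to resolve relative links. The sketch below shows this with the String overload Jsoup.parse(String html, String baseUri) so that it runs without a file on disk; the class BaseUriSketch and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriSketch {
    // Returns the absolute form of the first link's href, resolved against baseUri.
    // Assumes the markup contains at least one <a> element.
    static String firstAbsoluteLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/page.html\">Docs</a>";
        System.out.println(firstAbsoluteLink(html, "http://example.com/"));
        // prints http://example.com/docs/page.html
    }
}
```

By contrast, link.attr("href") would return the relative value exactly as written in the markup.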
Navigate a document using DOM methods
After parsing the HTML into a Document, you can navigate it with DOM-like methods.
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}
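The same lookup can also be written with a CSS selector, the other extraction style mentioned in the feature list. This is a small sketch on inline markup; the class SelectorSketch, the helper contentLinks, and the sample HTML are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorSketch {
    // Collects the href of every link inside the element with id="content".
    static List<String> contentLinks(String html) {
        Document doc = Jsoup.parse(html);
        // "#content a[href]" matches <a> elements that have an href attribute
        // and sit anywhere below the element whose id is "content".
        Elements links = doc.select("#content a[href]");
        List<String> hrefs = new ArrayList<>();
        for (Element link : links) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"><a href=\"/one\">One</a><a>No href</a>"
                + "<p><a href=\"/two\">Two</a></p></div><a href=\"/outside\">Outside</a>";
        System.out.println(contentLinks(html)); // prints [/one, /two]
    }
}
```

Unlike getElementById followed by getElementsByTag, select returns an empty Elements list when the id is missing, so there is no null check to write.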
Modify
Set attribute values
Use the attribute setter methods Element.attr(String key, String value) and Elements.attr(String key, String value).
To modify an element's class attribute, use the Element.addClass(String className) and Element.removeClass(String className) methods.
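A short sketch of these setters in use follows. The class AttributeSketch, the helper tagLinks, and the attribute values are illustrative choices, not taken from the original text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AttributeSketch {
    // Marks every link as nofollow, then adjusts the first link's title and classes.
    static Document tagLinks(String html) {
        Document doc = Jsoup.parse(html);
        // Elements.attr sets the attribute on every matched element.
        doc.select("a").attr("rel", "nofollow");
        Element first = doc.select("a").first();
        if (first != null) {
            // Element setters return the element, so calls can be chained.
            first.attr("title", "First link")
                 .addClass("external")
                 .removeClass("old");
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<a class=\"old\" href=\"/a\">A</a><a href=\"/b\">B</a>";
        System.out.println(tagLinks(html).body().html());
    }
}
```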
Clean up HTML
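Sanitize untrusted HTML (to prevent XSS)
Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.
String unsafe =
  "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>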
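To see how the choice of whitelist changes the result, the sketch below cleans one input with two presets and also checks it with Jsoup.isValid. The class CleanSketch, its helper names, and the sample input are assumptions for illustration. It targets the Whitelist class of the jsoup 1.12.1 release used in this post; later jsoup releases rename it to Safelist.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanSketch {
    // Strips all markup and keeps only text nodes.
    static String textOnly(String html) {
        return Jsoup.clean(html, Whitelist.none());
    }

    // Keeps simple text formatting and links; drops scripts and event handlers.
    static String basic(String html) {
        return Jsoup.clean(html, Whitelist.basic());
    }

    // True only if the input already passes the basic whitelist unchanged in substance.
    static boolean isAlreadySafe(String html) {
        return Jsoup.isValid(html, Whitelist.basic());
    }

    public static void main(String[] args) {
        // Illustrative untrusted input with an inline event handler.
        String unsafe = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        System.out.println(textOnly(unsafe));      // text without any tags
        System.out.println(basic(unsafe));         // link kept, onclick removed
        System.out.println(isAlreadySafe(unsafe)); // prints false
    }
}
```

Checking with isValid first is useful when you want to reject bad input with an error message instead of silently rewriting it.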