Java HTML parser [jsoup]

Introduction:

jsoup implements the WHATWG HTML5 specification and parses HTML into the same DOM that modern browsers produce. With it you can:
1. Scrape and parse HTML from a URL, file, or string
2. Find and extract data using DOM traversal or CSS selectors
3. Manipulate HTML elements, attributes, and text
4. Clean user-submitted content against a security whitelist to prevent XSS attacks
5. Output tidy HTML

Download jar package:

Download and install jsoup: [official jsoup.jar download page](https://jsoup.org/download)
Maven coordinates:

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.12.1</version>
    </dependency>

Documentation: [official cookbook](https://jsoup.org/cookbook/introduction/parsing-a-document)

Parse a document from a string

    String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);

Parse a body fragment

    String html = "<div><p>Lorem Ipsum.</p>";
    Document doc = Jsoup.parseBodyFragment(html);
    Element body = doc.body();

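This uses the Jsoup.parseBodyFragment(String html) method, which wraps the fragment in an empty document shell so that doc.body() returns the parsed elements.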
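As a minimal sketch of what fragment parsing yields, the example below parses a fragment with an unclosed `<div>` and inspects the resulting body. The class name `BodyFragmentExample` and the helper `parseFragment` are made up for illustration; only `Jsoup.parseBodyFragment` and the `Element` accessors come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BodyFragmentExample {
    // Hypothetical helper: parses a fragment and returns the <body>
    // element that jsoup creates around it.
    static Element parseFragment(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body();
    }

    public static void main(String[] args) {
        // The <div> is deliberately left unclosed; the parser closes it.
        Element body = parseFragment("<div><p>Lorem ipsum.</p>");
        System.out.println(body.children().size());  // top-level elements inside <body>
        System.out.println(body.child(0).tagName()); // tag name of the first child
        System.out.println(body.text());             // combined text content
    }
}
```

With this input the body holds a single `div` child, and its text is `Lorem ipsum.`.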

Load the document from a URL

You need to fetch an HTML document from the web and parse it to find data within it (screen scraping).

Use the Jsoup.connect(String url) method:

    Document doc = Jsoup.connect("http://example.com/").get();
    String title = doc.title();
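In practice a fetch usually sets a user agent and a timeout and handles failures. The following is a sketch under assumptions, not a recommended configuration: the class `FetchExample`, its two helper methods, the user-agent string, and the 10-second timeout are all illustrative choices.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchExample {
    // Hypothetical helper: builds a configured connection without
    // sending the request yet.
    static Connection configure(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; jsoup-example)") // made-up UA string
                .timeout(10_000); // milliseconds; illustrative value
    }

    // Sends a GET request and returns the page title.
    static String fetchTitle(String url) throws IOException {
        Document doc = configure(url).get();
        return doc.title();
    }

    public static void main(String[] args) {
        try {
            System.out.println(fetchTitle("http://example.com/"));
        } catch (IOException e) {
            // get() reports network failures and HTTP error statuses as IOException.
            System.err.println("Fetch failed: " + e.getMessage());
        }
    }
}
```

Separating `configure` from `fetchTitle` keeps the request settings in one place and lets them be checked without touching the network.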

Load the document from a file

Use the static Jsoup.parse(File in, String charsetName, String baseUri) method:

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
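The baseUri argument matters once you read links out of the file: jsoup uses it to resolve relative URLs. The sketch below illustrates this with a temporary file. The class `BaseUriExample`, both helper methods, and the sample markup are assumptions made for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BaseUriExample {
    // Parses the file and returns every link resolved against baseUri.
    static List<String> absoluteLinks(File input, String baseUri) throws IOException {
        Document doc = Jsoup.parse(input, "UTF-8", baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.getElementsByTag("a")) {
            urls.add(link.absUrl("href")); // relative href + baseUri -> absolute URL
        }
        return urls;
    }

    // Writes a small sample page to a temporary file and resolves its links.
    static List<String> demo() {
        try {
            File tmp = File.createTempFile("jsoup-example", ".html");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(),
                    "<a href=\"/about.html\">About</a>".getBytes(StandardCharsets.UTF_8));
            return absoluteLinks(tmp, "http://example.com/");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [http://example.com/about.html]
    }
}
```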

Navigate a document using DOM methods

After parsing, the Document can be navigated with DOM-like methods.

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    Element content = doc.getElementById("content");
    Elements links = content.getElementsByTag("a");
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
    }
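The same kind of traversal can also be written with CSS selectors through `select`. This is a sketch rather than a drop-in replacement: the class `SelectorExample`, the helper `contentLinks`, and the sample markup are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectorExample {
    // Returns "text -> href" for every link with an href inside #content.
    static List<String> contentLinks(String html) {
        Document doc = Jsoup.parse(html);
        // "#content a[href]" matches <a> elements that have an href attribute
        // and are descendants of the element whose id is "content".
        Elements links = doc.select("#content a[href]");
        List<String> result = new ArrayList<>();
        for (Element link : links) {
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div id=\"content\"><a href=\"/one\">One</a><a>No href</a></div>"
                + "<a href=\"/two\">Outside</a>";
        System.out.println(contentLinks(html)); // prints [One -> /one]
    }
}
```

Unlike `getElementById`, which returns null when the id is missing, `select` returns an empty `Elements` collection when nothing matches, so no null check is needed before the loop.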

Modify

Set attribute values

Use the attribute setter methods Element.attr(String key, String value) and Elements.attr(String key, String value).

To modify an element's class attribute, use the Element.addClass(String className) and Element.removeClass(String className) methods.
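A short sketch of these setters in use follows. The class `ModifyExample`, the helper `markLinks`, the class names `internal` and `external`, and the sample markup are assumptions chosen for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ModifyExample {
    // Sets rel="nofollow" on every link and swaps the class on the first one.
    static Document markLinks(String html) {
        Document doc = Jsoup.parse(html);
        // Elements.attr(key, value) sets the attribute on every matched element.
        doc.select("a").attr("rel", "nofollow");
        Element first = doc.select("a").first();
        if (first != null) {
            // Setters return the element, so calls can be chained.
            first.addClass("external").removeClass("internal");
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = markLinks(
                "<a class=\"internal\" href=\"/a\">A</a><a href=\"/b\">B</a>");
        System.out.println(doc.body().html());
    }
}
```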

Clean up HTML

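Sanitize untrusted HTML (to prevent XSS)

Use the jsoup HTML Cleaner with a configuration specified by a Whitelist (from jsoup 1.14.1 onward the class is named Safelist).

    String unsafe =
        "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
    String safe = Jsoup.clean(unsafe, Whitelist.basic());
    // now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>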



Origin blog.csdn.net/YHM_MM/article/details/103495612