Jsoup study notes
Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data.
GitHub Address: https://github.com/jhy/jsoup/
JAR package Download: https://jsoup.org/download
Parsing
Jsoup can parse HTML from several sources:
- HTML string;
- URL;
- local files;
HTML string parsing
If you have HTML in a string and want to parse its contents, use the static method Jsoup.parse(String html). It returns a Document object that you can then query.
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
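As a minimal runnable sketch of the snippet above, the class below parses a string and reads two values back from the resulting Document. The class name ParseStringExample and its helper methods are hypothetical names chosen for illustration; only Jsoup.parse, Document.title, select, first and text come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseStringExample {
    // Parses an HTML string and returns the document title ("" if there is none).
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.title();
    }

    // Returns the text of the first <p> element, or "" if the document has none.
    static String firstParagraph(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();
        return p == null ? "" : p.text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
        System.out.println(titleOf(html));
        System.out.println(firstParagraph(html));
    }
}
```

Jsoup.parse also accepts incomplete markup such as a bare "<p>Hi" and builds a full document around it, which is why the helpers guard against a missing element rather than against a parse failure.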
URL parsing
To fetch and parse a page from the web, use the static method Jsoup.connect(String url). It returns a Connection object; call get() or post() on it to send the request and parse the response into a Document. (Only http and https URLs are supported.)
Document doc = Jsoup.connect("http://example.com")
.data("query", "Java") // request parameter
.userAgent("Mozilla") // set the User-Agent header
.cookie("auth", "token") // set a cookie
.timeout(3000) // set the timeout in milliseconds
.post(); // send the request with the POST method
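The calls get() and post() throw IOException, so real code usually wraps them. The sketch below separates building the request (no network traffic) from sending it, and handles HTTP error statuses separately from other I/O failures. FetchExample, buildRequest and fetchTitle are hypothetical names; returning null on failure is an assumption made for brevity, not something jsoup prescribes.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchExample {
    // Configures the request without sending it.
    static Connection buildRequest(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")
                .timeout(3000) // milliseconds
                .method(Connection.Method.GET);
    }

    // Sends the request and returns the page title, or null if the request fails.
    static String fetchTitle(String url) {
        try {
            Document doc = buildRequest(url).get();
            return doc.title();
        } catch (HttpStatusException e) {
            // The server answered with an error status such as 404 or 500.
            System.err.println("HTTP " + e.getStatusCode() + " for " + e.getUrl());
        } catch (IOException e) {
            // Timeout, DNS failure, unsupported content type, and so on.
            System.err.println("Request failed: " + e.getMessage());
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(fetchTitle("http://example.com"));
    }
}
```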
Local file parsing
To parse a local HTML file, use the static method Jsoup.parse(File in, String charsetName, String baseUri). The overload parse(File in, String charsetName) uses the file's location as the baseUri.
File input = new File("/tmp/input.html");
// baseUri: the parser uses this to resolve relative URLs in the document before it finds a <base href> element. If you don't care about that, pass an empty string.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
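To make the role of baseUri concrete, the sketch below parses a string instead of a file (using the Jsoup.parse(String html, String baseUri) overload) and resolves a relative link with Element.absUrl. BaseUriExample and resolveFirstLink are hypothetical names used only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {
    // Returns the absolute URL of the first link, or "" if it cannot be resolved.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/docs/index.html'>Docs</a>";
        // With a base URI, the relative href becomes an absolute URL.
        System.out.println(resolveFirstLink(html, "http://example.com/"));
        // With an empty base URI, absUrl has nothing to resolve against.
        System.out.println(resolveFirstLink(html, "").isEmpty());
    }
}
```

link.attr("href") still returns the original relative value in both cases; only absUrl (or the "abs:href" attribute key) applies the base URI.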
Extracting data
Using DOM methods
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8");
// get the element whose id is "content"
Element content = doc.getElementById("content");
// get all elements with the tag name "a"
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    // get the value of the href attribute
    String linkHref = link.attr("href");
    // get the link text
    String linkText = link.text();
}
The following are some commonly used API methods:
Method name | description |
---|---|
getElementById(String id) | Finds an element by its ID |
getElementsByTag(String tag) | Finds elements with the given tag name |
getElementsByClass(String className) | Finds elements that have the given class |
getElementsByAttribute(String key) | Finds elements that have the given attribute set |
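The four methods in the table can be tried on a small inline document. This is a sketch under assumptions: the markup, the class name DomMethodsExample and the helper sample() are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DomMethodsExample {
    // A small hypothetical page used only for this demonstration.
    static Document sample() {
        String html = "<div id='content'>"
                + "<a class='nav' href='/home'>Home</a>"
                + "<a href='/about' title='About us'>About</a>"
                + "<span class='nav'>Menu</span>"
                + "</div>";
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = sample();
        Element content = doc.getElementById("content");
        // getElementById returns null when no element has the given id.
        if (content == null) {
            return;
        }
        System.out.println("links: " + content.getElementsByTag("a").size());
        System.out.println("nav items: " + doc.getElementsByClass("nav").size());
        System.out.println("with title: " + doc.getElementsByAttribute("title").size());
    }
}
```

Note the null check: unlike the getElementsBy... methods, which return an empty Elements collection when nothing matches, getElementById returns null.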
Using selectors
To find or manipulate elements with CSS or jQuery-like selector syntax, use the Element.select(String selector) and Elements.select(String selector) methods.
File input = new File("D:\\test.html"); // the backslash must be escaped in a Java string
Document doc = Jsoup.parse(input, "UTF-8");
// links that have an href attribute
Elements links = doc.select("a[href]");
// all images whose src ends with .png
Elements pngs = doc.select("img[src$=.png]");
// the first div with class="masthead"
Element masthead = doc.select("div.masthead").first();
// all elements matching <h3 class=r><a href="">...</a></h3>
Elements resultLinks = doc.select("h3.r > a");
Selector overview
parameter | description |
---|---|
ns\|tag | Finds elements by tag in a namespace, e.g. fb\|name finds <fb:name> elements |
#id | Finds elements by ID, e.g. #logo |
.class | Finds elements by class name, e.g. .head |
tagName | Finds elements by tag name, e.g. div |
[attribute] | Finds elements that have the attribute, e.g. [href] matches all elements with an href attribute |
[^attr] | Finds elements with an attribute name starting with the prefix, e.g. [^data-] finds elements with HTML5 dataset attributes |
[attr=value] | Finds elements by attribute value, e.g. [width=500] matches all elements whose width attribute is 500 |
[attr^=value] | Attribute value starts with value |
[attr$=value] | Attribute value ends with value |
[attr*=value] | Attribute value contains value |
[attr~=regex] | Attribute value matches the regular expression, e.g. img[src~=(?i)\.(png\|jpe?g)] |
* | Matches all elements |
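A few of the attribute selectors from the table can be checked against a small inline document. The markup and the names AttributeSelectorExample and count are assumptions made for this sketch; the selector strings themselves follow the table.

```java
import org.jsoup.Jsoup;

public class AttributeSelectorExample {
    // Hypothetical markup: three images, two anchors, one data-* attribute, one width.
    static final String HTML =
            "<div class='masthead'>"
            + "<img src='logo.png'><img src='photo.JPG'><img src='icon.gif'>"
            + "</div>"
            + "<a href='http://example.com/a'>A</a><a name='anchor'>B</a>"
            + "<p data-id='7'>x</p><table width='500'></table>";

    // Returns how many elements in the sample document match the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "a[href]",                      // anchors that have an href
            "img[src$=.png]",               // src ends with .png
            "img[src~=(?i)\\.(png|jpe?g)]", // src matches a case-insensitive regex
            "[^data-]",                     // any attribute name starting with data-
            "[width=500]"                   // width attribute equals 500
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

Inside a Java string literal the regex backslash has to be doubled, as in the third selector.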
Selector combinations
Pseudo-selectors (filters)
parameter | description |
---|---|
:lt(n) | Elements whose sibling index is less than n, e.g. td:lt(3) matches the first three cells of each row |
:gt(n) | Elements whose sibling index is greater than n, e.g. div p:gt(2) |
:eq(n) | Elements whose sibling index is equal to n, e.g. form input:eq(1) |
:has(selector) | Elements that contain an element matching the selector, e.g. div:has(p) matches div elements that contain a p |
:not(selector) | Elements that do not match the selector, e.g. div:not(.logo) matches all div elements without class="logo" |
:contains(text) | Elements that contain the given text (case-insensitive), e.g. p:contains(oschina) |
:containsOwn(text) | Elements whose own text, excluding descendants, contains the given text |
:matches(regex) | Elements whose text matches the regular expression, e.g. div:matches((?i)login) |
:matchesOwn(regex) | Elements whose own text, excluding descendants, matches the regular expression |
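The pseudo-selectors are easiest to understand by counting matches on a tiny document. The sketch below uses invented markup and the hypothetical names PseudoSelectorExample and count; sibling indexes are zero-based, so td:lt(3) covers indexes 0, 1 and 2.

```java
import org.jsoup.Jsoup;

public class PseudoSelectorExample {
    // Hypothetical markup: one table row with four cells and two div elements.
    static final String HTML =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
            + "<div><p>Login here</p></div>"
            + "<div class='logo'><span>brand</span></div>";

    // Returns how many elements in the sample document match the selector.
    static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "td:lt(3)",              // cells at index 0, 1, 2
            "td:gt(1)",              // cells at index 2, 3
            "td:eq(0)",              // the first cell
            "div:has(p)",            // div elements containing a p
            "div:not(.logo)",        // div elements without class logo
            "p:contains(login)",     // case-insensitive text match
            "div:containsOwn(login)",// the text belongs to the p, not to the div itself
            "p:matches((?i)login)"   // regex match on the text
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

The difference between :contains and :containsOwn shows up on the first div: its text comes entirely from the nested p, so div:containsOwn(login) matches nothing.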
Modifying data
jsoup can modify the elements of a page, for example by adding or removing attributes. Locate the elements with a selector or the DOM methods above, then change their attributes, text, or inner HTML. A tag can be renamed with Element.tagName(String), or you can remove the element and insert a new one.
After modifying, call the html() method on the Element(s) to get the modified HTML.
// add rel=nofollow to every link inside div.comments
doc.select("div.comments a").attr("rel", "nofollow");
// add the class mylinkclass to every link inside div.comments
doc.select("div.comments a").addClass("mylinkclass");
// remove the onclick attribute from all images
doc.select("img").removeAttr("onclick");
// clear the text of all text input fields
doc.select("input[type=text]").val("");
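The following sketch applies the same kinds of changes to an HTML fragment and returns the result, so the effect of the modifications can be inspected. ModifyExample and harden are hypothetical names, and the input markup is invented; Jsoup.parseBodyFragment is used here only because the input is a fragment rather than a full page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifyExample {
    // Applies the modifications to an HTML fragment and returns the changed markup.
    static String harden(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        // Elements methods return the same Elements, so calls can be chained.
        doc.select("div.comments a")
                .attr("rel", "nofollow")
                .addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");
        doc.select("input[type=text]").val("");
        // html() on the body returns its inner HTML, including the changes.
        return doc.body().html();
    }

    public static void main(String[] args) {
        String fragment = "<div class='comments'>"
                + "<a href='/x'>x</a>"
                + "<img src='a.png' onclick='track()'>"
                + "<input type='text' value='draft'>"
                + "</div>";
        System.out.println(harden(fragment));
    }
}
```

The exact whitespace of the returned markup depends on jsoup's pretty-print settings, so it is safer to check for specific attributes than to compare the whole string.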
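Sanitizing HTML
jsoup filters HTML against a Whitelist that allows only certain tags and attributes, which prevents a malicious user from injecting scripts into the page. (jsoup 1.14.2 and later name this class Safelist; Whitelist was deprecated and later removed.)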
String unsafe = "<p><a href='http://www.oschina.net/' onclick='stealCookies()'>"
+ " 开源中国社区 </a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// Output:
// <p><a href="http://www.oschina.net/" rel="nofollow"> 开源中国社区 </a></p>
Whitelist methods
Method name | Brief introduction |
---|---|
none() | Allows text only, no tags |
basic() | Allowed tags: a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, ul, with appropriate attributes |
simpleText() | Allows only the tags b, em, i, strong, u |
basicWithImages() | basic() plus images |
relaxed() | The most permissive filter. Allowed tags: a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul |
whitelist.addTags(String... tags) | Instance method: adds allowed tags. Use removeTags to remove them |
whitelist.addAttributes(String tag, String... attributes) | Instance method: adds allowed attributes for a tag. Use removeAttributes to remove them |
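A preset can be extended with the instance methods from the table. The sketch below assumes jsoup 1.14.2 or later, where the class is called Safelist; on older versions, org.jsoup.safety.Whitelist offers the same methods and can be swapped in. CleanExample and sanitize are hypothetical names, and allowing h2 and the title attribute on links is an arbitrary policy chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanExample {
    // Cleans untrusted HTML with basic() extended by one tag and one attribute.
    static String sanitize(String unsafe) {
        Safelist policy = Safelist.basic()
                .addTags("h2")                // additionally allow <h2>
                .addAttributes("a", "title"); // additionally allow title on <a>
        return Jsoup.clean(unsafe, policy);
    }

    public static void main(String[] args) {
        String unsafe = "<h2 onclick='x()'>Hi</h2>"
                + "<script>alert(1)</script>"
                + "<a href='http://example.com/' title='Home'>home</a>";
        System.out.println(sanitize(unsafe));
    }
}
```

Anything not named in the policy is dropped: the script element and the onclick attribute disappear, while the h2 tag and the link's title attribute survive.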