Jsoup study notes

Jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data.
GitHub: https://github.com/jhy/jsoup/
JAR download: https://jsoup.org/download

Parsing

Jsoup can parse HTML from several sources:

  1. an HTML string;
  2. a URL;
  3. a local file.

HTML string parsing

If you have HTML in a string and want to parse its contents, use the static Jsoup.parse(String html) method. It returns a Document object, which you can then query.

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

URL parsing

To fetch and parse a page over the network, use the static Jsoup.connect(String url) method. It returns a Connection object; calling get() or post() on it fetches the page and parses the HTML into a Document. (Only http and https URLs are supported.)

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")    // request parameter
  .userAgent("Mozilla")     // set the User-Agent header
  .cookie("auth", "token")  // set a cookie
  .timeout(3000)            // set the request timeout, in milliseconds
  .post();                  // send the request with the POST method

Local file parsing

To parse a local HTML file, use the static Jsoup.parse(File in, String charsetName, String baseUri) method. The overload parse(File in, String charsetName) uses the file's location as the baseUri.

File input = new File("/tmp/input.html");
// The parser uses baseUri to resolve relative URLs in the document until it finds a <base href> element.
// If you do not care about this, you can pass an empty string.
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
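The effect of baseUri is easiest to see with a relative link. The sketch below is only an illustration: the class name BaseUriDemo, the sample markup, and the example.com base are assumptions made for the demo. It uses the Jsoup.parse(String html, String baseUri) overload so that no file is needed, and Element.absUrl to read the resolved URL.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriDemo {
    // Parses a hypothetical snippet containing one relative link and
    // returns that link resolved against the given base URI.
    static String resolve(String baseUri) {
        Document doc = Jsoup.parse("<a href=\"/docs/page.html\">docs</a>", baseUri);
        Element link = doc.select("a").first();
        // absUrl resolves the attribute value against the document's base URI
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        System.out.println(resolve("http://example.com/"));
    }
}
```

link.attr("href") would still return the relative value /docs/page.html; only absUrl applies the base URI.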

Extracting data

DOM traversal

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8");

// Get the element whose id is "content"
Element content = doc.getElementById("content");
// Get all <a> elements inside it
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  // Read the value of the href attribute
  String linkHref = link.attr("href");
  // Read the link text
  String linkText = link.text();
}

The following are some commonly used API methods:

Method                                 Description
getElementById(String id)              Find an element by its id
getElementsByTag(String tag)           Find elements by tag name
getElementsByClass(String className)   Find elements that have the given class
getElementsByAttribute(String key)     Find elements that have the given attribute set
Using selectors

To find or manipulate elements with CSS or jQuery-like selector syntax, use the Element.select(String selector) and Elements.select(String selector) methods.

// The backslash must be escaped; "D:\test.html" would contain a tab character
File input = new File("D:\\test.html");
Document doc = Jsoup.parse(input, "UTF-8");
// Links that have an href attribute
Elements links = doc.select("a[href]");
// Images whose src ends with .png
Elements pngs = doc.select("img[src$=.png]");
// The first div with class=masthead
Element masthead = doc.select("div.masthead").first();
// All elements matching <h3 class=r><a href="">...</a></h3>
Elements resultLinks = doc.select("h3.r > a");

Selector overview

Selector        Description
ns|tag          Find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id             Find elements by id, e.g. #logo
.class          Find elements by class name, e.g. .head
tagName         Find elements by tag name, e.g. div
[attribute]     Find elements that have the attribute, e.g. [href] matches every element with an href attribute
[^attr]         Find elements by attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]    Find elements by attribute value, e.g. [width=500] matches every element whose width attribute is 500
[attr^=value]   Attribute value starts with the given value
[attr$=value]   Attribute value ends with the given value
[attr*=value]   Attribute value contains the given value
[attr~=regex]   Attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*               Matches all elements
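As a rough check of the attribute selectors above, the sketch below counts matches against a small piece of markup. The class name AttributeSelectorDemo and the sample HTML are made up for illustration; only Jsoup.parse and select are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AttributeSelectorDemo {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
        "<div id=\"logo\" class=\"head\" data-role=\"banner\">"
      + "<img src=\"a.png\" width=\"500\">"
      + "<img src=\"b.JPEG\" width=\"200\">"
      + "<a href=\"https://example.com/docs\">docs</a>"
      + "</div>";

    // Returns how many elements in the sample markup match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "#logo", ".head", "[^data-]", "[width=500]",
            "img[src$=.png]", "[href^=https]", "[href*=example]",
            "img[src~=(?i)\\.(png|jpe?g)]"   // regex; the backslash is doubled for Java
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

In this sample every selector matches one element, except the regular-expression selector, which matches both images because (?i) makes the extension match case-insensitive.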

Combining selectors

Selector              Description
el#id                 Element with the given id, e.g. a#logo
el.class              Element with the given class, e.g. div.head
el[attr]              Element with the given attribute, e.g. a[href]
Any combination       Any combination of the above, e.g. a[href]#logo, a[name].outerlink

The following five forms combine selectors by the relationship between elements (ancestor and descendant, parent and child, siblings) or as a list of alternatives.

Selector              Description
ancestor child        Descendants of an ancestor, e.g. div#page_wrapper div[class~=mainContent*]
parent > child        Direct children of a parent, e.g. div.infoSet > span
siblingA + siblingB   Sibling B elements immediately preceded by sibling A
siblingA ~ siblingX   Sibling X elements preceded by sibling A
el, el, el            Groups several selectors; finds the unique elements that match any of them
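The difference between the relationship forms is easier to see on a concrete tree. The following is a minimal sketch; the class name CombinatorDemo and the markup are assumptions made for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class CombinatorDemo {
    // Hypothetical markup: one span directly under div.infoSet, one nested
    // inside a <p>, then an <h2> followed by two sibling <p> elements.
    static final String HTML =
        "<div id=\"page\">"
      + "<div class=\"infoSet\"><span>a</span><p><span>b</span></p></div>"
      + "<h2>title</h2><p>x</p><p>y</p>"
      + "</div>";

    // Returns how many elements in the sample markup match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div.infoSet span"));   // any descendant span
        System.out.println(count("div.infoSet > span")); // direct child span only
        System.out.println(count("h2 + p"));             // the p right after the h2
        System.out.println(count("h2 ~ p"));             // every later sibling p
        System.out.println(count("h2, span"));           // union of both selectors
    }
}
```

With this markup the descendant form finds both spans while the child form finds only the first, and h2 + p finds one paragraph where h2 ~ p finds two.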

Filter usage

Selector             Description
:lt(n)               Elements whose sibling index is less than n (indexes are 0-based), e.g. td:lt(3) matches the first three cells of each row
:gt(n)               Elements whose sibling index is greater than n, e.g. div p:gt(2)
:eq(n)               Elements whose sibling index equals n, e.g. form input:eq(1)
:has(selector)       Elements that contain a match for the selector, e.g. div:has(p) matches div elements that contain a p
:not(selector)       Elements that do not match the selector, e.g. div:not(.logo) matches every div without class=logo
:contains(text)      Elements whose text, including descendants' text, contains the given text; case-insensitive, e.g. p:contains(oschina)
:containsOwn(text)   Elements whose own text, excluding descendants, contains the given text
:matches(regex)      Elements whose text matches the regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex)   Elements whose own text matches the regular expression
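A short sketch may help with the index-based and text-based filters, which are easy to misread. The class name PseudoSelectorDemo and the markup are hypothetical; the selectors are the ones from the table.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PseudoSelectorDemo {
    // Hypothetical markup: a row of five cells, a div.logo holding a
    // paragraph, and a plain div holding a span.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td></tr></table>"
      + "<div class=\"logo\"><p>Login here</p></div>"
      + "<div><span>other</span></div>";

    // Returns how many elements in the sample markup match the selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    // Returns the text of the first match; assumes the selector matches something.
    static String firstText(String selector) {
        return Jsoup.parse(HTML).select(selector).first().text();
    }

    public static void main(String[] args) {
        System.out.println(count("td:lt(3)"));               // cells at index 0, 1, 2
        System.out.println(count("td:gt(2)"));               // cells at index 3, 4
        System.out.println(firstText("td:eq(1)"));           // the second cell
        System.out.println(count("div:has(p)"));             // divs containing a p
        System.out.println(count("div:not(.logo)"));         // divs without class logo
        System.out.println(count("div:contains(login)"));    // text in a descendant counts
        System.out.println(count("div:containsOwn(login)")); // the div itself has no text
        System.out.println(count("p:matches((?i)login)"));   // case-insensitive regex
    }
}
```

The last three lines show the own-text distinction: the word sits in the paragraph, so div:contains finds the surrounding div but div:containsOwn does not.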

Modifying data

jsoup can also modify page elements, for example by adding or removing attributes. Use a selector to locate the elements, then call the modification methods on the result; attributes and text can both be changed. To change a tag name, you can delete the element and insert a new one, or rename it with Element.tagName(String).

After making changes, call the html() method on the Element or Elements to get the modified HTML.

// Add rel=nofollow to every link inside div.comments
doc.select("div.comments a").attr("rel", "nofollow");
// Add the class mylinkclass to every link inside div.comments
doc.select("div.comments a").addClass("mylinkclass");
// Remove the onclick attribute from every image
doc.select("img").removeAttr("onclick");
// Clear the value of every text input
doc.select("input[type=text]").val("");
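To see the result of such edits, the modified markup can be read back with html(). This is a small sketch under assumptions: the class name ModifyDemo and the comment markup are invented, and the exact whitespace of the serialized output may differ between jsoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ModifyDemo {
    // Applies a few edits to a hypothetical comment block and returns
    // the resulting body HTML.
    static String modify() {
        Document doc = Jsoup.parse(
            "<div class=\"comments\">"
          + "<a href=\"/u/1\">user</a>"
          + "<img src=\"x.png\" onclick=\"track()\">"
          + "</div>");
        // attr and addClass return the same Elements, so the calls can be chained
        doc.select("div.comments a").attr("rel", "nofollow").addClass("mylinkclass");
        doc.select("img").removeAttr("onclick");
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(modify());
    }
}
```

The returned string carries the new rel and class attributes on the link and no longer contains the onclick handler.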

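HTML document cleanup

jsoup can sanitize untrusted HTML against a Whitelist, which permits only certain tags and attributes, to prevent malicious users from injecting scripts into a page. (In jsoup 1.14.1 the class was renamed Safelist, and Whitelist was removed in 1.15.1.)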

// A Java string literal cannot span lines, so the markup is concatenated
String unsafe = "<p><a href='http://www.oschina.net/' onclick='stealCookies()'>"
  + " 开源中国社区 </a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// Output:
// <p><a href="http://www.oschina.net/" rel="nofollow"> 开源中国社区 </a></p>

Whitelist methods

Method                                                     Description
none()                                                     Allows only text nodes
basic()                                                    Allowed tags include a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li, ol, p, pre, q, small, strike, strong, sub, sup, u, ul, with appropriate attributes
simpleText()                                               Allows only the tags b, em, i, strong, u
basicWithImages()                                          Same as basic(), plus images
relaxed()                                                  Allows the widest range of tags, including a, b, blockquote, br, caption, cite, code, col, colgroup, dd, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul
whitelist.addTags(String... tags)                          Instance method: adds allowed tags; use removeTags to remove them
whitelist.addAttributes(String tag, String... attributes)  Instance method: adds allowed attributes for a tag; use removeAttributes to remove them
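The instance methods can extend one of the presets. The sketch below assumes jsoup 1.14.1 or later, where the class is named Safelist (on older versions, Whitelist offers the same methods); the class name CustomSafelistDemo and the input markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CustomSafelistDemo {
    // Cleans a hypothetical untrusted snippet with simpleText() extended
    // to also allow <a> elements and their href attribute.
    static String clean() {
        Safelist safelist = Safelist.simpleText()
            .addTags("a")
            .addAttributes("a", "href");
        String unsafe =
            "<b onclick=\"x()\">bold</b> "
          + "<a href=\"https://example.com/\" target=\"_blank\">link</a>"
          + "<script>alert(1)</script>";
        return Jsoup.clean(unsafe, safelist);
    }

    public static void main(String[] args) {
        System.out.println(clean());
    }
}
```

Anything not explicitly allowed is dropped: the onclick and target attributes and the script element do not appear in the cleaned output, while the b and a tags and the href attribute are kept.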

Origin blog.csdn.net/qq_16830879/article/details/89495306