Jsoup tutorial:

1. Introduction

Jsoup is an HTML parser that can parse a URL address or HTML text content directly. It also lets you use DOM, CSS, and jQuery-like methods to extract and manipulate data. Its main functions are:

1. Parse HTML from a URL, a file, or a string

2. Find and extract data

3. Manipulate HTML elements, attributes, and text

The Jsoup class inherits directly from Object and is declared as public class Jsoup extends Object.

It is the core public access point to the functionality of the jsoup library.

2. Method details

1. public static Document parse(String html, String baseUri) parses HTML into a Document. The parser builds a sensible document tree out of any HTML.

Here html is the markup to parse, and the URLs inside it are often written as relative paths. baseUri is the root against which those relative paths are resolved. This matters when you extract URLs from the HTML and need to turn relative paths into absolute ones.
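
As a rough illustration of how baseUri affects link resolution, the sketch below parses a small snippet and asks for the absolute form of its first link. The class name BaseUriExample, the helper resolveFirstLink, and the sample URLs are made up for this example; only Jsoup.parse, select, first, and absUrl come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {
    // Hypothetical helper: returns the absolute URL of the first <a> element,
    // or an empty string when the HTML contains no link.
    static String resolveFirstLink(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs/index.html\">Docs</a></p>";
        // The relative href is resolved against the assumed base URI.
        System.out.println(resolveFirstLink(html, "http://example.com/blog/"));
        // prints http://example.com/docs/index.html
    }
}
```

Note that attr("href") would still return the relative value as written in the markup; absUrl("href") is the call that applies the base URI.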

2. public static Document parse(String html, String baseUri, Parser parser) parses the HTML string using the specified parser.

3. public static Document parse(String html) parses an HTML string into a Document. No baseUri is passed here, so URL resolution depends on a <base href> tag in the HTML.
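
A small sketch of that behaviour, assuming a hypothetical helper firstLinkAbsUrl and made-up sample markup: without a base URI a relative link cannot be made absolute, while a <base href> tag in the document supplies one.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NoBaseUriExample {
    // Hypothetical helper: parses without a baseUri and returns the absolute
    // URL of the first link, or an empty string if it cannot be resolved.
    static String firstLinkAbsUrl(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        // No base URI anywhere: the relative href cannot be resolved.
        String plain = "<a href=\"b.html\">x</a>";
        System.out.println("[" + firstLinkAbsUrl(plain) + "]");

        // The <base href> tag provides the base URI for the links after it.
        String withBase = "<html><head><base href=\"http://example.com/a/\"></head>"
                + "<body><a href=\"b.html\">x</a></body></html>";
        System.out.println("[" + firstLinkAbsUrl(withBase) + "]");
    }
}
```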

4. public static Connection connect(String url) creates a Connection object for the specified URL, which is typically used to fetch and parse an HTML page.

For example:

Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();

Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();

5. public static Document parse(File in, String charsetName, String baseUri) throws IOException parses the contents of a file as HTML.

charsetName is the character encoding of the file; UTF-8 is usually a safe choice. An IOException is thrown if the file cannot be found, cannot be read, or the charset is invalid.

6. public static Document parse(File in, String charsetName) throws IOException parses the contents of a file as HTML, using the location of the file as the baseUri. Everything else is the same as 5 above.
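
The file-based overloads can be sketched as follows. The class FileParseExample, its helpers, the temporary file, and the base URI are all assumptions made for the demonstration; the checked IOException is wrapped so the helpers are easy to call.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileParseExample {
    // Hypothetical helper: parses a UTF-8 HTML file and returns its <title> text.
    static String titleOfFile(File file) {
        try {
            Document doc = Jsoup.parse(file, "UTF-8", "http://example.com/");
            return doc.title();
        } catch (IOException e) {
            // Raised for a missing or unreadable file, or an invalid charset.
            throw new UncheckedIOException(e);
        }
    }

    // Writes a small page to a temporary file, parses it, then cleans up.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("jsoup-demo", ".html");
            try {
                String html = "<html><head><title>Local page</title></head><body></body></html>";
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return titleOfFile(tmp.toFile());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints Local page
    }
}
```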

7. public static Document parse(InputStream in, String charsetName, String baseUri) throws IOException reads an input stream and parses it into a Document.

8. public static Document parse(InputStream in, String charsetName, String baseUri, Parser parser) throws IOException reads an input stream and parses it using the specified parser.
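
A minimal sketch of the two stream overloads, using an in-memory stream so that it runs without a network or file. The class StreamParseExample, its helper names, and the sample markup are illustrative assumptions; Parser.xmlParser() is chosen here only as one example of a non-default parser.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamParseExample {
    // Hypothetical helper: wraps a string in an in-memory UTF-8 stream.
    private static InputStream streamOf(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    // Item 7: parse the stream with the default HTML parser.
    static String htmlTitle(String html) {
        try (InputStream in = streamOf(html)) {
            Document doc = Jsoup.parse(in, "UTF-8", "http://example.com/");
            return doc.title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Item 8: parse the stream with an explicitly chosen parser (XML here).
    static String xmlItemTitle(String xml) {
        try (InputStream in = streamOf(xml)) {
            Document doc = Jsoup.parse(in, "UTF-8", "http://example.com/", Parser.xmlParser());
            return doc.select("item > title").text();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(htmlTitle("<html><head><title>Hi</title></head><body></body></html>"));
        System.out.println(xmlItemTitle("<item><title>A</title></item>"));
    }
}
```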

9. public static Document parseBodyFragment(String bodyHtml, String baseUri) parses a fragment of HTML that is assumed to be part of the body. baseUri is specified.

10. public static Document parseBodyFragment(String bodyHtml) parses a fragment of HTML that is assumed to be part of the body. baseUri is not specified.
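
To make the fragment case concrete, here is a short sketch with a hypothetical class FragmentExample and sample markup of our own. The fragment is placed inside the body of a newly created Document, so it is read back through doc.body().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FragmentExample {
    // Hypothetical helper: text content of a body fragment after parsing.
    static String fragmentText(String bodyHtml) {
        Document doc = Jsoup.parseBodyFragment(bodyHtml);
        return doc.body().text();
    }

    // Hypothetical helper: number of <p> elements found in the fragment.
    static int paragraphCount(String bodyHtml) {
        Document doc = Jsoup.parseBodyFragment(bodyHtml);
        return doc.body().select("p").size();
    }

    public static void main(String[] args) {
        // The <p> tag is left unclosed on purpose; the parser still builds a tree.
        String fragment = "<p>Hello <b>world</b>";
        System.out.println(fragmentText(fragment));   // prints Hello world
        System.out.println(paragraphCount(fragment)); // prints 1
    }
}
```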

11. public static Document parse(URL url, int timeoutMillis) throws IOException fetches the specified URL and parses its HTML into a Document. This method is provided for compatibility; in most cases connect(String url) is the better choice.

An IOException is thrown if the response code is not 200 or if an error occurs while reading the response.

12. public static String clean(String bodyHtml, String baseUri, Whitelist whitelist) filters the input HTML against a whitelist of permitted tags and attributes to produce safe HTML. baseUri is specified.

13. public static String clean(String bodyHtml, Whitelist whitelist) filters the input HTML against a whitelist of permitted tags and attributes to produce safe HTML. baseUri is not specified.

14. public static boolean isValid(String bodyHtml, Whitelist whitelist) checks whether the input HTML contains only the tags and attributes permitted by the whitelist.
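
The cleaning and validation methods can be tried with a sketch like the one below. One assumption to flag: it uses Safelist, the name this class has in jsoup 1.14.2 and later; on older versions that still ship Whitelist, the same calls should work with Whitelist.basic() instead. The class SanitizeExample, its helpers, and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeExample {
    // Hypothetical helper: strips everything the basic safelist does not permit.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    // Hypothetical helper: true only if the input already satisfies the safelist.
    static boolean isSafe(String untrustedHtml) {
        return Jsoup.isValid(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String unsafe = "<p><a href=\"http://example.com/\" onclick=\"stealCookies()\">Link</a></p>";

        // The onclick attribute is not on the basic safelist, so it is dropped.
        System.out.println(sanitize(unsafe));

        System.out.println(isSafe(unsafe));         // prints false
        System.out.println(isSafe("<p>Hello</p>")); // prints true
    }
}
```

A common pattern is to call isValid when rejecting bad input outright is acceptable, and clean when the input should be repaired and stored.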

Origin www.cnblogs.com/blogst/p/12671120.html