Java Crawler Techniques: Jsoup

Java is usually thought of as a language for enterprise systems development, but it is also very capable for web crawling: it has a mature ecosystem and a solid language foundation, with ample support for both fetching and data processing. Back in school I started reading a book on crawlers but never finished it, and now that I am working I have little spare time, so here I am recording some notes on the relevant frameworks and key techniques.

One, Jsoup overview

1. Official website

https://jsoup.org

2. Functional Description

In a crawler, Jsoup serves as the HTML parser, while the fetching itself can be handled by a framework such as HttpClient. Jsoup can also send ordinary requests on its own over HTTP and HTTPS; this support is not very rich, but it is enough for everyday cases.
Jsoup can load an HTML page from a string, a file, or a URL and build a Document object from it. It provides jQuery-like methods and CSS selectors for finding elements, so HTML can be parsed in many flexible ways. Developers familiar with HTML and jQuery can get started very quickly.
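As a minimal sketch of the first two input sources (a string and a local file), the class below parses the same markup both ways. The class name ParseSourcesSketch, the sample markup, and the temporary file are illustrative assumptions, not part of the original notes; loading from a URL is left out here because it needs network access.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class ParseSourcesSketch {

    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
            "<html><head><title>Demo</title></head><body><p>hello</p></body></html>";

    // Parse HTML that is already in memory as a String.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Parse HTML stored in a local file; the charset should match how the file was written.
    static Document fromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Write the sample markup to a temporary file so there is something to read back.
        File tmp = File.createTempFile("jsoup-demo", ".html");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), HTML.getBytes(StandardCharsets.UTF_8));

        System.out.println(fromString(HTML).title());          // Demo
        System.out.println(fromFile(tmp).select("p").text());  // hello
    }
}
```

Both paths end in the same Document type, so the lookup code that follows does not need to know where the HTML came from.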

Two, Jsoup in practice

1. Usage examples

  1. Maven dependency
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
  2. Parse HTML from a string
String html = "<html><body>test</body></html>";
Document document = Jsoup.parse(html);
  3. Fetch HTML by requesting a URL (GET request)
// Always set a timeout on network requests so the program does not wait indefinitely; without one, threads can easily block in multithreaded code
Document document = Jsoup.connect("https://www.baidu.com")
        .timeout(1000)
        .get();
System.out.println(document.toString());
  4. Send a POST request to an API endpoint
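// Connection here is org.jsoup.Connection
Connection connection = Jsoup.connect("http://192.168.1.1:8080/api")
        .header("Accept", "*/*")
        .header("Accept-Language", "zh-CN,zh;q=0.9")
        .header("Connection", "keep-alive")
        // The body below is JSON, so declare it as JSON rather than as form data
        .header("Content-Type", "application/json")
        .header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36")
        .timeout(3000) // milliseconds
        .method(Connection.Method.POST)
        // Do not fail when the response is not HTML or XML (for example, a JSON API)
        .ignoreContentType(true);
Map<String, String> params = new HashMap<>();
params.put("param1", "1");
params.put("param2", "2");
// JSON.toJSONString is assumed to be fastjson's com.alibaba.fastjson.JSON; any JSON serializer would do
connection.requestBody(JSON.toJSONString(params));
// Keep the response so the result can be read
Connection.Response response = connection.execute();
System.out.println(response.body());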
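If the endpoint expects form fields rather than a JSON body, Jsoup's data(...) can carry the parameters instead of requestBody(...). The following is a hedged sketch that only builds the request and inspects it, without sending anything; the class name FormPostSketch and the helper buildFormPost are hypothetical, and the URL is the placeholder address from the example above.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.util.LinkedHashMap;
import java.util.Map;

public class FormPostSketch {

    // Builds (but does not send) a POST whose parameters travel as form fields.
    static Connection buildFormPost(String url, Map<String, String> params) {
        return Jsoup.connect(url)
                .timeout(3000)                   // milliseconds
                .method(Connection.Method.POST)
                .ignoreContentType(true)         // the API may not answer with HTML
                .data(params);                   // each entry becomes one form field
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("param1", "1");
        params.put("param2", "2");

        Connection connection = buildFormPost("http://192.168.1.1:8080/api", params);

        // Inspect the prepared request; calling connection.execute() would actually send it.
        System.out.println(connection.request().method());
        for (Connection.KeyVal kv : connection.request().data()) {
            System.out.println(kv.key() + "=" + kv.value());
        }
    }
}
```

Building the request separately from executing it also makes it easy to log or check what is about to be sent.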
  5. Parse HTML with DOM-style methods (similar to jQuery)
Elements elements = document.getElementsByTag("body");

The other get methods resemble jQuery's, and their purpose is clear from their names, as shown in the figure:
[Figure: screenshot of the method list]
  6. Parse HTML with CSS selectors

Element element = elements.select("p[align='center']").first();
System.out.println(element.html());

For CSS selector syntax, see the W3School CSS selector reference:
https://www.w3school.com.cn/cssref/css_selectors.ASP
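To make the two lookup styles concrete, here is a small self-contained sketch that runs both against an inline page. The class name SelectorSketch, the sample markup, and the base URI https://example.com/ are assumptions made for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectorSketch {

    // Hypothetical sample page used only for this illustration.
    static final String HTML =
            "<html><body>"
            + "<div id=\"main\" class=\"box\">"
            + "<p align=\"center\">centered <b>text</b></p>"
            + "<p>plain</p>"
            + "<a href=\"/about\">About</a>"
            + "</div></body></html>";

    public static void main(String[] args) {
        // The second argument is the base URI used to resolve relative links.
        Document document = Jsoup.parse(HTML, "https://example.com/");

        // DOM-style lookups, comparable to jQuery's helpers.
        Element mainDiv = document.getElementById("main");
        Elements paragraphs = document.getElementsByTag("p");
        Elements boxes = document.getElementsByClass("box");

        // CSS selector lookups.
        Element centered = document.select("p[align=center]").first();
        Element link = document.select("div.box > a[href]").first();

        System.out.println(mainDiv.id());          // main
        System.out.println(paragraphs.size());     // 2
        System.out.println(boxes.size());          // 1

        // first() returns null when nothing matches, so check before use.
        if (centered != null) {
            System.out.println(centered.text());   // centered text
            System.out.println(centered.html());   // inner HTML of the paragraph
        }
        if (link != null) {
            System.out.println(link.attr("href"));   // /about
            System.out.println(link.absUrl("href")); // https://example.com/about
        }
    }
}
```

The null checks matter in real crawls, where a page layout change can make a selector match nothing.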

2. Downloading files with Jsoup (crawling network resources)

When I first crawled files I tried several approaches, and because the files were large I ran into a number of problems. For example, when downloading with IOUtils from the commons-io package across multiple threads, a network or resource problem could easily leave a thread blocked, so a timeout was needed. That led me to Jsoup itself. By default, however, Jsoup only supports downloads of up to 1 MB; for anything larger, maxBodySize must be set. The timeout is also very important and should be chosen according to how the program runs and the network conditions.
An example using Jsoup:

Connection.Response response = Jsoup.connect(url)
        .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36")
        .cookies(cookieMap)
        .maxBodySize(30000000)
        .timeout(60 * 1000)
        .ignoreContentType(true)
        .execute();
byte[] dataArray = response.bodyAsBytes();

Once you have the byte array, you can write it straight to a file with the methods in commons-io, which is very convenient.
Alternatively, the body can be obtained as a buffered input stream:

BufferedInputStream bufferedInputStream = response.bodyStream();
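Either result can then be written to disk. The sketch below uses the JDK's java.nio.file.Files rather than commons-io so that it stays self-contained (a deliberate substitution, not the approach named above). DownloadSaveSketch, saveBytes, and saveStream are hypothetical names, and a ByteArrayInputStream stands in for response.bodyStream().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DownloadSaveSketch {

    // Writes a fully buffered body (for example response.bodyAsBytes()) to a file.
    static void saveBytes(byte[] data, Path target) throws IOException {
        Files.write(target, data);
    }

    // Copies a body stream (for example response.bodyStream()) to a file chunk by chunk
    // and closes the stream afterwards. Returns the number of bytes copied.
    static long saveStream(InputStream in, Path target) throws IOException {
        try (InputStream body = in) {
            return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for downloaded content.
        byte[] fake = "pretend file content".getBytes(StandardCharsets.UTF_8);

        Path a = Files.createTempFile("download-bytes", ".bin");
        Path b = Files.createTempFile("download-stream", ".bin");

        saveBytes(fake, a);
        long copied = saveStream(new ByteArrayInputStream(fake), b);

        System.out.println(Files.size(a) == fake.length); // true
        System.out.println(copied == fake.length);        // true
    }
}
```

The stream variant avoids building a second full copy of the file in memory before writing, which helps with the large files mentioned above.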

Origin blog.csdn.net/womeng2009/article/details/104001574