[Java crawler] Jsoup

Official website
Chinese manual

jsoup.jar official website download
jsoup.jar Baidu network disk download Extraction code: g6ek

jsoup is a Java HTML parser. It can parse HTML directly from a URL, a file, or a string, and it provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. With it you can:

Parse HTML from a URL, a file, or a string;
Use DOM methods or CSS selectors to find and extract data;
Manipulate HTML elements, attributes, and text.

The main classes of Jsoup


The org.jsoup.Jsoup class

The Jsoup class is the entry point of any jsoup program and provides methods to load and parse HTML documents from various sources. The methods used in this article are connect(String url), parse(File in, String charsetName), parse(String html), and parseBodyFragment(String bodyHtml).


The org.jsoup.nodes.Document class

This class represents an HTML document loaded through the jsoup library. Use it to perform operations that apply to the whole HTML document. The full method list is at http://jsoup.org/apidocs/org/jsoup/nodes/Document.html ; the methods used in this article include title(), body(), and the inherited select(String cssQuery).

The org.jsoup.nodes.Element class

An HTML element consists of a tag name, attributes, and child nodes. Use the Element class to extract data, traverse nodes, and manipulate HTML. The full method list is at http://jsoup.org/apidocs/org/jsoup/nodes/Element.html ; the methods used most in this article include select(String cssQuery), text(), attr(String key), and html().

Preparation

Create a Java project for the jsoup examples and add the jsoup jar to it. The version used here is jsoup-1.11.3, which is available from the Baidu network disk link above.

Crawling examples


Crawl web content

To load a document from a URL, use the Jsoup.connect() method.

The page to be crawled is http://www.ygdy8.net/html/gndy/index.html , and the target is its domestic movie download ranking. The code is:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

/**
 * Gets the "domestic movie download ranking" from http://www.ygdy8.net/html/gndy/index.html
 */
public class JsoupDemo {

    public static void main(String[] args) {
        // URLLoader is a static method, so it can be called directly through the class name
        JsoupDemo.URLLoader("http://www.ygdy8.net/html/gndy/index.html");
    }

    public static void URLLoader(String url) {
        Document d1;

        try {
            // Jsoup.connect(url).get() fetches the page and returns a Document
            // that holds the content of the whole HTML page
            d1 = Jsoup.connect(url).get();
            System.out.println("Page title: " + d1.title() + "\n");

            /*
             * 1. Inspecting the page structure shows that the domestic movie download
             *    ranking sits in an element with the class co_content2.
             * 2. An attribute selector combined with descendant selectors picks the links:
             *    "div[class='co_content2'] ul a".
             * 3. The text() method of Element then returns the text content.
             */
            Elements es = d1.select("div[class='co_content2'] ul a");

            // Iterate over the results and print the text of each one
            for (Element e : es) {
                System.out.println(e.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Crawl a local file

To load a document from a file, use the Jsoup.parse(File in, String charsetName) method.

The page to be crawled is an HTML document saved on my local disk.

The code is:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;

/**
 * Loads a document from a local file and lists the main service categories on the Taobao page.
 * File path: C:\Users\vsue\Desktop\taobao.html
 */
public class JsoupDocDemo {

    public static void main(String[] args) {
        JsoupDocDemo.LocLoader("C:\\Users\\vsue\\Desktop\\taobao.html");
    }

    public static void LocLoader(String address) {
        Document d2;
        try {
            // Jsoup.parse(File, charset) loads and parses the HTML file
            d2 = Jsoup.parse(new File(address), "utf-8");
            System.out.println(d2.title());

            Elements es = d2.select("ul[class='service-bd'] li a");

            for (Element e : es) {
                System.out.println(e.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}



Crawl the content of a String

To load a document from a String, use the Jsoup.parse(String html) method.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupStrDemo {

    public static void main(String[] args) {
        String html = "<html>"
                + "<head>"
                + "<title>First parse</title>"
                + "</head>"
                + "<body>"
                + "<p>Parsed HTML into a doc.</p>"
                + "<a href='http://www.baidu.com'>百度一下</a>"
                + "</body>"
                + "</html>";
        JsoupStrDemo.StringLoader(html);
    }

    public static void StringLoader(String html) {
        // Jsoup.parse(String) parses the HTML held in the string
        Document d3 = Jsoup.parse(html);
        // attr() on an Elements collection returns the value from the first match that has it
        String url = d3.select("a").attr("href");
        System.out.println(d3.title() + "    " + url);
    }
}

Result:
First parse    http://www.baidu.com

Get all links on the page

A page usually contains many URLs that lead to other pages. The method below collects all the URLs on the home page of JD.com (Jingdong).

Select all the <a> tags in the page, then iterate over them.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

/**
 * Gets all the links in an HTML page
 */
public class JsoupAllUrlDemo {

    public static void main(String[] args) {
        JsoupAllUrlDemo.allUrlLoader("https://www.jd.com/");
    }

    public static void allUrlLoader(String address) {
        Document d4;
        try {
            d4 = Jsoup.connect(address).get();

            // links holds every <a> element on the page that has an href attribute
            Elements links = d4.select("a[href]");
            for (Element link : links) {
                System.out.println("text : " + link.text() + " ---> link : " + link.attr("href"));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
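
The href values printed above come back exactly as they are written in the page, so relative links such as /help stay unresolved. Below is a minimal sketch of one way to get absolute URLs, assuming the document has a base URI. A document fetched with Jsoup.connect(...).get() gets its base URI from the fetched URL; here it is passed to Jsoup.parse by hand so the example runs offline. The class name, the sample HTML, and the example.com base URI are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class AbsoluteLinksSketch {

    // Returns every link in the HTML, resolved against baseUri.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // The "abs:" prefix asks jsoup to resolve the value against the base URI
            urls.add(link.attr("abs:href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup: one relative link, one absolute link
        String html = "<a href='/help'>Help</a><a href='https://www.jd.com/'>JD</a>";
        for (String url : absoluteLinks(html, "https://example.com/")) {
            System.out.println(url);
        }
    }
}
```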

Parse a body fragment

Method
Use the Jsoup.parseBodyFragment(String html) method.

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

Description
The parseBodyFragment method creates an empty shell document and inserts the parsed HTML into its body element. The normal Jsoup.parse(String html) method generally gives the same result, but treating the input explicitly as a body fragment ensures that any malformed HTML supplied by a user ends up inside the body element.

The Document.body() method returns the document's body element, the same element that doc.getElementsByTag("body") finds; its children are the top-level elements of the parsed fragment.
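
A small sketch of the behaviour described above, using the same unclosed fragment. The class and method names are made up for illustration; only Jsoup.parseBodyFragment and the Element accessors come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BodyFragmentSketch {

    // Parses an HTML fragment and returns the body element that wraps it.
    static Element parseFragment(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body();
    }

    public static void main(String[] args) {
        // The closing </div> is missing on purpose; the parser closes it
        Element body = parseFragment("<div><p>Lorem ipsum.</p>");
        System.out.println(body.tagName());          // tag name of the wrapping element
        System.out.println(body.children().size());  // number of top-level elements in the fragment
        System.out.println(body.html());             // the repaired inner HTML
    }
}
```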

Data extraction

DOM method to traverse the document

After parsing HTML into a Document, you can navigate it with DOM-like methods.
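
A minimal sketch of this DOM-style navigation: look up a container by id, then collect the links inside it. The class name, the method name, the sample HTML, and the id "content" are assumptions chosen for illustration; getElementById, getElementsByTag, attr, and text are the jsoup methods listed in this section.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class DomTraversalSketch {

    // Hypothetical sample page: two links inside #content, one outside
    static final String SAMPLE =
            "<div id='content'><a href='/a'>First</a><a href='/b'>Second</a></div>"
            + "<a href='/c'>Outside</a>";

    // Collects "text -> href" for every <a> inside the element with the given id.
    static List<String> linksInside(String html, String id) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        Element container = doc.getElementById(id);
        if (container == null) {
            return result; // no element with that id
        }
        Elements links = container.getElementsByTag("a");
        for (Element link : links) {
            result.add(link.text() + " -> " + link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : linksInside(SAMPLE, "content")) {
            System.out.println(line);
        }
    }
}
```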

Explanation
Element objects provide a series of DOM-like methods for finding elements and for extracting and processing their data, as follows:

Find element

  • getElementById(String id)
  • getElementsByTag(String tag)
  • getElementsByClass(String className)
  • getElementsByAttribute(String key) (and related methods)
  • Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
  • Graph: parent(), children(), child(int index)

Element data

  • attr(String key) gets an attribute, attr(String key, String value) sets one, and attributes() gets all attributes
  • id(), className() and classNames()
  • text() gets the text content; text(String value) sets it
  • html() gets the element's inner HTML; html(String value) sets it
  • outerHtml() gets the element's outer HTML (the element's own tag plus its inner HTML)
  • data() gets data content (for example, the content of script and style tags)
  • tag() and tagName()

Manipulate HTML and text

  • append(String html), prepend(String html)
  • appendText(String text), prependText(String text)
  • appendElement(String tagName), prependElement(String tagName)
  • html(String value)

Use selector syntax to find elements

Method
Use the Element.select(String selector) and Elements.select(String selector) methods to find and operate on elements.

jsoup elements support a selector syntax similar to CSS (or jQuery), which makes searches powerful and flexible.

The select method is available on Document, Element, and Elements objects. It is context-sensitive, so you can filter within specific elements or chain select calls.

The select method returns an Elements collection, which provides a set of methods to extract and process the results.
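
A short sketch of select in use, including a chained call on an Elements result. The class name, helper names, and sample HTML are assumptions for illustration; the selectors themselves (a[href], h3.r > a, img[src$=.png], div.masthead) are ordinary jsoup selector syntax.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class SelectSketch {

    // Hypothetical sample page used only for this illustration
    static final String SAMPLE =
            "<div class='masthead'><a href='/home'>Home</a><img src='logo.png'></div>"
            + "<h3 class='r'><a href='/result'>Result</a></h3>"
            + "<p><a>No href</a></p>";

    // Returns the text of every element matched by cssQuery.
    static List<String> textsOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return texts(doc.select(cssQuery));
    }

    // Chained selection: narrow to the containers first, then select links inside them.
    static List<String> linkTextsInside(String html, String containerQuery) {
        Document doc = Jsoup.parse(html);
        Elements containers = doc.select(containerQuery);
        return texts(containers.select("a[href]"));
    }

    private static List<String> texts(Elements elements) {
        List<String> result = new ArrayList<>();
        for (Element element : elements) {
            result.add(element.text());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(textsOf(SAMPLE, "a[href]"));              // links that have an href
        System.out.println(textsOf(SAMPLE, "h3.r > a"));             // direct <a> children of <h3 class="r">
        System.out.println(linkTextsInside(SAMPLE, "div.masthead")); // links inside the masthead only

        // first() returns the first match, or null when nothing matched
        Element png = Jsoup.parse(SAMPLE).select("img[src$=.png]").first();
        System.out.println(png == null ? "no png" : png.attr("src"));
    }
}
```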


Selector overview

  • tagname: Find elements by tag, for example: a
  • ns|tag: Find elements by tag in a namespace, for example: fb|name finds <fb:name> elements
  • #id: Find elements by ID, for example: #logo
  • .class: Find elements by class name, for example: .masthead
  • [attribute]: Find elements that have the attribute, for example: [href]
  • [^attr]: Find elements by attribute name prefix, for example: [^data-] finds elements with HTML5 dataset attributes
  • [attr=value]: Find elements by attribute value, for example: [width=500]
  • [attr^=value], [attr$=value], [attr*=value]: Find elements whose attribute value starts with, ends with, or contains the value, for example: [href*=/path/]
  • [attr~=regex]: Find elements whose attribute value matches the regular expression, for example: img[src~=(?i)\.(png|jpe?g)]
  • *: Matches all elements

Selector combinations

  • el#id: Element with ID, for example: div#logo
  • el.class: Element with class, for example: div.masthead
  • el[attr]: Element with attribute, for example: a[href]
  • Any combination, for example: a[href].highlight
  • ancestor child: Find elements that descend from another element, for example: .body p finds all p elements anywhere under an element with the class "body"
  • parent > child: Find the direct children of a parent element, for example: div.content > p finds p elements directly under div.content, and body > * finds all the direct children of the body tag
  • siblingA + siblingB: Find a sibling B element that immediately follows sibling A, for example: div.head + div
  • siblingA ~ siblingX: Find a sibling X element that comes after sibling A at the same level, for example: h1 ~ p
  • el, el, el: Group multiple selectors and find the unique elements that match any of them, for example: div.masthead, div.logo

Pseudo selectors

  • :lt(n): Find elements whose sibling index (their position among the children of their parent) is less than n, for example: td:lt(3) matches cells with sibling index 0, 1, or 2, normally the first three columns
  • :gt(n): Find elements whose sibling index is greater than n, for example: div p:gt(2) matches p elements inside a div whose sibling index is greater than 2
  • :eq(n): Find elements whose sibling index is equal to n, for example: form input:eq(1) matches an input inside a form whose sibling index is 1
  • :has(selector): Find elements that contain an element matching the selector, for example: div:has(p) matches div elements that contain a p element
  • :not(selector): Find elements that do not match the selector, for example: div:not(.logo) matches all div elements that do not have the class logo
  • :contains(text): Find elements that contain the given text; the search is case-insensitive, for example: p:contains(jsoup)
  • :containsOwn(text): Find elements that directly contain the given text
  • :matches(regex): Find elements whose text matches the specified regular expression, for example: div:matches((?i)login)
  • :matchesOwn(regex): Find elements whose own text matches the specified regular expression

Note: The pseudo-selector indexes above are 0-based: the first element has index 0, the second has index 1, and so on.
See the Selector API reference for more details.
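
The pseudo selectors are easier to follow on a concrete page. The sketch below runs several of them against a small, made-up HTML string and prints the text of each match; the class name, helper name, and sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class PseudoSelectorSketch {

    // Hypothetical sample page: one table row with four cells and two divs
    static final String SAMPLE =
            "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
            + "<div class='logo'><p>jsoup Login</p></div>"
            + "<div><span>other</span></div>";

    // Returns the text of every element matched by cssQuery.
    static List<String> textsOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            result.add(element.text());
        }
        return result;
    }

    public static void main(String[] args) {
        String[] queries = {
            "td:lt(2)",               // cells with sibling index 0 or 1
            "td:gt(2)",               // cells with sibling index above 2
            "td:eq(1)",               // the cell with sibling index 1
            "div:has(p)",             // divs that contain a p
            "div:not(.logo)",         // divs without the class logo
            "p:contains(JSOUP)",      // case-insensitive text search
            "div:matches((?i)login)"  // text matched by a regular expression
        };
        for (String query : queries) {
            System.out.println(query + " => " + textsOf(SAMPLE, query));
        }
    }
}
```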



Extract attributes, text and HTML from elements

After parsing a Document and finding some elements, you will want to get the data inside those elements.

Method

  • To get the value of an attribute, use the Node.attr(String key) method
  • To get the text of an element, use the Element.text() method
  • To get the HTML content of an element, use Element.html() for the inner HTML or Node.outerHtml() for the element including its own tag

The methods above are the core methods for accessing element data. Some other methods are also available:

  • Element.id()
  • Element.tagName()
  • Element.className() and Element.hasClass(String className)

These accessor methods have corresponding setter methods to change the data.
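
A minimal sketch of these accessors on a single link. The class name, method name, and sample HTML are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class ExtractDataSketch {

    // Hypothetical sample markup: a link with bold text inside a paragraph
    static final String SAMPLE =
            "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";

    // Parses the HTML and returns its first <a> element (null if there is none).
    static Element firstLink(String html) {
        return Jsoup.parse(html).select("a").first();
    }

    public static void main(String[] args) {
        Element link = firstLink(SAMPLE);
        System.out.println(link.attr("href"));  // the attribute value
        System.out.println(link.text());        // the text without any markup
        System.out.println(link.html());        // the inner HTML
        System.out.println(link.outerHtml());   // the <a> tag itself plus its inner HTML
        System.out.println(link.tagName());     // the tag name
    }
}
```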


Change the data

Problem

After parsing a Document, you may want to modify some of its attribute values and then save the result to disk or output it to a front-end page.

Method
Use the attribute setter methods Element.attr(String key, String value) and Elements.attr(String key, String value).

To modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods.

Elements provides methods to manipulate the attributes and classes of many elements at once. For example, to add rel="nofollow" to every a element inside a div with the class comments:

doc.select("div.comments a").attr("rel", "nofollow");

Explanation
Like the other setter methods in Element, attr returns the current Element (or the Elements collection when called on a selection result), which makes method chaining convenient. For example:

doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
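
The two one-liners above can be tried end to end with the sketch below. The class name, method name, and sample HTML are assumptions for illustration; the selectors and setter calls are the ones shown in this section.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ModifyAttributesSketch {

    // Hypothetical sample page: a comments block with two links and a masthead
    static final String SAMPLE =
            "<div class='comments'><a href='/u/1'>One</a><a href='/u/2'>Two</a></div>"
            + "<div class='masthead'>Header</div>";

    // Applies the attribute and class changes and returns the modified document.
    static Document apply(String html) {
        Document doc = Jsoup.parse(html);
        // Batch update: every link inside div.comments gets rel="nofollow"
        doc.select("div.comments a").attr("rel", "nofollow");
        // Chained setters: attr returns the Elements collection, so addClass can follow
        doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(apply(SAMPLE).body().html());
    }
}
```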


Set the HTML content of an element

Method
Use the HTML setter methods of Element: html(String html), prepend(String first), append(String last), and wrap(String around).

Description

  • Element.html(String html) first clears the HTML content of the element and then replaces it with the HTML passed in.
  • Element.prepend(String first) and Element.append(String last) add HTML content at the start and at the end of the element's inner HTML, respectively.
  • Element.wrap(String around) wraps the element in the given outer HTML.
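
A minimal sketch of the four HTML setters working together. The class name, method name, and the starting markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SetHtmlSketch {

    // Builds a small document, edits it with the HTML setters, and returns it.
    static Document edit() {
        // Hypothetical starting markup: a div with old content and a span
        Document doc = Jsoup.parse("<div>old</div><span>One</span>");

        Element div = doc.select("div").first();
        div.html("<p>lorem ipsum</p>");  // replaces the old content
        div.prepend("<p>First</p>");     // added before the existing children
        div.append("<p>Last</p>");       // added after the existing children

        Element span = doc.select("span").first();
        // The span ends up inside the innermost element of the wrapper HTML
        span.wrap("<li><a href='http://example.com/'></a></li>");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(edit().body().html());
    }
}
```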

Set the text content of the element

Method
Use the text setter methods of Element to modify the text content of an HTML document.

Explanation
The text setter methods mirror the HTML setter methods:

  • Element.text(String text) clears the inner HTML of the element and replaces it with the provided text.
  • Element.prepend(String first) and Element.append(String last) add content at the start and at the end of the element's inner HTML, respectively. They parse their argument as HTML; to add literal text, use prependText(String text) and appendText(String text).

Text passed to text(), prependText(), or appendText() that contains characters such as < or > is treated as text, not as HTML.
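
A small sketch of the text setters. It uses prependText and appendText from the method list earlier in this article, so that every argument is handled as literal text; the class name, method name, and starting markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SetTextSketch {

    // Builds a small document, rewrites the text of its div, and returns that div.
    static Element edit() {
        // Hypothetical starting markup: a div with one bold child
        Document doc = Jsoup.parse("<div><b>old</b></div>");
        Element div = doc.select("div").first();

        div.text("five > four");    // the <b> child is removed; ">" is kept as text
        div.prependText("First ");  // literal text added at the start
        div.appendText(" Last");    // literal text added at the end
        return div;
    }

    public static void main(String[] args) {
        Element div = edit();
        System.out.println(div.text());       // the plain text
        System.out.println(div.outerHtml());  // the markup, with ">" escaped as an entity
    }
}
```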

Origin blog.csdn.net/weixin_45468845/article/details/108563904