Java web crawler 02: parsing crawl results with jsoup

jsoup is a Java HTML parser that can parse a URL, a file, or a string of HTML directly. It provides a convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:

  1. Parse HTML from a URL, file, or string;
  2. Find and extract data using DOM traversal or CSS selectors;
  3. Manipulate HTML elements, attributes, and text.

Environment setup

Add the Maven dependencies

<!--Jsoup-->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.3</version>
</dependency>
<!--Testing-->
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
<!--Utilities-->
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.7</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.6</version>
</dependency>

Prepare a test file, dom.html, with the following content:

<html>
 <head> 
  <title>传智播客官网-一样的教育,不一样的品质</title> 
 </head> 
 <body>
	<div class="city">
		<h3 id="city_bj">北京中心</h3>
		<fb:img src="/2018czgw/images/slogan.jpg" class="slogan"/>
		<div class="city_in">
			<div class="city_con" style="display: none;">
				<ul>
					<li id="test" class="class_a class_b">
						<a href="http://www.itcast.cn" target="_blank">
							<span class="s_name">北京</span>
						</a>
					</li>
					<li>
						<a href="http://sh.itcast.cn" target="_blank">
							<span class="s_name">上海</span>
						</a>
					</li>
					<li>
						<a href="http://gz.itcast.cn" target="_blank">
							<span abc="123" class="s_name">广州</span>
						</a>
					</li>
					<ul>
						<li>天津</li>
					</ul>					
				</ul>
			</div>
		</div>
	</div>
 </body>
</html>

Create a document with jsoup

Create a document from a URL

jsoup can take a URL directly: it sends the request, fetches the response, and wraps the result in a Document object, as follows:

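import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public static void urlDomTests() throws Exception {
    // Fetch and parse the URL; the second argument is the timeout in milliseconds
    Document document = Jsoup.parse(new URL("http://www.itcast.cn/"), 1000);
    // Get the content of the title element
    Element title = document.getElementsByTag("title").first();
    System.out.println(title.text());
}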

Although jsoup can issue requests itself in place of HttpClient, this is rarely done in practice: a real crawler needs multithreading, connection pooling, proxies, and so on, and jsoup does not support these well. We therefore generally use jsoup only as an HTML parsing tool.
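
A minimal sketch of that division of labor, assuming the page body has already been downloaded by some other HTTP client. The hard-coded BODY string and the base URL http://www.example.com/ are placeholders invented for illustration, not part of the original project. Passing a base URI to Jsoup.parse lets jsoup resolve relative links without making any request itself:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ParseFetchedHtml {

    // Stand-in for a response body downloaded by HttpClient or similar (assumption)
    static final String BODY =
            "<html><head><title>Demo page</title></head>"
          + "<body><a href=\"/news/1.html\">News</a></body></html>";

    // Parse only: jsoup makes no network request here
    static Document parse(String body, String baseUri) {
        return Jsoup.parse(body, baseUri);
    }

    public static void main(String[] args) {
        Document document = parse(BODY, "http://www.example.com/");
        Element link = document.select("a").first();
        System.out.println(document.title());
        // absUrl resolves the relative href against the base URI
        System.out.println(link.absUrl("href"));
    }
}
```

Keeping the download step outside jsoup means the HTTP client can be swapped or tuned (pooling, proxies) without touching the parsing code.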

Create a document from a string or file

jsoup can parse a string or a file directly and wrap the result in a Document object, as follows:

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public static void fileStrTest() throws Exception {
    String html = FileUtils.readFileToString(new File("D:\\works\\ruoyi\\myspider\\src\\main\\java\\test\\dom.html"), "UTF-8");
    // Parse the string
    Document document = Jsoup.parse(html);
    // jsoup can also build a Document directly from a URL, a file, an input stream, and so on
    document = Jsoup.parse(new File("D:\\works\\ruoyi\\myspider\\src\\main\\java\\test\\dom.html"), "UTF-8");
    // Get the content of the title element
    Element title = document.getElementsByTag("title").first();
    System.out.println(title.text());
}
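
For readers without the D:\ path above, here is a self-contained sketch of the same idea. It parses one HTML string three ways: as a string, from a temporary file, and from an input stream. The HTML content, the temp-file name, and the class name are assumptions made for illustration only:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.jsoup.Jsoup;

class ParseSources {

    // Hypothetical sample page used by all three methods
    static final String HTML =
            "<html><head><title>Test page</title></head><body><p>Hello</p></body></html>";

    static String titleFromString() {
        return Jsoup.parse(HTML).title();
    }

    static String titleFromFile() throws Exception {
        // Write the sample to a temp file so the example does not depend on a fixed path
        File file = File.createTempFile("dom", ".html");
        file.deleteOnExit();
        Files.write(file.toPath(), HTML.getBytes(StandardCharsets.UTF_8));
        return Jsoup.parse(file, "UTF-8").title();
    }

    static String titleFromStream() throws Exception {
        try (InputStream in = new ByteArrayInputStream(HTML.getBytes(StandardCharsets.UTF_8))) {
            // Third argument is the base URI; left empty because there are no relative links
            return Jsoup.parse(in, "UTF-8", "").title();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titleFromString());
        System.out.println(titleFromFile());
        System.out.println(titleFromStream());
    }
}
```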

Parse the document

Get elements

  1. Get an element by id: getElementById
  2. Get elements by tag name: getElementsByTag
  3. Get elements by class name: getElementsByClass
  4. Get elements by attribute name: getElementsByAttribute
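
The four lookups above can be sketched on a small inline document. The HTML here is a trimmed, English stand-in for dom.html and is an assumption for illustration. Note that getElementById returns a single Element (or null when nothing matches), while the other three return an Elements collection:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ElementLookup {

    // Trimmed stand-in for dom.html (assumption)
    static final String HTML =
            "<div class=\"city\"><h3 id=\"city_bj\">Beijing Center</h3><ul>"
          + "<li class=\"class_a class_b\"><a href=\"http://www.itcast.cn\">"
          + "<span class=\"s_name\">Beijing</span></a></li>"
          + "<li><a href=\"http://sh.itcast.cn\">"
          + "<span abc=\"123\" class=\"s_name\">Shanghai</span></a></li>"
          + "</ul></div>";

    static Document document() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = document();

        Element byId = document.getElementById("city_bj");              // one element or null
        Elements byTag = document.getElementsByTag("li");               // every <li>
        Elements byClass = document.getElementsByClass("s_name");       // every element with class s_name
        Elements byAttribute = document.getElementsByAttribute("abc");  // every element that has an abc attribute

        System.out.println(byId.text());
        System.out.println(byTag.size());
        System.out.println(byClass.size());
        System.out.println(byAttribute.first().text());
    }
}
```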

Get data from an element

  1. Get the id: id()
  2. Get the class name: className()
  3. Get the value of an attribute: attr()
  4. Get all attributes: attributes()
  5. Get the text content: text()
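
A short sketch of these accessors on a single element. The markup, including the data-city attribute, is made up for illustration. One detail worth knowing is that attr returns an empty string, not null, when the attribute is absent:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;

class ElementData {

    // Hypothetical list item with an id, two classes, and one extra attribute
    static Element sample() {
        String html = "<ul><li id=\"test\" class=\"class_a class_b\" data-city=\"bj\">"
                    + "<span>Beijing</span></li></ul>";
        return Jsoup.parse(html).getElementById("test");
    }

    public static void main(String[] args) {
        Element li = sample();

        System.out.println(li.id());               // value of the id attribute
        System.out.println(li.className());        // the raw class attribute, space separated
        System.out.println(li.classNames());       // the same classes as a set
        System.out.println(li.attr("data-city"));  // value of a single attribute
        System.out.println(li.text());             // combined text of the element and its children

        // Iterate over every attribute on the element
        for (Attribute attribute : li.attributes()) {
            System.out.println(attribute.getKey() + " = " + attribute.getValue());
        }
    }
}
```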

Use selector syntax to find elements

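Selectors

tagname: find elements by tag name, e.g. span
#id: find elements by ID, e.g. #city_bj
.class: find elements by class name, e.g. .class_a
[attribute]: find elements that have the attribute, e.g. [abc]
[attr=value]: find elements by attribute value, e.g. [class=s_name]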
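
The basic selectors can be tried on an inline document, sketched below with made-up English markup standing in for dom.html. The last query illustrates a subtle difference: .class_a matches an element whose class list contains class_a, whereas [class=class_a] compares against the whole attribute value and so does not match class="class_a class_b":

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class BasicSelectors {

    // Trimmed stand-in for dom.html (assumption)
    static final String HTML =
            "<div class=\"city\"><h3 id=\"city_bj\">Beijing Center</h3><ul>"
          + "<li class=\"class_a class_b\"><span class=\"s_name\">Beijing</span></li>"
          + "<li><span abc=\"123\" class=\"s_name\">Shanghai</span></li>"
          + "</ul></div>";

    static Document document() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = document();
        System.out.println(document.select("span").size());             // by tag name
        System.out.println(document.select("#city_bj").text());         // by id
        System.out.println(document.select(".class_a").size());         // by class name
        System.out.println(document.select("[abc]").text());            // by attribute presence
        System.out.println(document.select("[class=s_name]").size());   // by attribute value
        System.out.println(document.select("[class=class_a]").size());  // whole-value comparison
    }
}
```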

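Selector combinations

el#id: element + ID, e.g. h3#city_bj
el.class: element + class, e.g. li.class_a
el[attr]: element + attribute name, e.g. span[abc]
Any combination: e.g. span[abc].s_name
ancestor child: find descendants of an element, e.g. .city_con li finds every li under "city_con"
parent > child: find direct children of a parent, e.g. .city_con > ul > li finds the ul elements that are direct children of city_con, then the li elements that are direct children of those ul elements
parent > *: find all direct children of a parent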
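
The difference between the descendant form (ancestor child) and the direct-child form (parent > child) is easiest to see with a nested list. The following sketch uses made-up markup in which one li contains a second ul; counting the matches shows which elements each selector reaches:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinedSelectors {

    // Hypothetical markup: the third li wraps a nested list
    static final String HTML =
            "<div class=\"city_con\"><ul>"
          + "<li class=\"class_a\">Beijing</li>"
          + "<li>Shanghai</li>"
          + "<li>Other<ul><li>Tianjin</li></ul></li>"
          + "</ul></div>";

    static Document document() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = document();

        // element + class
        System.out.println(document.select("li.class_a").text());

        // Descendant selector: every li anywhere under .city_con, nested ones included
        System.out.println(document.select(".city_con li").size());

        // Child selector: only li elements directly under a ul that is directly under .city_con
        System.out.println(document.select(".city_con > ul > li").size());

        // All direct children of .city_con (here just the outer ul)
        System.out.println(document.select(".city_con > *").size());
    }
}
```

With this markup the descendant selector also reaches the nested Tianjin item, while the child selector stops at the three top-level items.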

Test code

import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void documentOper() throws Exception {
    // jsoup can build a Document directly from a URL, a file, an input stream, and so on
    Document document = Jsoup.parse(new File("D:\\works\\ruoyi\\myspider\\src\\main\\java\\test\\dom.html"), "UTF-8");
    // Get an element by id
    Element ele1 = document.getElementById("test");
    System.out.println( "==============:"+ele1.toString());
    // Read the element's id, class, attributes, and text content
    System.out.println( "==============:"+ele1.id());
    System.out.println( "==============:"+ele1.className());
    System.out.println( "==============:"+ele1.attr("id"));
    System.out.println( "==============:"+ele1.text());

    // Get an element by tag name
    Element ele2 = document.getElementsByTag("title").first();
    System.out.println("==============:"+ ele2.toString());
    // Get an element by attribute name
    Element ele3 = document.getElementsByAttribute("href").first();
    System.out.println( "==============:"+ele3.toString());
    // Get an element by attribute name and value
    Element ele5 = document.getElementsByAttributeValue("abc","123").first();
    System.out.println( "==============:"+ele5.toString());
    // Get an element by class name
    Element ele4 = document.getElementsByClass("city").first();
    System.out.println( "==============:"+ele4.toString());

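    // Find elements with selectors
    /*
     tagname: find elements by tag name, e.g. span
     #id: find elements by ID, e.g. #city_bj
     .class: find elements by class name, e.g. .class_a
     [attribute]: find elements that have the attribute, e.g. [abc]
     [attr=value]: find elements by attribute value, e.g. [class=s_name]
    */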
    Elements span = document.select("span");
    for(Element ele:span){
        System.out.println(ele.text());
    }

    System.out.println("============>#city_bj:"+document.select("#city_bj").text());
    System.out.println("============>.class_a:"+document.select(".class_a").text());
    System.out.println("============>[abc]:"+document.select("[abc]").text());
    System.out.println("============>[class=s_name]:"+document.select("[class=s_name]").text());

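    /*
     Combining selectors
     el#id: element + ID, e.g. h3#city_bj
     el.class: element + class, e.g. li.class_a
     el[attr]: element + attribute name, e.g. span[abc]
     Any combination: e.g. span[abc].s_name
     ancestor child: find descendants of an element, e.g. .city_con li finds every li under "city_con"
     parent > child: find direct children of a parent, e.g.
     .city_con > ul > li finds the ul elements that are direct children of city_con, then the li elements that are direct children of those ul elements
     parent > *: find all direct children of a parent
    */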
    System.out.println("============>h3#city_bj:"+document.select("h3#city_bj").text());
    System.out.println("============>span[abc]:"+document.select("span[abc]").text());
    System.out.println("============>span[abc].s_name:"+document.select("span[abc].s_name").text());
    System.out.println("============>.city_con > ul > li:"+document.select(".city_con > ul > li").text());
    System.out.println("============>.city_con li:"+document.select(".city_con li").text());

}

Origin blog.csdn.net/zhangxm_qz/article/details/109444528