Java crawler (Jsoup and WebDriver)

1. Jsoup crawler

jsoup is a Java HTML parser that can parse HTML directly from a URL or from a string of HTML text. It provides a very convenient API for retrieving and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.

Take the Cnblogs homepage as an example.

1. Create a new Maven project in IDEA

Add the jsoup dependency to pom.xml:

<dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.12.1</version>
</dependency>

Jsoup code:

package com.blb;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class jsoup {

    public static void main(String[] args) {
        // Cnblogs home page url
        final String url = "https://www.cnblogs.com";
        try {
            // First get the HTML document of the entire page
            Document doc = Jsoup.connect(url).get();
            System.out.println(doc);
            // Specific elements can be selected from the HTML by tag name
            Elements title = doc.select("title");
            String t = title.text();
            System.out.println(t);
            // Specific elements can also be retrieved by element id
            Element site_nav_top = doc.getElementById("site_nav_top");
            String s = site_nav_top.text();
            System.out.println(s);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This approach has a significant limitation: a jsoup crawler is only suitable for static web pages, so it can only extract information that is present in the raw HTML of the requested page.
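For reference, a jsoup fetch is simply an HTTP GET followed by an HTML parse, and the connection can be configured before the request is sent. A minimal sketch (the user agent string and timeout below are illustrative values, not from the original post):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class JsoupFetch {
    public static void main(String[] args) throws IOException {
        // Fetches only the static HTML returned by the server; JavaScript is never executed
        Document doc = Jsoup.connect("https://www.cnblogs.com")
                .userAgent("Mozilla/5.0")   // illustrative user agent
                .timeout(10000)             // illustrative 10-second timeout
                .get();
        System.out.println(doc.title());
    }
}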

2. WebDriver technology

Selenium is a browser automation framework. It is mainly composed of three tools.
1. The first tool, Selenium IDE, is a Firefox extension that supports record-and-playback testing. The record/playback mode has limitations and is not suitable for many users.

2. The second tool, Selenium WebDriver, therefore provides APIs for a variety of programming languages, giving finer-grained control and allowing tests to be written according to standard software development practices.

3. The last tool, Selenium Grid, helps engineers use the Selenium API to control browser instances distributed across a set of machines, so that more tests can run concurrently.

Inside the project, they are called "IDE", "WebDriver" and "Grid" respectively.

What is WebDriver?
Introduction from the official website:
WebDriver is a clean, fast framework for automated testing of web apps.

WebDriver is developed per browser and replaces the JavaScript embedded in the web application under test. Tight integration with the browser supports the creation of more advanced tests and avoids the limitations imposed by the JavaScript security model. In addition to support from browser vendors, WebDriver uses operating-system-level calls to simulate user input. WebDriver supports Firefox (FirefoxDriver), IE (InternetExplorerDriver), Opera (OperaDriver) and Chrome (ChromeDriver). It also supports mobile application testing on Android (AndroidDriver) and iPhone (IPhoneDriver), and includes a headless implementation based on HtmlUnit called HtmlUnitDriver. The WebDriver API can be accessed from Python, Ruby, Java, and C#, so developers can create tests in their preferred programming language.

How WebDriver works

WebDriver is a W3C standard, hosted by Selenium.

The specific protocol standard can be viewed from http://code.google.com/p/selenium/wiki/JsonWireProtocol#Command_Reference.

From this protocol we can see that WebDriver is able to interact with the browser because the browser side implements the protocol. The protocol transmits JSON over HTTP.

Its implementation uses the classic client-server model: the client sends a request, and the server returns a response.

Let us clarify a few concepts.

Client

The machine that calls the WebDriver API.

Server

The machine running the browser. Firefox implements the WebDriver communication protocol directly, while Chrome and IE implement it through ChromeDriver and InternetExplorerDriver respectively.

Session

The server needs to maintain the browser's session. The request sent from the client carries the session information, and the server executes the operation on the corresponding browser page.

WebElement

An object in the WebDriver API that represents a DOM element on the page.
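For illustration, here is a short sketch of typical WebElement usage (the URL, selector, and driver path are placeholders, not taken from the original post):

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class WebElementDemo {
    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "chromedriver.exe"); // placeholder path
        WebDriver driver = new ChromeDriver();
        driver.get("https://www.example.com");

        // A WebElement represents a DOM element on the page
        WebElement link = driver.findElement(By.cssSelector("a"));
        System.out.println(link.getText());            // the element's visible text
        System.out.println(link.getAttribute("href")); // one of its attributes
        link.click();                                  // simulate a user click

        driver.quit();
    }
}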


Implementation:

1. Download the browser driver. For Chrome, the download address is http://chromedriver.storage.googleapis.com/index.html; pick the driver that matches your browser version.

2. Create a new Maven project in IDEA

Add the Selenium dependency to pom.xml:

 <dependency>
      <groupId>org.seleniumhq.selenium</groupId>
      <artifactId>selenium-java</artifactId>
      <version>3.141.59</version>
 </dependency>

Code:

package com.blb;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class chrome {

    public static void main(String[] args) {
        // Path to the downloaded chromedriver
        System.setProperty("webdriver.chrome.driver", "D:\\idea_workspace\\Jsoup\\src\\main\\chromedriver.exe");
        // Instantiate a ChromeDriver object
        WebDriver driver = new ChromeDriver();
        // URL of the 51job search results page
        String url = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E5%25A4%25A7%25E6%2595%25B0%25E6%258D%25AE%25E5%25B7%25A5%25E7%25A8%258B%25E5%25B8%2588,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=";
        // Open the specified website
        driver.get(url);
        // Parse the page
        String pageSource = driver.getPageSource();
        Document jsoup = Jsoup.parse(pageSource);
        // Define a selector rule
        String rule = "#resultList > div:nth-child(4) > p > span > a";
        // Get the element through the selector
        Elements select = jsoup.select(rule);
        String s = select.text();
        System.out.println(s);
        // Simulate a browser click
        driver.findElement(By.cssSelector(rule)).click();
    }
}
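One caveat: driver.get() returns once the initial document has loaded, so content rendered afterwards by JavaScript may not yet appear in getPageSource(). If the result list shows up late, an explicit wait helps. A minimal sketch, assuming the #resultList container from the rule above and an arbitrary 10-second timeout:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitThenParse {
    // Returns the page source only after the result list element is present in the DOM
    static String pageSourceAfterWait(WebDriver driver, String url) {
        driver.get(url);
        new WebDriverWait(driver, 10)
                .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("#resultList")));
        return driver.getPageSource();
    }
}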

Crawl movie resources:

package com.blb;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;


public class getMovie {

    private static final String url = "http://www.zuidazy5.com";

    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "D:\\idea_workspace\\Jsoup\\src\\main\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.get(url);
        String pageSource = driver.getPageSource();
        Document jsoup = Jsoup.parse(pageSource);
        String rule1 = "body > div.xing_vb > ul > li > span.xing_vb4 > a";
        Elements select = jsoup.select(rule1);
        // Traverse all movie detail entries on the current page
        for (Element e : select) {
            // Get the link to the movie detail page
            String href = e.attr("href");
            // Open each movie detail page
            driver.get(url + href);
            String pageSource2 = driver.getPageSource();
            Document jsoup2 = Jsoup.parse(pageSource2);

            // Define the selector rules for the movie information elements
            String mname = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodh > h2";
            String mpic = "body > div.warp > div:nth-child(1) > div > div > div.vodImg > img";
            String mdirector = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodinfobox > ul > li:nth-child(2) > span";
            String mactor = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodinfobox > ul > li:nth-child(3) > span";
            String marea = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodinfobox > ul > li:nth-child(5) > span";
            String mlanguage = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodinfobox > ul > li:nth-child(6) > span";
            String mshowtime = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodinfobox > ul > li:nth-child(7) > span";
            String mscore = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodh > label";
            String mtimelength = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodinfobox > ul > li:nth-child(8) > span";
            String mlastmodifytime = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodinfobox > ul > li:nth-child(9) > span";
            String minfo = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodinfobox > ul > li.cont > div > span.more";
            String mplayaddress1 = "#play_1 > ul > li";
            String mplayaddress2 = "#play_2 > ul > li";
            String msv = "body > div.warp > div:nth-child(1) > div > div > div.vodInfo > div.vodh > span";

            // Extract the element information
            String sv = jsoup2.select(msv).text();
            String name = jsoup2.select(mname).text();
            String pic = jsoup2.select(mpic).attr("src");
            String director = jsoup2.select(mdirector).text();
            String actor = jsoup2.select(mactor).text();
            String area = jsoup2.select(marea).text();
            String language = jsoup2.select(mlanguage).text();
            String showtime = jsoup2.select(mshowtime).text();
            String score = jsoup2.select(mscore).text();
            String timelength = jsoup2.select(mtimelength).text();
            String lastmodifytime = jsoup2.select(mlastmodifytime).text();
            String info = jsoup2.select(minfo).text();
            String playaddress1 = jsoup2.select(mplayaddress1).text();
            String playaddress2 = jsoup2.select(mplayaddress2).text();

            // Print the movie name
            System.out.println(name);
        }
    }
}
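Note that neither example quits the browser when it finishes, so each run leaves a chromedriver/Chrome process behind. A minimal sketch of the usual try/finally pattern (the path and URL simply reuse the values above):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class CrawlWithQuit {
    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "D:\\idea_workspace\\Jsoup\\src\\main\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("http://www.zuidazy5.com");
            // ... crawl and parse as in the examples above ...
        } finally {
            driver.quit(); // closes the browser and ends the driver process
        }
    }
}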

To hide the browser window during crawling, you can replace chromedriver.exe with the headless browser phantomjs.exe.

Download address: https://phantomjs.org/download.html
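As an alternative not covered above, newer Chrome versions can also run without a visible window via ChromeOptions. A minimal sketch, assuming Chrome 59+ and the same chromedriver setup:

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessChrome {
    public static void main(String[] args) {
        System.setProperty("webdriver.chrome.driver", "chromedriver.exe"); // placeholder path
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless"); // run Chrome without showing a window
        WebDriver driver = new ChromeDriver(options);
        driver.get("https://www.cnblogs.com");
        System.out.println(driver.getTitle());
        driver.quit();
    }
}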
