[Java-Crawler] Crawling dynamic pages (WebMagic, Selenium, ChromeDriver)


In the last article (I learned the WebMagic crawler framework), I mentioned that WebMagic can only parse static pages, which does not meet my crawling needs. Now I want to crawl dynamic pages, that is, pages as they look after their JavaScript has run.

1. Resources to download and dependencies to add

Resources

Without further ado, here are the resources and dependencies needed for this post.

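According to GPT's answer, the Google Chrome version corresponding to 114.0.5735.16 (the ChromeDriver version) should be 94.0.4606.61, so the Chrome version we download must be 94.0.4606.61. Note that ChromeDriver releases are normally paired with the Chrome release of the same major version, so this mapping is worth double-checking on the official ChromeDriver download page. Below is its download link.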

Link: https://pan.baidu.com/s/1eMnn-phueE5yZgCdoEQOwA?pwd=tk0w
Extraction code: tk0w

There are two places to download the driver: the ChromeDriver official website, or the ChromeDriver official download address. We choose the latter, because there you do not have to work out which ChromeDriver version corresponds to which Chrome version; GPT also said that downloads from the latter are more stable.
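Since the driver and the browser have to correspond, a small check on the two version strings can catch a mismatch before the crawler starts. The sketch below assumes the usual rule that ChromeDriver and Chrome share a major version; `VersionCheck` and its methods are hypothetical names for this example, not part of Selenium or ChromeDriver.

```java
// Hypothetical helper for illustration only; not part of Selenium or ChromeDriver.
final class VersionCheck {

    // Extract the major version, e.g. "114.0.5735.16" -> 114.
    static int majorVersion(String version) {
        String trimmed = version.trim();
        int dot = trimmed.indexOf('.');
        String major = dot < 0 ? trimmed : trimmed.substring(0, dot);
        return Integer.parseInt(major);
    }

    // True when the driver and the browser share the same major version.
    static boolean sameMajor(String driverVersion, String browserVersion) {
        return majorVersion(driverVersion) == majorVersion(browserVersion);
    }

    public static void main(String[] args) {
        // Version strings taken from the text above.
        System.out.println(sameMajor("114.0.5735.16", "94.0.4606.61"));
    }
}
```

If the check returns false, it is probably worth confirming the pairing on the download page before going further.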

Note: the downloaded driver archive contains chromedriver.exe. Put it in the C:\Windows\System32 directory so that it can be found when the code runs. That approach is rather limited, though; the more general way is to specify the location with System.setProperty("webdriver.chrome.driver", "<path to chromedriver.exe>");. On a local machine, the former is more convenient.
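As a minimal sketch of the System.setProperty approach, the helper below sets `webdriver.chrome.driver` from an explicit path or, failing that, from an environment variable. The class name `DriverPath` and the variable name `CHROMEDRIVER_PATH` are assumptions made up for this example; only the property name comes from the text above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only.
final class DriverPath {

    // System property that Selenium reads to locate chromedriver.
    static final String PROPERTY = "webdriver.chrome.driver";

    // Assumed environment variable name, chosen for this example only.
    static final String ENV_VAR = "CHROMEDRIVER_PATH";

    // Sets the property and returns the absolute path used,
    // or returns null when no path is given and the driver must be found on PATH.
    static String configure(String explicitPath) {
        String candidate = explicitPath != null ? explicitPath : System.getenv(ENV_VAR);
        if (candidate == null || candidate.isEmpty()) {
            return null;
        }
        Path path = Paths.get(candidate).toAbsolutePath();
        System.setProperty(PROPERTY, path.toString());
        return path.toString();
    }

    public static void main(String[] args) {
        // The relative path here is a placeholder; point it at your own chromedriver.exe.
        System.out.println(configure("drivers/chromedriver.exe"));
    }
}
```

Calling `DriverPath.configure(...)` before `new ChromeDriver()` keeps the path out of the crawler code, so the same program can run on machines where the driver lives elsewhere.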

Dependencies

Maven is used to create the test module; the required dependencies are given below.

First, since we use the WebMagic framework, add its dependencies: the core dependency and the extension dependency. They also use the commons-lang toolkit internally, so we add that too.

        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-extension</artifactId>
            <version>0.5.3</version>
        </dependency>

        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>0.5.3</version>
        </dependency>
        <!-- Utility package (StringUtils) -->
        <dependency>
            <groupId>commons-lang</groupId>
            <artifactId>commons-lang</artifactId>
            <version>2.6</version>
        </dependency>

Next, to get the source of the dynamically rendered page, we drive the browser with Selenium, which also needs to debug Google Chrome remotely. For developers to communicate with Chrome over the HTTP protocol, a remote debugging protocol is required: the Chrome DevTools Protocol, and the selenium-devtools-v86 dependency provides the integration with it. So add the following dependencies (selenium-devtools-v86 is a dependency of selenium-java, and the version numbers must match, otherwise it will not work):

        <!-- Very important -->
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-devtools-v86</artifactId>
            <version>4.0.0-beta-2</version>
        </dependency>

        <!-- Version must match devtools -->
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>4.0.0-beta-2</version>
        </dependency>

Finally, we need Guava, a powerful Java utility library.

        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>31.1-jre</version>
        </dependency>

At this point all the resources and dependencies are in place; next comes the hands-on part.

2. Actual code

  1. Configure, manage, and start the Spider container;
  2. Create a ChromeDriver object, which can be upcast to WebDriver or JavascriptExecutor as needed;
  3. Simulate opening the target web page; the URL string is available through page.getUrl().toString();
  4. Use the ChromeDriver object to obtain the corresponding WebElement object;
  5. Get the raw HTML string with webElement.getAttribute("outerHTML") and construct an Html object from it; the rest works the same as with plain WebMagic.
    ...
    Close and quit the ChromeDriver.

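import org.openqa.selenium.By;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
import us.codecraft.webmagic.selector.Html;

public class CompanyProcessor implements PageProcessor {

    // Crawl settings: retry count, retry interval, delay between pages, timeout
    private Site site = Site.me()
            .setRetryTimes(3)
            .setRetrySleepTime(3000)
            .setSleepTime(1000)
            .setTimeOut(3000);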

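    @Override
    public void process(Page page) {
        // Create the ChromeDriver instance
        ChromeDriver driver = new ChromeDriver();
        try {
            // Simulate typing the URL into the browser and pressing Enter
            driver.get(page.getUrl().toString());
            sleep(1000);

            // Get the body element
            WebElement webElement = driver.findElement(By.tagName("body"));
            // Simulate a click: some HTML only appears after user interaction
            // and could not be crawled otherwise
            WebElement element = webElement.findElement(By.cssSelector("span[event-type='15']"));
            element.click();

            // Not sure whether the multi-threaded run is the reason, but we have to wait
            // a while after the click, otherwise the HTML it produces is not there yet
            sleep(2000);

            // Get the raw HTML string of body; it only contains what is under this webElement
            String str = webElement.getAttribute("outerHTML");

            // Convert the string into an Html object; its constructor uses Jsoup
            // to parse it and initialize the Html object's internal document field
            Html html = new Html(str);
            System.out.println(html.xpath("//tbody/tr").all());
        } finally {
            // Quit the driver even if a step above throws; quit() also closes all windows
            driver.quit();
        }
    }

    // Sleep helper that restores the interrupt flag instead of swallowing it
    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }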

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new CompanyProcessor())
                .addUrl("https://we.51job.com/pc/search?keyword=java&searchType=2&sortType=0&metro=")
                .thread(5)
                .run();
    }
}
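
The fixed Thread.sleep calls in the processor either wait too long or not long enough, depending on how fast the page renders. An alternative is to poll until the content is there, which is the idea behind Selenium's WebDriverWait. The sketch below shows that idea in plain Java so it runs without a browser; `PollingWait` is a hypothetical helper written for this example, not a Selenium class.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; Selenium ships WebDriverWait for real use.
final class PollingWait {

    // Re-check the condition every pollMillis until it holds or timeoutMillis has elapsed.
    // Returns true if the condition became true in time, false otherwise.
    static boolean until(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the table rows have been rendered": becomes true after about 300 ms.
        boolean ready = until(() -> System.currentTimeMillis() - start >= 300, 2000, 50);
        System.out.println("ready = " + ready);
    }
}
```

In the processor, the condition could be something like `() -> !driver.findElements(By.xpath("//tbody/tr")).isEmpty()`, assuming the rows of interest only exist once the click has taken effect.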

Test results

Job links and job-related information are already available.
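
To see what the `//tbody/tr` query in the processor selects, here is a small stand-alone sketch. It swaps in the JDK's built-in XPath API on a well-formed snippet rather than WebMagic's Html class, so it runs without extra dependencies; the sample markup and the `RowExtractor` name are made up for this example, and unlike the processor it returns each row's text rather than its HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; uses the JDK XPath API, not WebMagic.
final class RowExtractor {

    // Returns the trimmed text of every <tr> directly under a <tbody>.
    // The input must be well-formed XML, which real-world HTML often is not.
    static List<String> rowTexts(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList rows = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//tbody/tr", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                result.add(rows.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample data standing in for the rendered job table.
        String sample = "<body><table><tbody>"
                + "<tr><td>Java Developer</td></tr>"
                + "<tr><td>Backend Engineer</td></tr>"
                + "</tbody></table></body>";
        System.out.println(rowTexts(sample)); // prints [Java Developer, Backend Engineer]
    }
}
```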


Origin blog.csdn.net/qq_63691275/article/details/130839969