[Java crawler] HtmlUnit learning summary

Environment setup

Maven dependency

    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.15</version>
    </dependency>

1. Basic use

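    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    final WebClient webClient = new WebClient(); // Create the client
    final HtmlPage page = webClient.getPage("https://www.baidu.com"); // Fetch the page
    System.out.println(page.asText()); // asText() returns the text content of the page
    webClient.closeAllWindows(); // Close all windows opened by this client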
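
To list every link on the page, iterate over its anchors:

    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import java.util.List;

    // page is the HtmlPage obtained above
    List<HtmlAnchor> achList = page.getAnchors();
    for (HtmlAnchor ach : achList) {
        System.out.println(ach.getHrefAttribute()); // Print the href of each link
    }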
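
The values printed by getHrefAttribute() are the href attributes as written in the page, so they may be relative paths, fragments, or javascript: pseudo-links. A crawler usually needs absolute URLs before it can follow them. The following is a minimal sketch of that step using only java.net.URI; the class name LinkResolver and the method resolveAll are hypothetical helpers introduced here for illustration, not part of HtmlUnit.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of HtmlUnit): turns raw href values into absolute URLs.
final class LinkResolver {

    // Resolves each href against baseUrl, skipping empty values,
    // fragment-only links, javascript: pseudo-links and malformed hrefs.
    static List<String> resolveAll(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> resolved = new ArrayList<String>();
        for (String href : hrefs) {
            String trimmed = href == null ? "" : href.trim();
            if (trimmed.isEmpty()
                    || trimmed.startsWith("#")
                    || trimmed.startsWith("javascript:")) {
                continue; // Nothing to follow
            }
            try {
                resolved.add(base.resolve(trimmed).toString());
            } catch (IllegalArgumentException e) {
                // Malformed href: skip it rather than abort the whole crawl
            }
        }
        return resolved;
    }
}
```

In the loop above, each ach.getHrefAttribute() value could be collected into a list and passed to resolveAll together with the URL that was given to getPage.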
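
Two things to keep in mind:

1. HtmlUnit's support for JavaScript is not very good; pages with complex scripts often produce errors or warnings.
2. HtmlUnit's support for CSS is not very good either.

When only the page content is needed, both can be switched off:
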
    final WebClient webClient = new WebClient();
    webClient.getOptions().setCssEnabled(false); // Disable CSS
    webClient.getOptions().setJavaScriptEnabled(false); // Disable JavaScript
    final HtmlPage page = webClient.getPage("https://www.baidu.com");
    System.out.println(page.asText());
    webClient.closeAllWindows();
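
Putting the snippets of this section together, a complete class might look like the sketch below. It assumes HtmlUnit 2.15 as declared in the Maven dependency (package com.gargoylesoftware.htmlunit); the class name SimpleCrawler and the methods newStaticClient and fetchLinks are hypothetical names chosen for this example. The try/finally block is there so that closeAllWindows() still runs if getPage throws.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.util.ArrayList;
import java.util.List;

// Hypothetical example class; assumes HtmlUnit 2.15 on the classpath.
final class SimpleCrawler {

    // Creates a client with CSS and JavaScript processing switched off.
    static WebClient newStaticClient() {
        WebClient webClient = new WebClient();
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(false);
        return webClient;
    }

    // Fetches the page at url and returns the raw href of every anchor on it.
    static List<String> fetchLinks(String url) throws Exception {
        WebClient webClient = newStaticClient();
        try {
            HtmlPage page = webClient.getPage(url);
            List<String> hrefs = new ArrayList<String>();
            for (HtmlAnchor anchor : page.getAnchors()) {
                hrefs.add(anchor.getHrefAttribute());
            }
            return hrefs;
        } finally {
            // Runs even if getPage fails
            webClient.closeAllWindows();
        }
    }

    public static void main(String[] args) throws Exception {
        for (String href : fetchLinks("https://www.baidu.com")) {
            System.out.println(href);
        }
    }
}
```

Running main needs network access to the target site. Creating the client in one place keeps the CSS and JavaScript settings consistent across every page the crawler fetches.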
 

1.1 Emulate a specific browser

    // Emulate the Chrome browser; for other browsers, use a different BrowserVersion.xxx constant
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
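
To check which browser a client is emulating, one option is to read back its user agent string. The sketch below assumes HtmlUnit 2.15; the class name BrowserEmulationDemo and the method userAgentOf are hypothetical names for this example.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;

// Hypothetical example class; assumes HtmlUnit 2.15 on the classpath.
final class BrowserEmulationDemo {

    // Returns the User-Agent string a client built for the given version reports.
    static String userAgentOf(BrowserVersion version) {
        WebClient webClient = new WebClient(version);
        try {
            return webClient.getBrowserVersion().getUserAgent();
        } finally {
            webClient.closeAllWindows();
        }
    }

    public static void main(String[] args) {
        System.out.println(userAgentOf(BrowserVersion.CHROME));
    }
}
```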

 
