[Jsoup] Using HtmlUnit + Jsoup to parse web pages generated dynamically by JavaScript

Copyright:   bluetata   [email protected]
Address of this article:   http://blog.csdn.net/dietime1943/article/details/79035779
Please indicate the source/author for reprints

When using Jsoup, you will find that it cannot parse HTML that is generated dynamically by JavaScript. This question comes up often in the Jsoup discussion group. The solution in this article is to combine HtmlUnit and Jsoup to parse such dynamic pages.

Typically, the content that a page loads dynamically through JavaScript is its key data: for example, years of experience, salary, and other sensitive information on a résumé site; news and announcements that some sites load dynamically; or the update time that some sites generate when a snapshot is taken. All of these may exist only in DOM elements that JavaScript creates after the initial HTML has loaded.

Jsoup focuses on parsing HTML and offers a jQuery-like API for doing so quickly. It does not position itself as a browser simulator, so the lack of JavaScript support is not, as some people claim, a shortcoming of Jsoup; it is simply outside its scope. I hope everyone makes this clear when explaining it to colleagues or writing blog posts. For windowless (headless) access there are many excellent open source frameworks, such as HttpClient, HtmlUnit (introduced in this article), and Selenium.
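
To make the "jQuery-like API" concrete, here is a minimal sketch of selecting elements with a CSS selector in Jsoup. The class name `SelectorDemo`, the method `headlines`, the sample HTML, and the `example.com` base URL are all made up for illustration; only the Jsoup calls (`Jsoup.parse`, `select`, `text`, `absUrl`) are real API.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of the original article.
final class SelectorDemo {

    // Returns "link text -> absolute URL" for every link inside the #news list.
    static List<String> headlines(String html, String baseUri) {
        // The base URI lets Jsoup resolve relative href values.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        // CSS selector, much like $("#news li > a[href]") in jQuery.
        for (Element link : doc.select("#news li > a[href]")) {
            result.add(link.text() + " -> " + link.absUrl("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<ul id=\"news\">"
                + "<li><a href=\"/a/1\">First</a></li>"
                + "<li><a href=\"b/2\">Second</a></li>"
                + "</ul>";
        for (String line : headlines(html, "https://example.com/blog/")) {
            System.out.println(line);
        }
    }
}
```

The selector works on whatever HTML string Jsoup is given, which is why the same calls can be applied later to markup produced by another tool.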

As of Jsoup 1.10.4, Jsoup cannot execute JavaScript or load the content it generates. Its author, Jonathan Hedley, has stated: "Javascript is not supported. Jsoup parses HTML." HtmlUnit and Selenium are both open source frameworks originally built for testing, and Selenium can itself drive HtmlUnit as a headless backend (through its HtmlUnitDriver), so this article goes straight to HtmlUnit for handling HTML that is loaded dynamically by JavaScript.

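The limitation can be reproduced without any network access. The following is a small sketch, assuming a made-up page in which an inline script fills an empty `div`. The class name `NoScriptDemo`, the method `textOf`, and the element id `salary` are hypothetical. Jsoup keeps the script as data but never runs it, so the `div` stays empty.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class, not part of the original article.
final class NoScriptDemo {

    // Returns the text of the element with the given id, or null if it is missing.
    static String textOf(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element element = doc.getElementById(id);
        return element == null ? null : element.text();
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<div id=\"salary\"></div>"
                + "<script>document.getElementById('salary').textContent = '10k';</script>"
                + "</body></html>";

        // A browser would show "10k" here; Jsoup only sees the empty div.
        System.out.println("salary text: [" + textOf(html, "salary") + "]");

        // The script source is still available as raw data, just not executed.
        System.out.println(Jsoup.parse(html).select("script").first().data());
    }
}
```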

If you build the project with Maven, add the dependency below. If you import the jar directly, download it from the official website, or join the author's Jsoup discussion group, where the jar and the API documentation are available in the group files.

    <!-- https://mvnrepository.com/artifact/net.sourceforge.htmlunit/htmlunit -->
    <dependency>
        <groupId>net.sourceforge.htmlunit</groupId>
        <artifactId>htmlunit</artifactId>
        <version>2.29</version>
    </dependency>

Sample code:

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.logging.Level;

    import org.apache.commons.logging.LogFactory;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HtmlUnitJsoupDemo {

        public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {

            // Silence the log output of HtmlUnit and its HTTP client
            LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
            java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
            java.util.logging.Logger.getLogger("org.apache.http.client").setLevel(Level.OFF);

            String url = "https://bluetata.com/";
            System.out.println("Loading page now-----------------------------------------------: " + url);

            String pageAsXml;
            // HtmlUnit simulates the browser; try-with-resources closes the client when done
            try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
                webClient.getOptions().setJavaScriptEnabled(true);              // enable the JS interpreter (true by default)
                webClient.getOptions().setCssEnabled(false);                    // disable CSS support
                webClient.getOptions().setThrowExceptionOnScriptError(false);   // do not throw on JS runtime errors
                webClient.getOptions().setThrowExceptionOnFailingStatusCode(false); // do not throw on failing HTTP status codes
                webClient.getOptions().setTimeout(10 * 1000);                   // connection timeout in milliseconds
                HtmlPage page = webClient.getPage(url);
                webClient.waitForBackgroundJavaScript(30 * 1000);               // wait up to 30 seconds for background JS to finish

                pageAsXml = page.asXml();
            }

            // Parse the rendered markup with Jsoup
            Document doc = Jsoup.parse(pageAsXml, url);
            Elements pngs = doc.select("img[src$=.png]");                       // all image elements whose src ends in .png
            // Other operations omitted here
            System.out.println(doc.toString());
        }
    }
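
The sample relies on a single call to `waitForBackgroundJavaScript`, which returns once the background jobs finish or the timeout expires. When you need one particular piece of content, a common alternative is to poll for a condition until a deadline. The helper below is a generic sketch of that idea using only the JDK. The class `Polling` and the method `waitUntil` are hypothetical names, not part of HtmlUnit or Jsoup.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HtmlUnit or Jsoup.
final class Polling {

    private Polling() {
    }

    /**
     * Re-checks the condition every pollMillis until it holds or until
     * timeoutMillis has elapsed. Returns true if the condition held in time,
     * false on timeout or if the waiting thread was interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the element has appeared": becomes true after about 200 ms.
        boolean met = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2_000, 50);
        System.out.println("condition met: " + met);
    }
}
```

With HtmlUnit, the condition would be a check that the element you need is present in the page, which is an assumption about your target site rather than anything the sample above defines.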

 Jsoup study and discussion QQ group: 50695115

 Jsoup crawler code samples and source code from this blog: https://github.com/bluetata/crawler-jsoup-maven

 For more Jsoup related articles, please refer to the column: [Jsoup in action]

Note: This article was originally published on blog.csdn.net by bluetata. Please be sure to indicate the source when reprinting.
