Using HtmlUnit + Jsoup to parse dynamic pages

Copyright: unrestricted. Source: https://blog.csdn.net/qq_32662595/article/details/88189584

When using Jsoup, you will run into HTML pages whose content is generated dynamically by JavaScript and which Jsoup alone cannot resolve.
The solution is to use HtmlUnit + Jsoup to parse such dynamic pages.

Jsoup's focus is parsing HTML: it parses HTML quickly through a jQuery-like API and does not position itself as a browser simulator. Some people criticize Jsoup for lacking browser simulation, so I point this out whenever I explain Jsoup to colleagues or write a blog post. For simulating a browser without a window (headless access), there are many excellent open-source frameworks, such as HttpClient, HtmlUnit (introduced today), or Selenium.

As of Jsoup 1.10.4, Jsoup does not load content generated dynamically by JavaScript. The author's official response is: "Javascript is not supported. Jsoup parses HTML."
HtmlUnit and Selenium are both open-source frameworks built for testing, and Selenium can itself run on top of HtmlUnit (through its HtmlUnit driver), so this article uses HtmlUnit directly to handle HTML that is loaded dynamically by JavaScript.
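To make the limitation concrete, here is a minimal sketch, assuming Jsoup is on the classpath. The HTML string, the class name `JsoupStaticParseDemo` and the helper `contentText` are made up for illustration. The page fills a `<div>` from an inline script; because Jsoup only parses the markup and never runs the script, the `<div>` stays empty.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupStaticParseDemo {

    // Hypothetical page: the <div> is empty in the markup and would only be
    // filled in by the inline script when a browser (or HtmlUnit) runs it.
    static final String HTML =
            "<html><body>"
          + "<div id=\"content\"></div>"
          + "<script>document.getElementById('content').innerHTML = '<p>loaded by JS</p>';</script>"
          + "</body></html>";

    // Returns the text Jsoup sees inside #content. Jsoup does not execute
    // the script, so this reflects the static markup only.
    static String contentText(String html) {
        Document doc = Jsoup.parse(html);
        Element content = doc.getElementById("content");
        return content.text();
    }

    public static void main(String[] args) {
        // The div text is empty because the script was never executed.
        System.out.println("Text seen by Jsoup: '" + contentText(HTML) + "'");
        // The script itself is kept in the tree as an ordinary element.
        System.out.println("Script elements in the tree: " + Jsoup.parse(HTML).select("script").size());
    }
}
```

Rendering the same page in HtmlUnit first and passing the resulting markup to Jsoup is what closes this gap.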

〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜

If you build the project with Maven, use the configuration below. If you add the jar directly instead, download it from the official website; the jar package and the API help documentation are also available for download in the group files.

<!-- https://mvnrepository.com/artifact/net.sourceforge.htmlunit/htmlunit -->
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.29</version>
</dependency>

Sample code:

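import java.io.IOException;
import java.net.MalformedURLException;
import java.util.logging.Level;

import org.apache.commons.logging.LogFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitJsoupDemo { // class name is illustrative

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {

        // Suppress log output from HtmlUnit and related libraries
        LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
        java.util.logging.Logger.getLogger("org.apache.http.client").setLevel(Level.OFF);

        String url = "https://bluetata.com/";
        System.out.println("Loading page now-----------------------------------------------: " + url);

        // HtmlUnit simulates the browser; try-with-resources closes the client when done
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true);              // enable the JS interpreter (default is true)
            webClient.getOptions().setCssEnabled(false);                    // disable CSS support
            webClient.getOptions().setThrowExceptionOnScriptError(false);   // do not throw when a script fails
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setTimeout(10 * 1000);                   // timeout in milliseconds
            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(30 * 1000);               // wait up to 30 s for background JS

            String pageAsXml = page.asXml();

            // Parse the rendered markup with Jsoup
            Document doc = Jsoup.parse(pageAsXml, url);
            Elements pngs = doc.select("img[src$=.png]");                   // all image elements whose src ends in .png
            // other operations omitted here
            System.out.println(doc.toString());
        }
    }
}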
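The sample leaves the follow-up work on the selected images out. One possible next step, sketched here under assumptions, is to collect the absolute URL of each PNG. The class name `PngLinkExtractor`, the helper `pngUrls` and the inline HTML are hypothetical; in the real flow the HTML would come from `page.asXml()`. The sketch relies on Jsoup's `abs:` attribute prefix, which resolves a relative `src` against the base URI passed to `Jsoup.parse`.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PngLinkExtractor {

    // Collects the absolute URLs of all <img> elements whose src ends in ".png".
    static List<String> pngUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src$=.png]")) {
            // "abs:src" resolves a relative src against the base URI.
            urls.add(img.attr("abs:src"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Stand-in for page.asXml(); this markup is made up for illustration.
        String html = "<html><body>"
                + "<img src=\"/img/logo.png\"/>"
                + "<img src=\"photo.jpg\"/>"
                + "<img src=\"https://cdn.example.com/a.png\"/>"
                + "</body></html>";
        for (String url : pngUrls(html, "https://bluetata.com/")) {
            System.out.println(url);
        }
    }
}
```

Passing the page URL as the base URI is what makes relative image paths usable afterwards, which is why the sample hands it to `Jsoup.parse` as the second argument.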
