When using Jsoup, you will run into HTML pages whose content is generated dynamically by JavaScript, which Jsoup cannot resolve. The solution is to combine HtmlUnit and Jsoup to parse such dynamic pages.
Jsoup focuses on parsing HTML: it parses HTML quickly through a jQuery-like API and does not position itself as a browser simulator. It is therefore unfair, at this stage, to fault Jsoup for lacking browser simulation and the like, a point I make whenever I explain this to colleagues or write a blog post. For simulating a browser or accessing pages without a window, there are many excellent open source frameworks, such as HttpClient, HtmlUnit (introduced today), or Selenium.
As of Jsoup-1.10.4, content loaded dynamically by JavaScript is still not supported. The author's official response is: "Javascript is not supported. Jsoup parses HTML."
HtmlUnit and Selenium are both open source frameworks built for testing, and Selenium can itself run on top of HtmlUnit through its HtmlUnitDriver. This article therefore uses HtmlUnit directly to handle HTML that is loaded dynamically by JavaScript.
〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜〜
If you build the project with Maven, refer to the configuration below. If you add the jar to the project directly instead, download it from the official website; the jar package and the API help documentation are also available for download in the group files.
<!-- https://mvnrepository.com/artifact/net.sourceforge.htmlunit/htmlunit -->
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.29</version>
</dependency>
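The sample code below also calls Jsoup, so Jsoup has to be on the classpath as well. A possible Maven entry is sketched here; the version number is only an assumption for illustration, so substitute whichever Jsoup release you actually use.

```xml
<!-- Jsoup dependency; the version below is an assumed example, not taken from the original article -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.3</version>
</dependency>
```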
Sample code:
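import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.apache.commons.logging.LogFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// The class name is arbitrary; any class with this main method will do.
public class HtmlUnitJsoupDemo {
    public static void main(String[] args) throws IOException {
        // Silence logging from HtmlUnit and its dependencies
        LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
        Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
        Logger.getLogger("org.apache.http.client").setLevel(Level.OFF);
        String url = "https://bluetata.com/";
        System.out.println("Loading page now-----------------------------------------------: " + url);
        // HtmlUnit simulates the browser; try-with-resources closes the client when done
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            webClient.getOptions().setJavaScriptEnabled(true); // enable the JS engine (true by default)
            webClient.getOptions().setCssEnabled(false); // disable CSS support
            webClient.getOptions().setThrowExceptionOnScriptError(false); // do not throw on JS runtime errors
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false); // do not throw on failing HTTP status codes
            webClient.getOptions().setTimeout(10 * 1000); // connection timeout, in milliseconds
            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(30 * 1000); // wait up to 30 seconds for background JS to finish
            String pageAsXml = page.asXml();
            // Parse the rendered markup with Jsoup; the URL serves as the base URI for relative links
            Document doc = Jsoup.parse(pageAsXml, url);
            Elements pngs = doc.select("img[src$=.png]"); // all img elements whose src ends in .png
            // Other processing omitted here
            System.out.println(doc.toString());
        }
    }
}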
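The sample above gives waitForBackgroundJavaScript a ceiling of 30 seconds. When you know which piece of content you are waiting for, one alternative is to poll for it and stop as soon as it shows up. The following is only a minimal sketch of that idea using the standard library alone: the class name PollingWait and the method waitUntil are hypothetical helpers, not part of HtmlUnit or Jsoup, and the condition in main is a stand-in for a real check against the page.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HtmlUnit or Jsoup.
public class PollingWait {

    /**
     * Checks the condition repeatedly, sleeping intervalMillis between checks,
     * until it holds or timeoutMillis has elapsed.
     * Returns true if the condition held in time, false on timeout or interruption.
     */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in condition: becomes true on the third check.
        // With HtmlUnit this would be a check against the loaded page instead.
        int[] checks = {0};
        boolean ready = waitUntil(() -> ++checks[0] >= 3, 2_000, 50);
        System.out.println("ready=" + ready + " after " + checks[0] + " checks");
        // prints "ready=true after 3 checks"
    }
}
```

In the sample code, the condition could test whether the element you need has appeared in the page, with the timeout playing the role of the 30-second ceiling. The trade-off is that you must know in advance what to look for, whereas waitForBackgroundJavaScript needs no knowledge of the page.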