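Many web pages load their data asynchronously, and jsoup on its own only sees the static HTML returned by the server. To read data that appears only after asynchronous loading, you need HtmlUnit, which executes the page's JavaScript. Without further ado, here is the code!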
Maven configuration:
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
<version>1.2</version>
</dependency>
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.2</version>
</dependency>
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.25</version>
</dependency>
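You can also download everything as a single package: https://download.csdn.net/download/songanshu/10619303
The example fetches the number of Tencent QQ users currently online. It is for learning and reference only.
Run the code directly: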
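import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.apache.commons.logging.LogFactory;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.logging.Level;

// The class name is arbitrary; any name works.
public class QqOnlineCount {
    public static void main(String[] args) throws IOException {
        // Target URL
        String url = "https://im.qq.com";
        // Silence logging, otherwise it wastes I/O
        LogFactory.getFactory().setAttribute("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.NoOpLog");
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
        java.util.logging.Logger.getLogger("org.apache.http.client").setLevel(Level.OFF);
        // Build the WebClient; try-with-resources closes it when done
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // Enable JavaScript, skip CSS and ActiveX
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            webClient.getOptions().setActiveXNative(false);
            // Do not abort on script errors or non-2xx status codes
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            // Connection timeout in milliseconds
            webClient.getOptions().setTimeout(5000);
            HtmlPage rootPage = webClient.getPage(url);
            // Give background JavaScript up to 2000 ms to finish
            webClient.waitForBackgroundJavaScript(2000);
            // Hand the rendered page to jsoup for querying
            String html = rootPage.asXml();
            Document document = Jsoup.parse(html);
            // Look up the online-user count
            Elements elements1 = document.select("#cur_online");
            if (elements1.isEmpty()) {
                System.err.println("Element #cur_online not found; the page layout may have changed.");
                return;
            }
            String value = elements1.get(0).text();
            // Strip the thousands separators, e.g. 123,456,789 becomes 123456789
            String newValue = value.replaceAll(",", "");
            Date day = new Date();
            SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            // long leaves headroom if the count ever exceeds the int range
            long re = Long.parseLong(newValue);
            System.out.println(df.format(day) + "----->" + re);
        }
    }
}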
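The last step above, turning the comma-grouped text into a number and printing it with a timestamp, can be tried out without any network access. Below is a minimal sketch using only the JDK. The class name `OnlineCountParser` and the helpers `parseGroupedCount` and `formatLine` are hypothetical names made up for this illustration (they are not part of jsoup or HtmlUnit), and the sample value is invented rather than a real QQ figure.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper class for illustration; not part of jsoup or HtmlUnit.
public class OnlineCountParser {

    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Parses a comma-grouped count such as "123,456,789" into a long. */
    static long parseGroupedCount(String text) {
        if (text == null) {
            throw new IllegalArgumentException("count text is null");
        }
        // Drop surrounding whitespace and the thousands separators
        String digits = text.trim().replace(",", "");
        // Reject anything that is not purely ASCII digits (e.g. an empty element)
        if (digits.isEmpty() || !digits.chars().allMatch(c -> c >= '0' && c <= '9')) {
            throw new IllegalArgumentException("not a grouped number: " + text);
        }
        return Long.parseLong(digits);
    }

    /** Builds the same "timestamp----->count" line the scraper prints. */
    static String formatLine(LocalDateTime when, long count) {
        return TIMESTAMP.format(when) + "----->" + count;
    }

    public static void main(String[] args) {
        // Made-up sample value standing in for the text of #cur_online
        long count = parseGroupedCount("123,456,789");
        System.out.println(formatLine(LocalDateTime.now(), count));
    }
}
```

Validating the text before parsing means an empty or unexpected `#cur_online` value fails with a clear message instead of a bare `NumberFormatException`.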