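From the HtmlUnit website:
HtmlUnit is a Java-based browser program without a graphical interface. It models HTML documents and provides an API that lets developers work as if they were using a normal browser: fetching page content, filling in forms, clicking links, and so on.
It has very good JavaScript support, which is still being improved, and it can handle quite complex AJAX libraries. Depending on the configuration, it simulates Chrome, Firefox, or IE.
This article walks through scraping a football-lottery site as a way to get familiar with HtmlUnit.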
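import java.io.FileWriter;
import java.io.IOException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.WebResponse;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.util.WebConnectionWrapper;

WebClient wc = new WebClient(BrowserVersion.FIREFOX_38);
wc.getOptions().setJavaScriptEnabled(true);            // enable the JavaScript interpreter (default: true)
wc.setJavaScriptTimeout(100000);                       // JavaScript execution timeout, in milliseconds
wc.getOptions().setCssEnabled(false);                  // disable CSS support
wc.getOptions().setThrowExceptionOnScriptError(false); // do not throw when a script fails
wc.getOptions().setTimeout(10000);                     // connection timeout: 10 s; 0 means wait indefinitely
wc.setAjaxController(new NicelyResynchronizingAjaxController()); // AJAX support

// Intercept responses so they can be modified before HtmlUnit processes them
wc.setWebConnection(new WebConnectionWrapper(wc) {
    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        ......
    }
});

HtmlPage page = wc.getPage("http://XXXX.com/");

// Get the page as XML and save it to a file
String str = page.asXml();
FileWriter fileWriter = new FileWriter("D:\\text.html");
fileWriter.write(str);

// Close the WebClient and the file
wc.close();
fileWriter.close();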
Fixing garbled data
The site's data is loaded dynamically by JavaScript, and the scripts use two different encodings:
<script language="javascript" src="XXX.js" charset="gb2312"></script>
<script language="javascript" src="XXX.js" charset="utf-8"></script>
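You can override the getResponse method of WebConnectionWrapper to modify the response before HtmlUnit uses it.
For example, to re-encode the response for bfdata.js from GBK to UTF-8: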
import com.gargoylesoftware.htmlunit.WebResponseData;

wc.setWebConnection(new WebConnectionWrapper(wc) {
    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("bfdata.js")) {
            // Decode the body as GBK, then rebuild the response with UTF-8 bytes
            String content = response.getContentAsString("GBK");
            WebResponseData data = new WebResponseData(
                    content.getBytes("UTF-8"),
                    response.getStatusCode(),
                    response.getStatusMessage(),
                    response.getResponseHeaders());
            response = new WebResponse(data, request, response.getLoadTime());
        }
        return response;
    }
});
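The re-encoding step inside that wrapper can be tried on its own, without HtmlUnit. The following is a minimal sketch using only the JDK; the class name CharsetFix, the method reencodeToUtf8, and the sample script text are made up for illustration and are not part of HtmlUnit or of the site being scraped.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetFix {
    // Hypothetical helper: decode bytes with the charset the server actually
    // used, then re-encode the text as UTF-8.
    static byte[] reencodeToUtf8(byte[] raw, Charset sourceCharset) {
        String text = new String(raw, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Sample payload: a script line containing two Chinese characters,
        // written with Unicode escapes so the source file encoding does not matter.
        byte[] gbkBytes = "var team = '\u4e2d\u56fd';".getBytes(gbk);

        // Reading GBK bytes as UTF-8 garbles the Chinese characters.
        String garbled = new String(gbkBytes, StandardCharsets.UTF_8);
        // Re-encoding first gives bytes that a UTF-8 reader decodes correctly.
        String fixed = new String(reencodeToUtf8(gbkBytes, gbk), StandardCharsets.UTF_8);

        System.out.println("garbled: " + garbled);
        System.out.println("fixed:   " + fixed);
    }
}
```

The design point is that the bytes must be decoded with the charset they were really written in; re-encoding only helps when the consumer then reads them as UTF-8.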
Fixing "Content is not allowed in prolog"
Error output:
Jun 21, 2016 4:15:06 PM com.gargoylesoftware.htmlunit.xml.XmlPage <init>
WARNING: Failed parsing XML document http://XXX/vbsxml/goalBf3.xml?r=0071466496906000: Content is not allowed in prolog.
Jun 21, 2016 4:15:06 PM com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine handleJavaScriptException
INFO: Caught script exception
======= EXCEPTION START ========
EcmaError: lineNumber=[41] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[http://XXX/common2.js] message=[TypeError: Cannot read property "childNodes" from null (http://XXX/common2.js#41)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "childNodes" from null (http://XXX/common2.js#41)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:865)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:747)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1032)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:395)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:276)
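The warning "Content is not allowed in prolog" is the cause of the script exception that follows it: the XML document fails to parse, so the script ends up reading "childNodes" from null. The warning itself occurs because the content being parsed contains a BOM (byte order mark). The BOM is invisible when you look at the text, but it is present in the stream.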
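The effect of a leading BOM can be reproduced outside HtmlUnit with the JDK's own XML parser. This is only a sketch under assumptions: the class name BomProlog, the helper parses, and the sample XML are invented for illustration, and HtmlUnit's internal parsing path may differ in detail.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class BomProlog {
    // Hypothetical helper: returns true if the JDK XML parser accepts the text.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // The parser rejects any non-whitespace character before the root element.
            System.out.println("parse failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        String clean = "<c><h>1</h></c>";
        // U+FEFF is what a UTF-8 BOM becomes once the bytes are decoded to text.
        String withBom = "\uFEFF" + clean;

        System.out.println("clean parses:    " + parses(clean));
        System.out.println("with BOM parses: " + parses(withBom));
    }
}
```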
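So the following code can be used to cut out just the content you need, keeping only the text from the first "<" to the last ">":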
import com.gargoylesoftware.htmlunit.WebResponseData;

wc.setWebConnection(new WebConnectionWrapper(wc) {
    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("goalBf3.xml")) {
            String content = response.getContentAsString("UTF-8");
            // Only rebuild the response when there is content to work with
            if (content != null && !content.isEmpty()) {
                int start = content.indexOf('<');
                int end = content.lastIndexOf('>');
                // Drop everything before the first '<' (including the BOM) and after the last '>'
                if (start != -1 && end > start) {
                    content = content.substring(start, end + 1);
                }
                WebResponseData data = new WebResponseData(
                        content.getBytes("UTF-8"),
                        response.getStatusCode(),
                        response.getStatusMessage(),
                        response.getResponseHeaders());
                response = new WebResponse(data, request, response.getLoadTime());
            }
        }
        return response;
    }
});
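The trimming logic from the wrapper above can be pulled out into a small function and checked in isolation. A minimal sketch, using only the JDK; XmlTrim and trimToMarkup are hypothetical names chosen for this example.

```java
class XmlTrim {
    // Hypothetical helper: keep only the text from the first '<' to the last '>'.
    // Input without such a pair (or null/empty input) is returned unchanged.
    static String trimToMarkup(String content) {
        if (content == null || content.isEmpty()) {
            return content;
        }
        int start = content.indexOf('<');
        int end = content.lastIndexOf('>');
        if (start != -1 && end > start) {
            return content.substring(start, end + 1);
        }
        return content;
    }

    public static void main(String[] args) {
        // A decoded BOM (U+FEFF) in front of the XML and a trailing newline after it.
        String dirty = "\uFEFF<c><h>1</h></c>\n";
        String clean = trimToMarkup(dirty);
        System.out.println(clean.length() + " chars: " + clean);
    }
}
```

Note that this removes anything before the first "<", not just a BOM, so it only suits responses that are expected to be pure markup.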
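Calling JavaScript functions on the page
Some of the site's data only appears when the mouse hovers over an element.
We can run JavaScript with page.executeJavaScript to trigger it.
For example: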
HtmlPage page = wc.getPage("http://xxx.com/");

// Wait up to 30 s for background JavaScript to finish
wc.waitForBackgroundJavaScript(30 * 1000);

// Fire the element's onmouseover handler directly
ScriptResult result = page.executeJavaScript("document.getElementById('pk_1248827').onmouseover(window.event)");
HtmlPage jspage = (HtmlPage) result.getNewPage();