Crawling JS, Flash, and AJAX pages (JRex)

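While crawling you run into many pages that are unfriendly to crawlers: JavaScript, AJAX, Flash and so on. While struggling with such pages I found JRex, which uses the Firefox engine (Gecko) to render the page and handles these problems well.
However, JRex is no longer maintained; the latest version was released in 2005.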

"JRex" is a Java Browser Component with set of API's for Embedding Mozilla GECKO within a Java Application.

I. Installation
URL: http://jrex.mozdev.org/

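1. Unzip jrex_gre.zip into the C:\jrex_gre directory.

2. Then copy the files from jrex-bin-log-1.0b1_dom3.zip into the C:\jrex_gre directory.

3. Run run.bat directly to see a Java browser implemented with JRex. It works quite well.

Note: JAVA_HOME here must point to a JRE, not a JDK; otherwise jawt.dll will not be found. For example, the java executable used would be: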

"C:\Program Files\Java\jre1.5.0_06/bin/java"


II. Programming

Goal: the equivalent of "View Generated Source" in Firefox, i.e. the page source after scripts have run.

The code is as follows:


import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;

import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.mozilla.jrex.JRexFactory;
import org.mozilla.jrex.event.progress.ProgressEvent;
import org.mozilla.jrex.navigation.WebNavigation;
import org.mozilla.jrex.navigation.WebNavigationConstants;
import org.mozilla.jrex.ui.JRexCanvas;
import org.mozilla.jrex.window.JRexWindowManager;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class Render implements org.mozilla.jrex.event.progress.ProgressListener {

    // Set by the progress callback on another thread while parsePage polls it,
    // so it is declared volatile.
    volatile boolean done = false;

    public boolean parsePage(String url) throws Exception {
        System.setProperty("jrex.browser.usesetupflags", "true");
        System.setProperty("jrex.browser.allow.images", "false"); // do not load images
        System.setProperty("jrex.browser.allow.plugin", "false"); // do not load Flash

        // The JRexCanvas is the main browser component. The WebNavigation
        // object is used to access the DOM.
        JRexCanvas canvas = null;
        WebNavigation navigation = null;

        // Start up JRex/Gecko.
        JRexFactory.getInstance().startEngine();

        // Get a window manager and put the browser in a Swing frame.
        // Based on Dietrich Kappe's code.
        JRexWindowManager winManager = (JRexWindowManager) JRexFactory
                .getInstance().getImplInstance(JRexFactory.WINDOW_MANAGER);
        winManager.create(JRexWindowManager.SINGLE_WINDOW_MODE);
        JPanel panel = new JPanel();
        JFrame frame = new JFrame();
        frame.getContentPane().add(panel);
        winManager.init(panel);

        // Get the JRexCanvas, register this Render for progress events so
        // we can tell when the page is loaded, and get the WebNavigation object.
        canvas = (JRexCanvas) winManager.getBrowserForParent(panel);
        canvas.addProgressListener(this);
        navigation = canvas.getNavigator();

        // Load the page.
        navigation.loadURI(url, WebNavigationConstants.LOAD_FLAGS_NONE, null,
                null, null);

        // Swing magic.
        frame.setSize(640, 480);
        frame.setVisible(false);

        // Check every two seconds whether the DOM has loaded.
        while (!done) {
            Thread.sleep(2000);
        }

        // Get the DOM and serialize it once, starting from the root element.
        Document doc = navigation.getDocument();
        Element ex = doc.getDocumentElement();
        String html = xmlToString(ex);

        // Write the generated source to a file as UTF-8.
        File file = new File("d:\\youtube.html");
        OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(file), "utf-8");
        try {
            writer.write(html);
        } finally {
            writer.close();
        }

        System.out.println(html);
        return true;
    }

    // Serializes a DOM node (and its subtree) to an HTML string.
    public static String xmlToString(Node node) throws Exception {
        Source source = new DOMSource(node);
        StringWriter stringWriter = new StringWriter();
        Result result = new StreamResult(stringWriter);
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        transformer.transform(source, result);
        return stringWriter.getBuffer().toString();
    }

    /**
     * onStateChange is invoked several times once DOM loading is complete.
     * Setting the done flag again on later calls is harmless.
     */
    public void onStateChange(ProgressEvent event) {
        if (!event.isLoadingDocument()) {
            done = true;
        }
    }

    public static void main(String[] args) throws Exception {
        //String url = "http://www.youtube.com/watch?v=XOHE2KsmdGg";
        //String url = "http://www.cnn.com";
        String url = "http://www.56.com/u42/v_MzY2NTYxNjc.html";
        //String url = "http://ilovelate.blog.163.com";

        Render p = new Render();
        p.parsePage(url);
        System.exit(0);
    }

    // The remaining ProgressListener callbacks are not needed here.
    public void onLinkStatusChange(ProgressEvent event) {
    }

    public void onLocationChange(ProgressEvent event) {
    }

    public void onProgressChange(ProgressEvent event) {
    }

    public void onSecurityChange(ProgressEvent event) {
    }

    public void onStatusChange(ProgressEvent event) {
    }
}
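
For a crawler, the usual next step after rendering is to pull the links out of the DOM. The following is only a sketch of that step, not part of the original Render class: LinkExtractor and its methods are hypothetical names, and it runs against a DOM built by the JDK's own XML parser so that it can be tried without JRex. With the Document returned by navigation.getDocument() the same W3C DOM calls should apply, but whether Gecko reports tag names in upper or lower case is something to verify first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the href values of all <a> elements in a DOM.
class LinkExtractor {

    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<String>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            // getAttribute returns an empty string when the attribute is absent.
            String href = anchor.getAttribute("href");
            if (href.length() > 0) {
                links.add(href);
            }
        }
        return links;
    }

    // Stand-in for the rendered page: parses well-formed XHTML with the JDK parser.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new RuntimeException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><a href=\"/a\">A</a>"
                + "<a>no href</a><a href=\"/b\">B</a></body></html>");
        System.out.println(extractLinks(doc));
    }
}
```

The hrefs come back exactly as written in the page, so relative links still need to be resolved against the page URL before they are queued for crawling.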

Running this code requires the following VM arguments:
-Djrex.dom.enable=true
-Djrex.gre.path=c:\jrex_gre
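
Forgetting one of these two arguments is an easy mistake, so it can help to check them before the engine is started. The following is a minimal sketch under that assumption; JRexPreflight and findProblems are hypothetical names and are not part of JRex. It only reads the two properties named above and checks that the GRE path is an existing directory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical startup check for the two VM arguments listed above.
class JRexPreflight {

    static List<String> findProblems(Properties props) {
        List<String> problems = new ArrayList<String>();
        if (!"true".equals(props.getProperty("jrex.dom.enable"))) {
            problems.add("jrex.dom.enable is not set to true");
        }
        String grePath = props.getProperty("jrex.gre.path");
        if (grePath == null || !new File(grePath).isDirectory()) {
            problems.add("jrex.gre.path is not an existing directory: " + grePath);
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> problems = findProblems(System.getProperties());
        for (String problem : problems) {
            System.err.println(problem);
        }
        if (problems.isEmpty()) {
            System.out.println("VM arguments look fine");
        }
    }
}
```

Calling findProblems(System.getProperties()) at the top of main and stopping when the list is not empty gives a readable message instead of a failure later on.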

Remember to change the output file in File file = new File("d:\\youtube.html"); to a path that exists on your machine.

Set the environment variables:
JAVA_HOME = C:\Java\jre1.5.0   (the JRE directory, not the JDK directory)
JREX_GRE_PATH=c:\jrex_gre    
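
Because a JDK path in JAVA_HOME is the mistake to avoid here, a quick check can catch it early. The following is a rough sketch, not something JRex provides: JavaHomeCheck and looksLikeJdk are hypothetical names, and the heuristic (a JDK ships the javac compiler in its bin directory, a plain JRE does not) is an assumption that suits the Java versions of that era.

```java
import java.io.File;

// Hypothetical check: does this Java home look like a JDK rather than a JRE?
class JavaHomeCheck {

    static boolean looksLikeJdk(File javaHome) {
        File bin = new File(javaHome, "bin");
        // javac.exe on Windows, javac elsewhere.
        return new File(bin, "javac.exe").isFile() || new File(bin, "javac").isFile();
    }

    public static void main(String[] args) {
        String home = System.getenv("JAVA_HOME");
        if (home == null) {
            System.err.println("JAVA_HOME is not set");
        } else if (looksLikeJdk(new File(home))) {
            System.err.println("JAVA_HOME looks like a JDK: " + home);
        } else {
            System.out.println("JAVA_HOME has no javac, so it is probably a JRE: " + home);
        }
    }
}
```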

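Limitations and issues
Render is a simple example of using JRex, not a complete treatment. I use a subclass of Render when mining web pages and it works well, but the pages I tested were all well-behaved ones.
I use an event listener to decide when the page has finished loading. Render's parsePage method checks the done flag every two seconds. If the page cannot be loaded, the loop never ends.
Also, while the embedded browser is loading, the browser window is shown until loading succeeds. I did not deal with this because my mining task does not need the browser window.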
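
One way to avoid the endless loop described above is to wait on a latch with a timeout instead of polling a flag. The following is a minimal sketch of that idea, not the original author's code: LoadSignal is a hypothetical helper, and the assumption is that onStateChange would call markLoaded() while parsePage would call awaitLoaded(...) and give up when it returns false.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical replacement for the done flag: a one-shot signal with a timeout.
class LoadSignal {

    private final CountDownLatch loaded = new CountDownLatch(1);

    // Called from the progress listener; extra calls have no further effect.
    void markLoaded() {
        loaded.countDown();
    }

    // Returns true if the page was reported loaded within the timeout.
    boolean awaitLoaded(long timeoutMillis) {
        try {
            return loaded.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        final LoadSignal signal = new LoadSignal();
        // Simulates the listener firing on another thread.
        new Thread(new Runnable() {
            public void run() {
                signal.markLoaded();
            }
        }).start();
        System.out.println("loaded in time: " + signal.awaitLoaded(5000));
    }
}
```

Compared with sleeping two seconds at a time, the waiting thread wakes as soon as the listener fires, and a page that never finishes loading costs at most the chosen timeout.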


Reposted from wangwei3.iteye.com/blog/806700