Nutch: An Excellent Open-Source Web Crawling Framework

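Nutch is an excellent open-source web crawling framework. With only a little configuration it can crawl data for us, and it also provides a flexible plugin mechanism, so we can extend it at any time to fit our own needs. In this post I (Sanxian) will first show how to debug Nutch in local mode inside Eclipse. Once you understand how it behaves in Eclipse, everything else becomes much easier to learn. Most people currently drive Nutch from the command line, which is simple and convenient but makes deep customization difficult, so this post covers the basics of debugging and compiling Nutch.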

Let's get to the point and look at the basic steps first.

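| No. | Step | Description |
|-----|------|-------------|
| 1 | Install and set up Ant | Used to compile the Nutch source |
| 2 | Download the Nutch source | Required |
| 3 | Run ant in the Nutch source root and wait for the build to finish | Builds Nutch |
| 4 | Configure nutch-site.xml | Required |
| 5 | Run ant eclipse to generate the Eclipse project | Import into Eclipse for debugging |
| 6 | Move the conf directory to the top | Nutch reads its configuration files when it loads |
| 7 | Run org.apache.nutch.crawl.Injector to inject the seed URLs | Local debugging |
| 8 | Run org.apache.nutch.crawl.Generator to generate a fetch list | Local debugging |
| 9 | Run org.apache.nutch.fetcher.Fetcher to fetch the pages on the generated list | Local debugging |
| 10 | Run org.apache.nutch.parse.ParseSegment to parse the fetched content in the segment | Local debugging |
| 11 | Set up the Solr service | Provides search queries |
| 12 | Run org.apache.nutch.indexer.IndexingJob to index into Solr | Local debugging |
| 13 | Once indexing completes, run queries in Solr | Verify the results |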

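After the build finishes, import the project into Eclipse. The figure shows the imported project; note that the conf folder is moved to the top: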

The configuration in nutch-site.xml is as follows:

XML code

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>http.agent.name</name>
  <value>mynutch</value>
</property>


<property>
  <name>http.robots.agents</name>
  <value>mynutch,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

</configuration>
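Before launching anything in Eclipse, it can save time to confirm that the file parses and that http.agent.name is actually set, since the Fetcher refuses to run without an agent name. The following is only a minimal sketch using the JDK's DOM parser; the class name SiteConfigCheck and the conf/nutch-site.xml path relative to the project root are my assumptions, and Nutch itself loads the file through Hadoop's Configuration class, not like this.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper: reads name/value pairs from a Hadoop-style config file. */
class SiteConfigCheck {

    /** Parses the XML and returns the properties in document order. */
    static Map<String, String> readProperties(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                props.put(name, childText(property, "value"));
            }
        }
        return props;
    }

    /** Returns the trimmed text of the first child element with this tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed location: the conf folder of the imported Eclipse project.
        try (InputStream in = Files.newInputStream(Paths.get("conf", "nutch-site.xml"))) {
            Map<String, String> props = readProperties(in);
            String agent = props.get("http.agent.name");
            if (agent == null || agent.isEmpty()) {
                System.err.println("http.agent.name is not set");
            }
            props.forEach((name, value) -> System.out.println(name + " = " + value));
        }
    }
}
```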

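Below is a short walkthrough of the changes each class needs before you run it. Nutch is being debugged here in Hadoop's local mode, so on Windows you first have to relax Hadoop's file-permission check, otherwise the run fails with an error. A simple way to do that is to copy Hadoop's FileUtil class (org.apache.hadoop.fs.FileUtil) into the Eclipse project and modify its permission check. If you are running on Linux, you can ignore this issue.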

Before you start debugging, create a urls folder in the project root and add a new seed file to it containing the URLs you want to crawl.
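If you would rather script this step than create the folder by hand, a small sketch like the one below does the job. The class name SeedFileSetup, the file name seed.txt, and the example.com URL are placeholders of mine; as far as I know the Injector reads every file in the directory, so the file name itself is not significant.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: creates the urls/ seed directory used by the Injector. */
class SeedFileSetup {

    /** Writes one URL per line into projectRoot/urls/seed.txt and returns the file. */
    static Path writeSeeds(Path projectRoot, List<String> urls) throws IOException {
        Path urlsDir = projectRoot.resolve("urls");
        Files.createDirectories(urlsDir); // does nothing if the folder already exists
        Path seedFile = urlsDir.resolve("seed.txt");
        Files.write(seedFile, urls, StandardCharsets.UTF_8);
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; replace it with the sites you actually want to crawl.
        Path seed = writeSeeds(Paths.get("."), Arrays.asList("http://example.com/"));
        System.out.println("Seed file written to " + seed.toAbsolutePath());
    }
}
```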

In the Injector class, change the run method to:

Java code

  public int run(String[] args) throws Exception {
//    if (args.length < 2) {
//      System.err.println("Usage: Injector <crawldb> <url_dir>");
//      return -1;
//    }
    // Hard-coded for local debugging: <crawldb> <url_dir>
    args = new String[] {"mydir", "urls"};
    try {
      inject(new Path(args[0]), new Path(args[1]));
      return 0;
    } catch (Exception e) {
      LOG.error("Injector: " + StringUtils.stringifyException(e));
      return -1;
    }
  }

In Generator, change the run method to:

Java code

public int run(String[] args) throws Exception {
//    if (args.length < 2) {
//      System.out
//          .println("Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm][-maxNumSegments num]");
//      return -1;
//    }
	  
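    // Hard-coded for local debugging: <crawldb> <segments_dir>.
    // The trailing "6", "7" and "" match none of the options parsed below, so they are ignored.
    args = new String[] {"mydir", "myseg", "6", "7", ""};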

    Path dbDir = new Path(args[0]);
    Path segmentsDir = new Path(args[1]);
    long curTime = System.currentTimeMillis();
    long topN = Long.MAX_VALUE;
    int numFetchers = -1;
    boolean filter = true;
    boolean norm = true;
    boolean force = false;
    int maxNumSegments = 1;

    for (int i = 2; i < args.length; i++) {
      if ("-topN".equals(args[i])) {
        topN = Long.parseLong(args[i + 1]);
        i++;
      } else if ("-numFetchers".equals(args[i])) {
        numFetchers = Integer.parseInt(args[i + 1]);
        i++;
      } else if ("-adddays".equals(args[i])) {
        long numDays = Integer.parseInt(args[i + 1]);
        curTime += numDays * 1000L * 60 * 60 * 24;
      } else if ("-noFilter".equals(args[i])) {
        filter = false;
      } else if ("-noNorm".equals(args[i])) {
        norm = false;
      } else if ("-force".equals(args[i])) {
        force = true;
      } else if ("-maxNumSegments".equals(args[i])) {
        maxNumSegments = Integer.parseInt(args[i + 1]);
      }

    }

    try {
      Path[] segs = generate(dbDir, segmentsDir, numFetchers, topN, curTime, filter,
          norm, force, maxNumSegments);
      if (segs == null) return -1;
    } catch (Exception e) {
      LOG.error("Generator: " + StringUtils.stringifyException(e));
      return -1;
    }
    return 0;
  }

In Fetcher, change the run method as follows:

Java code

  public int run(String[] args) throws Exception {

    String usage = "Usage: Fetcher <segment> [-threads n]";
 
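    // Hard-coded for local debugging: <segment> [-threads n].
    // The thread count only takes effect when it follows the -threads flag.
    args = new String[] {"D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541", "-threads", "4"};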
//    if (args.length < 1) {
//      System.err.println(usage);
//      return -1;
//    }

    Path segment = new Path(args[0]);

    int threads = getConf().getInt("fetcher.threads.fetch", 10);
    boolean parsing = false;

    for (int i = 1; i < args.length; i++) {       // parse command line
      if (args[i].equals("-threads")) {           // found -threads option
        threads =  Integer.parseInt(args[++i]);
      }
    }

    getConf().setInt("fetcher.threads.fetch", threads);

    try {
      fetch(segment, threads);
      return 0;
    } catch (Exception e) {
      LOG.error("Fetcher: " + StringUtils.stringifyException(e));
      return -1;
    }

  }
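The segment path above is hard-coded, and Generator creates a new timestamp-named directory (such as 20140520120541) on every run, so the path has to be edited each time. As one possible convenience, the sketch below picks the newest segment under a segments directory. The class LatestSegment is hypothetical and not part of Nutch; it relies only on the 14-digit timestamp naming visible in the path above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: finds the most recent segment directory. */
class LatestSegment {

    /** Returns the newest 14-digit timestamp directory, or null if there is none. */
    static Path find(Path segmentsDir) throws IOException {
        Path newest = null;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(segmentsDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                // Skip anything that is not a directory named like 20140520120541.
                if (!Files.isDirectory(entry) || !name.matches("\\d{14}")) {
                    continue;
                }
                // Fixed-width timestamps sort chronologically as plain strings.
                if (newest == null || name.compareTo(newest.getFileName().toString()) > 0) {
                    newest = entry;
                }
            }
        }
        return newest;
    }

    public static void main(String[] args) throws IOException {
        // "myseg" is the segments directory passed to Generator earlier in this post.
        Path segment = find(Paths.get("myseg"));
        System.out.println(segment == null ? "No segment found; run Generator first" : segment.toString());
    }
}
```

The returned path could then replace the hard-coded string in the args array for Fetcher, ParseSegment and IndexingJob.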

In ParseSegment, change the run method as follows:

Java code

 public int run(String[] args) throws Exception {
    Path segment;

    String usage = "Usage: ParseSegment segment [-noFilter] [-noNormalize]";

//    if (args.length == 0) {
//      System.err.println(usage);
//      System.exit(-1);
//    }

    // Hard-coded for local debugging: the segment that Fetcher just filled.
    args = new String[] {"D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541"};
    if(args.length > 1) {
      for(int i = 1; i < args.length; i++) {
        String param = args[i];

        if("-nofilter".equalsIgnoreCase(param)) {
          getConf().setBoolean("parse.filter.urls", false);
        } else if ("-nonormalize".equalsIgnoreCase(param)) {
          getConf().setBoolean("parse.normalize.urls", false);
        }
      }
    }

    segment = new Path(args[0]);
    parse(segment);
    return 0;
  }

In IndexingJob, change the run method as follows:

Java code

  public int run(String[] args) throws Exception {
        // Hard-coded for local debugging: <crawldb> <segment>
        args = new String[] {"mydir", "D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541"};
        if (args.length < 2) {
            System.err
                    .println("Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]");
            IndexWriters writers = new IndexWriters(getConf());
            System.err.println(writers.describe());
            return -1;
        }

        final Path crawlDb = new Path(args[0]);
        Path linkDb = null;

        final List<Path> segments = new ArrayList<Path>();
        String params = null;

        boolean noCommit = false;
        boolean deleteGone = false;
        boolean filter = false;
        boolean normalize = false;

        for (int i = 1; i < args.length; i++) {
            if (args[i].equals("-linkdb")) {
                linkDb = new Path(args[++i]);
            } else if (args[i].equals("-dir")) {
                Path dir = new Path(args[++i]);
                FileSystem fs = dir.getFileSystem(getConf());
                FileStatus[] fstats = fs.listStatus(dir,
                        HadoopFSUtil.getPassDirectoriesFilter(fs));
                Path[] files = HadoopFSUtil.getPaths(fstats);
                for (Path p : files) {
                    segments.add(p);
                }
            } else if (args[i].equals("-noCommit")) {
                noCommit = true;
            } else if (args[i].equals("-deleteGone")) {
                deleteGone = true;
            } else if (args[i].equals("-filter")) {
                filter = true;
            } else if (args[i].equals("-normalize")) {
                normalize = true;
            } else if (args[i].equals("-params")) {
                params = args[++i];
            } else {
                segments.add(new Path(args[i]));
            }
        }

        try {
            index(crawlDb, linkDb, segments, noCommit, deleteGone, params,
                    filter, normalize);
            return 0;
        } catch (final Exception e) {
            LOG.error("Indexer: " + StringUtils.stringifyException(e));
            return -1;
        }
    }

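In addition, you need to add the following code at line 187 of SolrIndexWriter and at line 54 of SolrUtils, respectively, to change the Solr address that the index is sent to:

Java code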

String serverURL = conf.get(SolrConstants.SERVER_URL);
serverURL = "http://localhost:8983/solr/";

Java code

// String serverURL = job.get(SolrConstants.SERVER_URL);
String serverURL = "http://localhost:8983/solr";
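Both snippets overwrite whatever the configuration holds, and they spell the address two ways (with and without a trailing slash). A gentler variant, sketched below, would fall back to the local address only when nothing is configured, and would normalize the trailing slash. SolrUrlResolver is a hypothetical helper of mine, not a Nutch class; inside Nutch the configured argument would be the value returned by conf.get(SolrConstants.SERVER_URL).

```java
/** Hypothetical helper: chooses the Solr URL without discarding a configured value. */
class SolrUrlResolver {

    /** Local Solr address used in this post. */
    static final String DEFAULT_URL = "http://localhost:8983/solr";

    /** Returns the configured URL without trailing slashes, or the default if it is missing. */
    static String resolve(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return DEFAULT_URL;
        }
        String url = configured.trim();
        while (url.endsWith("/")) {
            url = url.substring(0, url.length() - 1);
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));
        System.out.println(resolve("http://localhost:8983/solr/"));
    }
}
```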

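Follow the steps above, changing the run arguments of each class before you execute it. Nutch jobs depend on one another: the input of one job is usually the output of the previous one. After manually modifying and running the five classes above in order, the index is finally built in Solr; the screenshot shows the result.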
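To make that dependency chain easier to see, the sketch below lists the five classes in run order together with the positional arguments each one receives in this post. CrawlSteps is a hypothetical helper that only builds and prints the plan; it does not invoke Nutch, and the segment path is whatever directory Generator produced on your machine.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: the five tools in order, with the arguments used in this post. */
class CrawlSteps {

    /** Maps each tool's class name to its arguments, preserving execution order. */
    static Map<String, String[]> plan(String crawlDb, String urlDir, String segmentsDir, String segment) {
        Map<String, String[]> steps = new LinkedHashMap<>();
        // Injector reads the seed folder and writes the crawldb.
        steps.put("org.apache.nutch.crawl.Injector", new String[] {crawlDb, urlDir});
        // Generator reads the crawldb and writes a new segment under segmentsDir.
        steps.put("org.apache.nutch.crawl.Generator", new String[] {crawlDb, segmentsDir});
        // Fetcher and ParseSegment both work on the segment Generator produced.
        steps.put("org.apache.nutch.fetcher.Fetcher", new String[] {segment});
        steps.put("org.apache.nutch.parse.ParseSegment", new String[] {segment});
        // IndexingJob reads the crawldb plus the parsed segment and sends documents to Solr.
        steps.put("org.apache.nutch.indexer.IndexingJob", new String[] {crawlDb, segment});
        return steps;
    }

    public static void main(String[] args) {
        Map<String, String[]> steps = plan("mydir", "urls", "myseg", "myseg/20140520120541");
        steps.forEach((tool, toolArgs) -> System.out.println(tool + " " + Arrays.toString(toolArgs)));
    }
}
```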

Of course, we can also configure a tokenization strategy to make our searches more general and more accurate.
