The Definitive Fix for Incremental Indexing (Recrawl) in the Nutch Open-Source Search Engine

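This post explains how to run a recrawl with the Nutch open-source search engine on a Hadoop distributed cluster, which is how Nutch's incremental-indexing problem gets solved. None of the articles I found through Google explained the whole process in detail. After some painful digging, I finally arrived at a working solution.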

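First, write a recrawl shell script that matches your own Nutch deployment. Note: if the index lives on the local filesystem, the script calls ordinary bash commands such as rm and cp; if the index lives on HDFS, it has to call hadoop dfs -rmr or hadoop dfs -cp instead, and it should also check the result that each of those commands returns. Once the script is written, run it directly or put it in crontab to run on a schedule.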
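
Checking the result of each command can be sketched as a small wrapper. This is a minimal sketch, not part of the original script: run_step is a hypothetical name, and it assumes the wrapped command reports failure through a non-zero exit status.

```shell
#!/bin/bash
# Hypothetical helper: run one command, report a failure on stderr,
# and pass the command's exit status back to the caller.
run_step() {
  "$@"
  local status=$?
  if [ "$status" -ne 0 ]; then
    echo "Step failed (exit $status): $*" >&2
    return "$status"
  fi
}

# Example use inside a recrawl script (the path is a placeholder):
#   run_step hadoop dfs -rmr /user/nutch/crawl/newindexes || exit 1
```

The caller decides what a failure means: a cleanup step might only log it, while a failed generate or fetch step should usually stop the run.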

There is a wiki page that provides such a shell script:
http://wiki.apache.org/nutch/IntranetRecrawl#head-93eea6620f57b24dbe3591c293aead539a017ec7

After downloading it, I happily dropped it into nutch/bin and ran:
/nutch/search/bin/recrawl /nutch/tomcat/webapps/cse /user/nutch/crawl10 10 31
The arguments are, in order: tomcat_servlet_home (the Tomcat servlet directory), the Nutch crawl directory on HDFS, the depth (10), and adddays (31).
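
Since depth and adddays are both numbers, a script can reject bad values before any MapReduce job starts. The following is a minimal sketch; is_uint and check_recrawl_numbers are hypothetical helpers that do not appear in the wiki script.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the argument is a non-negative integer.
is_uint() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical check for the depth ($1) and adddays ($2) arguments.
check_recrawl_numbers() {
  is_uint "$1" || { echo "depth must be a non-negative integer: '$1'" >&2; return 1; }
  is_uint "$2" || { echo "adddays must be a non-negative integer: '$2'" >&2; return 1; }
}

# Example: check_recrawl_numbers 10 31 || exit 1
```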

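The script reported errors while running, roughly that the mergesegs_dir directory could not be found, but the MapReduce jobs kept going, so I did not pay much attention and let it finish. When it was done, the index had not grown at all, and an extra mergesegs_dir had appeared under the nutch directory. At that point I started reading recrawl.sh and found that the script on the wiki is written for an index on the local filesystem. So I began editing recrawl.sh, replacing its rm and cp commands with the corresponding hadoop commands.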
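
The substitution described above can be sketched as a pair of wrappers that pick the local or the HDFS command. This is only an illustration: STORAGE, HADOOP_CMD, remove_tree, and copy_contents are hypothetical names, and the HDFS branch assumes that hadoop dfs expands the * pattern itself.

```shell
#!/bin/bash
# Assumptions: STORAGE and HADOOP_CMD are hypothetical variables for this sketch.
STORAGE=${STORAGE:-local}          # "local" or "hdfs"
HADOOP_CMD=${HADOOP_CMD:-hadoop}   # path to the hadoop launcher

# Remove a directory tree on whichever filesystem holds the crawl data.
remove_tree() {
  if [ "$STORAGE" = "hdfs" ]; then
    "$HADOOP_CMD" dfs -rmr "$1"
  else
    rm -rf "$1"
  fi
}

# Copy the contents of directory $1 into directory $2.
copy_contents() {
  if [ "$STORAGE" = "hdfs" ]; then
    # The pattern is quoted so the local shell passes it through unchanged.
    "$HADOOP_CMD" dfs -cp "$1/*" "$2"
  else
    cp -R "$1"/* "$2"
  fi
}
```

With wrappers like these, the same script body could serve both a local and an HDFS deployment by changing one variable.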

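I then ran the same command again, and this time hadoop failed at the generate step and could not continue. Fortunately the hadoop logs are very detailed: the Job Failed section showed a long list of "Too many open files" exceptions. After more searching on Google, I found that on the datanode side the open-file limits in /etc/security/limits.conf need to be raised by adding: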
nutch           soft    nofile          4096
nutch           hard   nofile          63536
nutch           soft    nproc          2047
nutch           hard   nproc          16384
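After changing the limits, hadoop must be restarted. This step is important; otherwise the same error is reported again.
With all of that done, I ran the earlier command once more and everything worked.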
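
Before restarting hadoop, it can help to confirm that the new limit is actually in effect. The sketch below is an assumption-laden example: nofile_at_least is a hypothetical helper, 4096 is the soft nofile value from the lines above, and the check is only meaningful when run in a fresh login session of the user that starts the hadoop daemons.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the soft open-file limit is at least $1.
nofile_at_least() {
  local limit
  limit=$(ulimit -Sn)
  [ "$limit" = "unlimited" ] && return 0
  [ "$limit" -ge "$1" ]
}

if nofile_at_least 4096; then
  echo "open-file limit OK: $(ulimit -Sn)"
else
  echo "open-file limit too low: $(ulimit -Sn) (wanted at least 4096)" >&2
fi
```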

Finally, here is my modified recrawl.sh. My shell skills are not great, so treat it as good enough to get the job done.
#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
#
# The script merges the new segments all into one segment to prevent redundant
# data. However, if your crawl/segments directory is becoming very large, I
# would suggest you delete it completely and generate a new crawl. This probably
# needs to be done every 6 months.
#
# Modified by Matthew Holt
# mholt at elon dot edu

# Print the usage message and exit.
usage() {
  echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
  echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)"
  echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)"
  echo "depth - The link depth from the root page that should be crawled."
  echo "adddays - Advance the clock # of days for fetchlist generation. [0 for none]"
  echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
  exit 1
}

# The first four arguments are required.
if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ] || [ -z "$4" ]
then
  usage
fi

tomcat_dir=$1
crawl_dir=$2
depth=$3
adddays=$4

if [ -n "$5" ]
then
  topn="-topN $5"
else
  topn=""
fi

#Sets the path to bin
nutch_dir=`dirname $0`
echo "nutch directory :$nutch_dir"

# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

hadoop="/nutch/search/bin/hadoop" # hadoop command

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  $nutch_dir/nutch generate $webdb_dir $segments_dir $topn -adddays $adddays
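  #segment=`ls -d $segments_dir/* | tail -1`
  # Take the last line of the HDFS listing and drop its final 6 characters to
  # get the segment path. This depends on the listing format printed by the
  # installed hadoop version.
  segment_tmp=`$hadoop dfs -ls $segments_dir | tail -1`
  segment_tmp_len=`expr length "$segment_tmp"`
  segment_tmp_end=`expr $segment_tmp_len - 6`
  segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`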
  echo "fetch update segment :$segment"
  echo "fetch update segment_tmp :$segment_tmp"

  $nutch_dir/nutch fetch $segment
  $nutch_dir/nutch updatedb $webdb_dir $segment
done

# Merge segments and cleanup unused segments
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir

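#for segment in `ls -d $segments_dir/* | tail -$depth`
# Read the listing line by line. A plain "for ... in" loop would split each
# line on whitespace and trim the wrong strings.
$hadoop dfs -ls $segments_dir | tail -$depth | while read -r segment_tmp
do
  segment_tmp_len=`expr length "$segment_tmp"`
  segment_tmp_end=`expr $segment_tmp_len - 6`
  segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
  echo "Removing Temporary Segment: $segment"
  #rm -rf $segment
  # Redirect stdin so the command cannot consume the rest of the listing
  $hadoop dfs -rmr $segment < /dev/null
done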

#cp -R $mergesegs_dir/* $segments_dir
#rm -rf $mergesegs_dir
$hadoop dfs -cp $mergesegs_dir/* $segments_dir
$hadoop dfs -rmr $mergesegs_dir

# Update segments
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#segment=`ls -d $segments_dir/* | tail -1`
segment_tmp=`$hadoop dfs -ls $segments_dir | tail -1`
segment_tmp_len=`expr length "$segment_tmp"`
segment_tmp_end=`expr $segment_tmp_len - 6`
segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
echo "Index segment :$segment"
$nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segment

# De-duplicate indexes
$nutch_dir/nutch dedup $new_indexes

# Merge indexes
$nutch_dir/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $tomcat_dir/WEB-INF/web.xml

# Clean up
#rm -rf $new_indexes
$hadoop dfs -rmr $new_indexes

echo "FINISHED: Recrawl completed. To conserve disk space, I would suggest"
echo " that the crawl directory be deleted once every 6 months (or more"
echo " frequent depending on disk constraints) and a new crawl generated."
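
The script above recovers each segment path by cutting a fixed 6 characters off a listing line, which ties it to one particular hadoop dfs -ls output format. A less format-dependent approach can be sketched with awk. This is only a sketch under stated assumptions: last_listed_path is a hypothetical helper, it assumes the path is the only whitespace-separated field that starts with "/", and the sample listing lines in the usage comment are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read a directory listing on stdin and print the last
# field that starts with "/", which for an ordered listing is the path on the
# final line.
last_listed_path() {
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^\//) p = $i }
       END { if (p != "") print p }'
}

# Example with made-up listing lines:
#   printf '/crawl/segments/20080101\t<dir>\n/crawl/segments/20080102\t<dir>\n' \
#     | last_listed_path
# In the script this would replace the expr length / expr substr steps:
#   segment=`$hadoop dfs -ls $segments_dir | last_listed_path`
```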

Reposted from banditjava.iteye.com/blog/247218