The Ultimate Solution for Incremental Indexing (Recrawl) in the Nutch Open-Source Search Engine

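This post explains how to run a recrawl with the open-source Nutch search engine on Hadoop's distributed computing architecture, in other words, how to solve Nutch's incremental indexing problem. None of the articles I found through Google explained the whole process in detail. After a painful round of investigation, I finally arrived at a solution that works.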

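First, write a recrawl shell script that matches your own Nutch deployment. Note: if the index lives on the local filesystem, the script calls ordinary shell commands such as rm and cp; if the index lives on HDFS, it has to call hadoop dfs -rmr or hadoop dfs -cp instead, and it should also check the result those commands return. Once the script is written, run it directly or put it in crontab to run on a schedule.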
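As a minimal sketch of what "check the result those commands return" can look like, the helper below runs a command, reports a non-zero exit status, and passes that status back to the caller. The name run_checked is hypothetical and not part of Nutch or Hadoop, and the sketch assumes the hadoop command exits non-zero on failure in your version. The paths and the crontab schedule in the comments are only illustrations built from the paths used in this post.

```shell
#!/bin/bash
# Run a command; if it fails, print an error and return its exit status.
# run_checked is a hypothetical helper name used only in this sketch.
run_checked() {
    "$@"
    local status=$?
    if [ "$status" -ne 0 ]; then
        echo "ERROR: '$*' failed with exit status $status" >&2
        return "$status"
    fi
}

# Example use inside a recrawl script (paths are assumptions):
# run_checked /nutch/search/bin/hadoop dfs -rmr /user/nutch/crawl10/newindexes || exit 1

# Example crontab entry, running the recrawl every day at 02:00 (schedule is an assumption):
# 0 2 * * * /nutch/search/bin/recrawl /nutch/tomcat/webapps/cse /user/nutch/crawl10 10 31
```

Stopping at the first failed step keeps a broken generate or fetch from being followed by a merge over incomplete data.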

There is a wiki page that provides a shell script:
http://wiki.apache.org/nutch/IntranetRecrawl#head-93eea6620f57b24dbe3591c293aead539a017ec7

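After downloading it, I happily dropped it into nutch/bin and ran:
/nutch/search/bin/recrawl /nutch/tomcat/webapps/cse /user/nutch/crawl10 10 31
The arguments are, in order: tomcat_servlet_home (the servlet path), Nutch's crawl directory on HDFS, the depth (10), and adddays (31).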

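The run reported errors, roughly saying that the mergesegs_dir directory could not be found, but the MapReduce jobs kept going, so I did not pay much attention and let it finish. When it was done, the index had not grown at all, and an extra mergesegs_dir had appeared under the nutch directory. At that point I started reading recrawl.sh and realized that the script on the wiki is written for an index on the local filesystem. So I edited recrawl.sh and replaced its rm and cp commands with the corresponding hadoop commands.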
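The substitution described above can be sketched as a pair of helpers that pick the local or the HDFS command depending on where the index is stored. This is only an illustration: STORAGE, HADOOP, remove_dir and copy_dir are hypothetical names introduced here, and the hadoop path is the one used in this post. The only Hadoop commands used are hadoop dfs -rmr and hadoop dfs -cp, the same ones the script relies on.

```shell
#!/bin/bash
# STORAGE and HADOOP are hypothetical variables introduced for this sketch.
STORAGE=${STORAGE:-local}                    # "local" or "hdfs"
HADOOP=${HADOOP:-/nutch/search/bin/hadoop}   # hadoop command (assumed path)

# Recursively remove a directory on the selected storage.
remove_dir() {
    if [ "$STORAGE" = "hdfs" ]; then
        "$HADOOP" dfs -rmr "$1"
    else
        rm -rf "$1"
    fi
}

# Recursively copy a directory on the selected storage.
copy_dir() {
    if [ "$STORAGE" = "hdfs" ]; then
        "$HADOOP" dfs -cp "$1" "$2"
    else
        cp -R "$1" "$2"
    fi
}
```

With helpers like these, the same script body could serve both deployments by setting STORAGE=hdfs on the cluster.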

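I then ran the same command again, and this time Hadoop failed at the generate step and could not continue. Fortunately Hadoop's logs are very detailed: under Job Failed there was a long list of "Too many open files" exceptions. After more googling, I found that on the datanode side the open-file limits in /etc/security/limits.conf need to be raised by adding: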
nutch           soft    nofile          4096
nutch           hard   nofile          63536
nutch           soft    nproc          2047
nutch           hard   nproc          16384
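After making the change, Hadoop must be restarted. This step is important; otherwise the same error is reported.
With that done, I ran the earlier command again and everything worked.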
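
A quick way to confirm that the raised limit is in effect is to read the soft open-file limit from a shell of the account that runs Hadoop. This sketch assumes that account is nutch, as in the limits.conf entries above, and that the check runs in a fresh login session, since limits.conf is normally applied at login. check_nofile is a hypothetical helper name.

```shell
#!/bin/bash
# Compare the current soft open-file limit with a required minimum.
# check_nofile is a hypothetical helper name used only in this sketch.
check_nofile() {
    local required=$1
    local current
    current=$(ulimit -Sn)
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]; then
        echo "ok: soft nofile limit is $current"
        return 0
    fi
    echo "too low: soft nofile limit is $current, need at least $required" >&2
    return 1
}

# Example: run as the nutch user after logging in again.
# check_nofile 4096
```

If the check still reports the old value, the Hadoop daemons started from that session would inherit the old limit as well.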

Finally, here is my modified recrawl.sh. My shell skills are not great, but it gets the job done, haha.
#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
#
# The script merges the new segments all into one segment to prevent redundant
# data. However, if your crawl/segments directory is becoming very large, I
# would suggest you delete it completely and generate a new crawl. This probably
# needs to be done every 6 months.
#
# Modified by Matthew Holt
# mholt at elon dot edu

# Print the usage message and exit.
usage() {
echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)"
echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)"
echo "depth - The link depth from the root page that should be crawled."
echo "adddays - Advance the clock # of days for fetchlist generation. [0 for none]"
echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
exit 1
}

# The first four arguments are required and must be non-empty.
if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ] || [ -z "$4" ]
then
usage
fi

tomcat_dir=$1
crawl_dir=$2
depth=$3
adddays=$4

if [ -n "$5" ]
then
topn="-topN $5"
else
topn=""
fi

#Sets the path to bin
nutch_dir=`dirname $0`
echo "nutch directory :$nutch_dir"

# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

hadoop="/nutch/search/bin/hadoop" # hadoop command

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
$nutch_dir/nutch generate $webdb_dir $segments_dir $topn -adddays $adddays
#segment=`ls -d $segments_dir/* | tail -1`
segment_tmp=`$hadoop dfs -ls $segments_dir | tail -1`
segment_tmp_len=`expr length "$segment_tmp"`
segment_tmp_end=`expr $segment_tmp_len - 6`
segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
echo "fetch update segment :$segment"
echo "fetch update segment_tmp :$segment_tmp"

$nutch_dir/nutch fetch $segment
$nutch_dir/nutch updatedb $webdb_dir $segment
done

# Merge segments and cleanup unused segments
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir

#for segment in `ls -d $segments_dir/* | tail -$depth`
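# Read the listing line by line; a plain "for" would split each line on whitespace.
$hadoop dfs -ls $segments_dir | tail -$depth | while IFS= read -r segment_tmp
do
segment_tmp_len=`expr length "$segment_tmp"`
segment_tmp_end=`expr $segment_tmp_len - 6`
segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
echo "Removing Temporary Segment: $segment"
#rm -rf $segment
# Redirect stdin so the command cannot consume the rest of the listing.
$hadoop dfs -rmr $segment < /dev/null
done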

#cp -R $mergesegs_dir/* $segments_dir
#rm -rf $mergesegs_dir
$hadoop dfs -cp $mergesegs_dir/* $segments_dir
$hadoop dfs -rmr $mergesegs_dir

# Update segments
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#segment=`ls -d $segments_dir/* | tail -1`
segment_tmp=`$hadoop dfs -ls $segments_dir | tail -1`
segment_tmp_len=`expr length "$segment_tmp"`
segment_tmp_end=`expr $segment_tmp_len - 6`
segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
echo "Index segment :$segment"
$nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segment

# De-duplicate indexes
$nutch_dir/nutch dedup $new_indexes

# Merge indexes
$nutch_dir/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $tomcat_dir/WEB-INF/web.xml

# Clean up
#rm -rf $new_indexes
$hadoop dfs -rmr $new_indexes

echo "FINISHED: Recrawl completed. To conserve disk space, I would suggest"
echo " that the crawl directory be deleted once every 6 months (or more"
echo " frequent depending on disk constraints) and a new crawl generated."
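
The script repeats the same three expr calls each time it cuts the trailing 6 characters off a line of hadoop dfs -ls output. As a hedged sketch, that step could be folded into one helper based on bash parameter expansion. strip_ls_suffix is a hypothetical name, and the helper assumes, as the script itself does, that every listing line ends with a fixed-length suffix after the path (6 characters here); other Hadoop versions may print a different listing format.

```shell
#!/bin/bash
# Remove a fixed-length suffix (default: 6 characters) from a listing line.
# strip_ls_suffix is a hypothetical helper name used only in this sketch.
strip_ls_suffix() {
    local line=$1
    local suffix_len=${2:-6}
    local keep=$(( ${#line} - suffix_len ))
    if [ "$keep" -lt 0 ]; then
        keep=0
    fi
    printf '%s\n' "${line:0:keep}"
}

# Example use in the script (same variables as recrawl.sh):
# segment=$(strip_ls_suffix "$($hadoop dfs -ls $segments_dir | tail -1)")
```

Keeping the suffix length as a parameter makes it easier to adapt the script if the listing format differs on another cluster.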
