The ultimate solution to incremental indexing (recrawl) with the Nutch open-source search engine

banditjava

Blog category: Search engines

Tags: Search engines, Hadoop, Tomcat, Servlet, IE

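This post explains how to recrawl with the Nutch open-source search engine on a Hadoop distributed cluster, that is, how to solve Nutch's incremental indexing problem. None of the articles I found through Google explained the whole process in detail. After some painful digging, I finally arrived at a working solution.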

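First, write a recrawl shell script that matches your own Nutch deployment. Note: if the index is on the local filesystem, the script calls ordinary shell commands such as rm and cp; if the index is on HDFS, it must call hadoop dfs -rmr or hadoop dfs -cp instead, and it should also check the return status of those commands. Once the script is written, run it directly or add it to crontab to run on a schedule.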
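Checking the return status can be done with a small wrapper. The following is a minimal sketch and not part of the original script: the helper name run_checked is hypothetical, and the commented usage line assumes the hadoop and mergesegs_dir variables defined in the script later in this post.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original recrawl script):
# run a command, report a failure on stderr, and pass its exit status on.
run_checked() {
  local status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    echo "ERROR: '$*' exited with status $status" >&2
  fi
  return "$status"
}

# Example use inside a recrawl script (commented out so the sketch runs anywhere):
# run_checked "$hadoop" dfs -rmr "$mergesegs_dir" || exit 1
```

Stopping at the first failed step is one possible policy; a script could also log the failure and continue, depending on how much of the cycle is safe to repeat.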

There is a wiki page that provides such a shell script:
http://wiki.apache.org/nutch/IntranetRecrawl#head-93eea6620f57b24dbe3591c293aead539a017ec7

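After downloading it, I happily dropped it into nutch/bin and ran:
/nutch/search/bin/recrawl /nutch/tomcat/webapps/cse /user/nutch/crawl10 10 31
The arguments are, in order: tomcat_servlet_home (the Tomcat servlet directory), the Nutch crawl directory on HDFS, the depth (10), and adddays (31).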

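The run reported errors, roughly that the mergesegs_dir directory could not be found, but the MapReduce jobs kept going, so I did not pay much attention and let it finish. When it was done, the index had not grown at all, and a new mergesegs_dir had appeared under the nutch directory. At that point I went through recrawl.sh and found that the script on the wiki is written for a local index. So I started modifying recrawl.sh, replacing its rm and cp commands with the Hadoop equivalents.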
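The substitution described above can be sketched as two small functions that choose between the local and the Hadoop command. This is only an illustration: the use_hdfs switch and the function names are hypothetical, and the hadoop path is the one used in the script later in this post.

```shell
#!/bin/bash
# Hypothetical switch: 1 = index lives on HDFS, 0 = index lives on local disk.
use_hdfs=${use_hdfs:-0}
# Path taken from the script in this post; adjust it to your installation.
hadoop=${hadoop:-/nutch/search/bin/hadoop}

# Remove a directory tree on whichever filesystem holds the index.
remove_dir() {
  if [ "$use_hdfs" -eq 1 ]; then
    "$hadoop" dfs -rmr "$1"
  else
    rm -rf "$1"
  fi
}

# Copy the contents of directory $1 into directory $2.
copy_contents() {
  if [ "$use_hdfs" -eq 1 ]; then
    # Quoted so that Hadoop, not the local shell, expands the glob.
    "$hadoop" dfs -cp "$1/*" "$2"
  else
    cp -R "$1"/* "$2"
  fi
}
```

With use_hdfs=0 the functions behave like the wiki script; with use_hdfs=1 they issue the hadoop dfs commands that the modified script uses.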

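I then ran the same command again, and this time Hadoop failed at the generate step and could not continue. Fortunately Hadoop's logs are very detailed: the Job Failed page showed a long list of "Too many open files" exceptions. After more googling, I found that on the datanode side the open-file limits in /etc/security/limits.conf need to be raised by adding: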
nutch           soft    nofile          4096
nutch           hard   nofile          63536
nutch           soft    nproc          2047
nutch           hard   nproc          16384
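After making the change, Hadoop must be restarted. This step is important; otherwise the same error is reported again.
With that done, I ran the command once more and everything worked.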
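To confirm that the new limit is actually in effect, the soft open-file limit can be inspected from a shell of the user that runs Hadoop. The check below is a minimal sketch: the function name is hypothetical, and 4096 is the soft nofile value from the limits.conf lines above.

```shell
#!/bin/bash
# Succeeds when the soft open-file limit of the current shell is at least $1.
nofile_at_least() {
  local wanted=$1
  local current
  current=$(ulimit -Sn)
  echo "soft nofile limit: $current (wanted at least $wanted)"
  [ "$current" = "unlimited" ] || [ "$current" -ge "$wanted" ]
}

# Run this as the user that starts the Hadoop daemons, in a fresh login session.
nofile_at_least 4096 || echo "limit not applied yet; log in again and restart Hadoop" >&2
```

Limits from limits.conf are applied when a session starts, so a shell opened before the edit still shows the old value.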

Finally, here is my modified recrawl.sh. My shell skills are not great, but it gets the job done.
#!/bin/bash

# Nutch recrawl script.
# Based on 0.7.2 script at http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html
#
# The script merges the new segments all into one segment to prevent redundant
# data. However, if your crawl/segments directory is becoming very large, I
# would suggest you delete it completely and generate a new crawl. This probably
# needs to be done every 6 months.
#
# Modified by Matthew Holt
# mholt at elon dot edu

# Print the usage text and exit. The first four arguments are required.
usage() {
  echo "Usage: recrawl servlet_path crawl_dir depth adddays [topN]"
  echo "servlet_path - Path of the nutch servlet (full path, ie: /usr/local/tomcat/webapps/ROOT)"
  echo "crawl_dir - Path of the directory the crawl is located in. (full path, ie: /home/user/nutch/crawl)"
  echo "depth - The link depth from the root page that should be crawled."
  echo "adddays - Advance the clock # of days for fetchlist generation. [0 for none]"
  echo "[topN] - Optional: Selects the top # ranking URLS to be crawled."
  exit 1
}

if [ -z "$1" ] || [ -z "$2" ] || [ -z "$3" ] || [ -z "$4" ]
then
  usage
fi

tomcat_dir=$1
crawl_dir=$2
depth=$3
adddays=$4

if [ -n "$5" ]
then
  topn="-topN $5"
else
  topn=""
fi

#Sets the path to bin
nutch_dir=`dirname $0`
echo "nutch directory :$nutch_dir"

# Only change if your crawl subdirectories are named something different
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

hadoop="/nutch/search/bin/hadoop" # hadoop command

# The generate/fetch/update cycle
for ((i=1; i <= depth ; i++))
do
  $nutch_dir/nutch generate $webdb_dir $segments_dir $topn -adddays $adddays
  # Local-index version: segment=`ls -d $segments_dir/* | tail -1`
  # On HDFS, take the last line of the directory listing (the newest segment)
  # and cut off its last 6 characters, which is expected to leave just the
  # segment path. This depends on the "hadoop dfs -ls" output format of the
  # Hadoop version in use.
  segment_tmp=`$hadoop dfs -ls $segments_dir | tail -1`
  segment_tmp_len=`expr length "$segment_tmp"`
  segment_tmp_end=`expr $segment_tmp_len - 6`
  segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
  echo "fetch update segment :$segment"
  echo "fetch update segment_tmp :$segment_tmp"

  $nutch_dir/nutch fetch $segment
  $nutch_dir/nutch updatedb $webdb_dir $segment
done

# Merge segments and cleanup unused segments
mergesegs_dir=$crawl_dir/mergesegs_dir
$nutch_dir/nutch mergesegs $mergesegs_dir -dir $segments_dir

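# Local-index version: for segment in `ls -d $segments_dir/* | tail -$depth`
# Read the listing line by line. A for loop over the command output would
# split every line on whitespace and hand partial lines to the loop body.
$hadoop dfs -ls $segments_dir | tail -$depth | while IFS= read -r segment_tmp
do
  segment_tmp_len=`expr length "$segment_tmp"`
  segment_tmp_end=`expr $segment_tmp_len - 6`
  segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
  echo "Removing Temporary Segment: $segment"
  # Local-index version: rm -rf $segment
  # stdin is redirected so the command cannot consume the rest of the listing.
  $hadoop dfs -rmr $segment < /dev/null
done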

# Local-index version:
#cp -R $mergesegs_dir/* $segments_dir
#rm -rf $mergesegs_dir
# The glob is quoted so that Hadoop, not the local shell, expands it.
$hadoop dfs -cp "$mergesegs_dir/*" $segments_dir
$hadoop dfs -rmr $mergesegs_dir

# Update segments
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir

# Index segments
new_indexes=$crawl_dir/newindexes
#segment=`ls -d $segments_dir/* | tail -1`
segment_tmp=`$hadoop dfs -ls $segments_dir | tail -1`
segment_tmp_len=`expr length "$segment_tmp"`
segment_tmp_end=`expr $segment_tmp_len - 6`
segment=`expr substr "$segment_tmp" 1 $segment_tmp_end`
echo "Index segment :$segment"
$nutch_dir/nutch index $new_indexes $webdb_dir $linkdb_dir $segment

# De-duplicate indexes
$nutch_dir/nutch dedup $new_indexes

# Merge indexes
$nutch_dir/nutch merge $index_dir $new_indexes

# Tell Tomcat to reload index
touch $tomcat_dir/WEB-INF/web.xml

# Clean up
#rm -rf $new_indexes
$hadoop dfs -rmr $new_indexes

echo "FINISHED: Recrawl completed. To conserve disk space, I would suggest"
echo " that the crawl directory be deleted once every 6 months (or more"
echo " frequent depending on disk constraints) and a new crawl generated."
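
The script above finds the segment path by cutting a fixed six characters off the end of a listing line, which ties it to one particular hadoop dfs -ls output format. A possible alternative, sketched here under the assumption that segment paths are absolute and contain no whitespace, is to pick the path out of the line by pattern. The function name and the sample listing line are hypothetical; the sample is only shaped like what the six-character cut implies.

```shell
#!/bin/bash
# Hypothetical helper: read a directory listing on stdin and print the last
# whitespace-separated field on its final line that starts with "/".
last_listed_path() {
  awk '{ last = $0 }
       END {
         n = split(last, field, /[ \t]+/)
         for (i = 1; i <= n; i++)
           if (field[i] ~ /^\//) path = field[i]
         print path
       }'
}

# Made-up listing line, for illustration only:
printf '%s\n' '/user/nutch/crawl10/segments/20080926191200 <dir>' | last_listed_path
# prints /user/nutch/crawl10/segments/20080926191200

# In a recrawl script this might be used as (commented out; needs Hadoop):
# segment=$("$hadoop" dfs -ls "$segments_dir" | last_listed_path)
```

Because the helper looks for a field rather than counting characters, it should keep working when the listing puts the path in a different column, as long as the path is the last field beginning with a slash.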


2008-09-26 19:12
Comments (6)

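#6 lovepoem 2011-06-15

Is this really incremental? Doesn't it still walk through all the URLs and compare them with the earlier ones? Does that count as incremental indexing?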

#5 freespace 2010-08-25

So far I am only using a local index; I have not tried the distributed setup.

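#4 SeanHe 2009-11-09

libinwalan wrote:

Also, the official site says
Setting adddays at 31 causes all pages will to be recrawled.
That confuses me even more.
I assumed it meant "within this many days", so why does 31 mean everything?
I just don't get it.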

adddays, which is useful for forcing pages to be retrieved even if they are not yet due to be re-fetched.
The page re-fetch interval in Nutch is controlled by the configuration property db.default.fetch.interval,
and defaults to 30 days. The adddays arguments can be used to advance the clock for fetchlist generation
(but not for calculating the next fetch time), thereby fetching pages early.
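In short, adddays is used to advance the current time when the fetch list is generated, but it is not used to compute the next fetch time. By default Nutch re-fetches the same page after 30 days, so with adddays set to 31 every page looks overdue and is selected again. The interval is configured in nutch-default.xml: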
<property><name>db.default.fetch.interval</name><value>30</value><description>(DEPRECATED) The default number of days between re-fetches of a page.
</description></property>
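
The effect of adddays can be illustrated with plain day arithmetic. The function below is a simplified model, not Nutch code: it assumes a page is due when the advanced clock has reached the last fetch day plus the re-fetch interval, and the function name and day numbers are made up for the example.

```shell
#!/bin/bash
# Simplified model of the fetchlist decision (an illustration, not Nutch code).
# All arguments are day numbers or day counts:
#   $1 last fetch day, $2 re-fetch interval, $3 today, $4 adddays
is_due() {
  local last_fetch=$1 interval=$2 today=$3 adddays=$4
  [ $((today + adddays)) -ge $((last_fetch + interval)) ]
}

# A page fetched on day 99 with the default 30-day interval, checked on day 100:
is_due 99 30 100 0  && echo "adddays=0:  due" || echo "adddays=0:  not due"
is_due 99 30 100 31 && echo "adddays=31: due" || echo "adddays=31: not due"
```

In this model, any adddays value larger than the interval makes even a page fetched today count as due, which matches the wiki's remark that 31 causes all pages to be recrawled.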

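#3 libinwalan 2009-02-27

Also, the official site says
Setting adddays at 31 causes all pages will to be recrawled.
That confuses me even more.
I assumed it meant "within this many days", so why does 31 mean everything?
I just don't get it.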

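#2 libinwalan 2009-02-27

Hello. The adddays parameter is translated as "the number of days to add to the current time",
but I have never understood what it actually means.
Could you explain it? Thanks.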

#1 a496649849 2009-02-27

Could you tell me why running the script above always reports
Fetcher: java.io.IOException: Segment already fetched!at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs(Fetcher
Thanks!!
