Nutch 1.2 on Hadoop: fixing the "Stopping at depth=0 - no more URLs to fetch" error

fei33423



Blog category: hadoop nutch

hadoop linux nutch

Stopping at depth=0 - no more URLs to fetch

I went through many versions of the suggested changes to nutch-1.2/conf/crawl-urlfilter.txt, from the Chinese examples:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*163.com/

with urls/url.txt (or urls/urllist.txt) containing:

http://www.163.com/

to the foreign ones for appache:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*appache.com/

with urls/url.txt (or urls/urllist.txt) containing:

http://www.appache.com/
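Before blaming Nutch, it can help to check outside Nutch whether a seed URL matches the accept rule at all. The following is only a minimal sketch: it assumes that grep -E treats this particular simple pattern the same way Nutch's Java regex engine does (not true for regexes in general), and matches_rule is a hypothetical helper name, not part of Nutch.

```shell
#!/bin/sh
# Accept pattern copied from crawl-urlfilter.txt, without the leading "+".
PATTERN='^http://([a-z0-9]*\.)*163.com/'

# Hypothetical helper: prints "accept" if the URL matches PATTERN,
# "no match" otherwise.
matches_rule() {
  if printf '%s\n' "$1" | grep -Eq "$PATTERN"; then
    echo "accept"
  else
    echo "no match"
  fi
}

# Try a few seed candidates against the raw pattern.
for url in 'http://www.163.com/' 'http://www.163.com' 'http://news.163.com/a.html'; do
  echo "$(matches_rule "$url")  $url"
done
```

The pattern as written requires a slash after the host, so the bare form http://www.163.com does not match in this raw test. Nutch may normalize URLs before filtering them, so this check is a hint rather than a verdict.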

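I also saw someone claim that with Nutch 0.9 the url.txt file has to contain the same address twice: with only one entry, Nutch supposedly drops the first address, so nothing is left to fetch.

I did that too, and it still did not help.

(http://www.cnblogs.com/ansen/articles/2055778.html)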

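Finally an idea occurred to me: with Hadoop, the code on every machine has to be identical.

I had already configured Hadoop earlier, and it was already running.

When I later edited crawl-urlfilter.txt, I did not scp it to the other Linux machines. So I tried copying crawl-urlfilter.txt with scp into the corresponding directory on the other Linux machines and ran the crawl again.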
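Copying the filter file to every node by hand is easy to forget, so the step can be scripted. This is a rough sketch under stated assumptions: the install path /opt/nutch-1.2 is hypothetical, Nutch is assumed to live at the same path on every node, and the hosts file (for example Hadoop's conf/slaves) is assumed to list one worker hostname per line. The helper only prints the scp commands so they can be reviewed first.

```shell
#!/bin/sh
# Assumption: NUTCH_HOME is the same path on every node.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch-1.2}"   # hypothetical install path

# Hypothetical helper: print one scp command per host in the given file.
# Blank lines and lines starting with "#" are skipped.
print_sync_commands() {
  while IFS= read -r host || [ -n "$host" ]; do
    case "$host" in
      ''|'#'*) continue ;;
    esac
    echo "scp $NUTCH_HOME/conf/crawl-urlfilter.txt $host:$NUTCH_HOME/conf/"
  done < "$1"
}
```

Running print_sync_commands on the hosts file shows what would be copied; piping that output to sh performs the copy.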

The content of urls/url.txt on HDFS was:

http://www.163.com/
http://www.163.com
http://www.163.com/
http://www.163.com/

$ nutch crawl  urls -dir crawl -depth 3 -topN 10

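Sure enough, the error above no longer appeared.

But now a "Stopping at depth=1 - no more URLs to fetch" error showed up instead, so clearly the step above had not solved the problem.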

Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.

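I wondered whether crawl-urlfilter.txt itself was misconfigured. I also do not know regular expressions very well, so I decided to search for more on crawl-urlfilter.txt.

Searching Google for "crawl-urlfilter.txt nutch" turned up an article:

The difference between crawl-urlfilter.txt and regex-urlfilter.txt: http://hi.baidu.com/kaffens/blog/item/769bb32ac4ec8628d52af17a.html

It points out that urlfilter.regex.file has to be configured to override the default urlfilter.regex.file value in nutch-default.xml.

So I copied the urlfilter.regex.file section from nutch-default.xml (a search locates it) into nutch-site.xml.

$ vi nutch-site.xml

Write the following between <configuration> and </configuration>: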

<property>
<name>urlfilter.regex.file</name>
<value>crawl-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
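After editing, it is worth confirming that the override is really present in the file each node reads. A minimal sketch, assuming the <name> and <value> elements sit on consecutive lines as in the snippet above; regex_file_setting is a hypothetical helper, and the fallback path is a guess about where nutch-site.xml lives.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> on the line that follows the
# urlfilter.regex.file <name> line in a Hadoop-style XML config file.
regex_file_setting() {
  sed -n '/<name>urlfilter\.regex\.file<\/name>/{
n
s/.*<value>\(.*\)<\/value>.*/\1/p
}' "$1"
}

# Assumed location of the site config; adjust to your install.
CONF="${NUTCH_HOME:-.}/conf/nutch-site.xml"
if [ -f "$CONF" ]; then
  echo "urlfilter.regex.file = $(regex_file_setting "$CONF")"
fi
```

If this prints nothing, or a name other than crawl-urlfilter.txt, the override is not in effect in that file.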

In addition, to keep dynamic pages from being filtered out, change this rule in crawl-urlfilter.txt:

# skip URLs containing certain characters as probable queries, etc.
+[?*!@=]

(the leading - changed to +)

See http://a280606790.iteye.com/blog/833607
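The effect of flipping that sign depends on how the rules are evaluated: the comments in Nutch's regex filter file say the first matching pattern decides, and a URL matching no pattern is ignored. The sketch below emulates that order in shell. It is an approximation, assuming grep -E behaves like Java regex for these simple patterns; filter_url and the sample URL are hypothetical.

```shell
#!/bin/sh
# Emulates first-match-wins evaluation of a +/- rule file.
# Usage: filter_url RULES_FILE URL   -> prints "accept" or "reject"
filter_url() {
  while IFS= read -r rule || [ -n "$rule" ]; do
    case "$rule" in
      ''|'#'*) continue ;;                    # skip blanks and comments
    esac
    sign=$(printf '%s' "$rule" | cut -c1)     # "+" or "-"
    regex=$(printf '%s' "$rule" | cut -c2-)   # the pattern itself
    if printf '%s\n' "$2" | grep -Eq -- "$regex"; then
      if [ "$sign" = "+" ]; then echo "accept"; else echo "reject"; fi
      return 0
    fi
  done < "$1"
  echo "reject"                               # no rule matched
}

strict=$(mktemp); relaxed=$(mktemp)
printf '%s\n' '-[?*!@=]' '+^http://([a-z0-9]*\.)*163.com/' > "$strict"
printf '%s\n' '+[?*!@=]' '+^http://([a-z0-9]*\.)*163.com/' > "$relaxed"

url='http://www.163.com/list?id=1'            # hypothetical dynamic URL
echo "with -: $(filter_url "$strict" "$url")"   # prints "with -: reject"
echo "with +: $(filter_url "$relaxed" "$url")"  # prints "with +: accept"
rm -f "$strict" "$relaxed"
```

One consequence of this ordering: with +[?*!@=] listed first, any URL containing one of those characters is accepted regardless of its host, so the crawl may wander outside 163.com.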

Then run the crawl again.

$ nutch crawl  urls -dir crawl -depth 3 -topN 10

When it finishes, run:

bin/nutch readdb crawl/crawldb -dump tmp/entiredump
bin/hadoop dfs -copyToLocal tmp/entiredump /home/lf/output/entiredump
less /home/lf/output/entiredump/*

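Note that the crawl in crawl/crawldb must be the same directory that was passed as -dir crawl when running the nutch crawl.

If you can see content in the dump, the crawl succeeded.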
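Rather than paging through the dump by eye, the records can be summarized by status. This is a sketch that assumes each record in the dump contains a line starting with "Status:" (for example "Status: 2 (db_fetched)"); that format is an assumption about the dump output, and summarize_dump is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: count dump records per "Status:" line.
# Argument: the local directory holding the copied dump files.
summarize_dump() {
  cat "$1"/* | awk '/^Status:/ { n[$0]++ } END { for (s in n) print n[s], s }'
}

# Local copy made by the copyToLocal step; adjust the path as needed.
DUMP_DIR="${DUMP_DIR:-/home/lf/output/entiredump}"
if [ -d "$DUMP_DIR" ]; then
  summarize_dump "$DUMP_DIR"
fi
```

If every record is still unfetched, the filter is probably still rejecting the outlinks. bin/nutch readdb crawl/crawldb -stats gives a similar overview without making a dump.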

2011-07-30 14:13