Nutch distributed search: cluster with the index on HDFS

leibnitz

Blog categories:

nutch
java
search

This process is also simple. The steps are:

a. Put each index into HDFS.

b. On the search servers, make the three Hadoop configuration files (hdfs-site.xml, core-site.xml, mapred-site.xml) identical to the corresponding files on the Hadoop slaves.

c. Look up the HDFS path of each index, then start the search servers one by one, each with its own path.

d. Start the web container.
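
As a rough illustration of step c, the sketch below derives one start command per index path, plus the matching search-servers.txt content. The class name, host names, port and paths are all hypothetical. The command form `bin/nutch server <port> <dir>` and the one "host port" pair per line layout of search-servers.txt are assumptions based on Nutch 1.x, so check them against your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: it only builds strings.
class SearchClusterPlan {
    // One start command per HDFS path; each command is meant to be run
    // on a different search server, so the same port is reused (assumption).
    static List<String> startCommands(List<String> hdfsDirs, int port) {
        List<String> cmds = new ArrayList<>();
        for (String dir : hdfsDirs) {
            cmds.add("bin/nutch server " + port + " " + dir);
        }
        return cmds;
    }

    // Content of search-servers.txt for the web container: one "host port" per line.
    static String searchServersTxt(List<String> hosts, int port) {
        StringBuilder sb = new StringBuilder();
        for (String host : hosts) {
            sb.append(host).append(' ').append(port).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Example values only; replace with your own paths and hosts.
        List<String> dirs = Arrays.asList("/user/nutch/crawl-a", "/user/nutch/crawl-b");
        List<String> hosts = Arrays.asList("search1", "search2");
        for (String cmd : startCommands(dirs, 9999)) {
            System.out.println(cmd);
        }
        System.out.print(searchServersTxt(hosts, 9999));
    }
}
```

Because step b points the search servers at the same Hadoop configuration as the slaves, the paths passed on the command line are expected to resolve against HDFS rather than the local disk.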

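Note:

Some people say that with distributed search every query spawns a MapReduce job, so performance will be terrible...

I think these people do not understand how Hadoop works; the claim is nonsense.

The client sends its request to the search servers over RPC, and the servers do the actual search work, still relying on Lucene to do it. Hadoop only gives Lucene transparent file-stream access to the index; no MapReduce job is started at all.

If you are still not convinced, start only HDFS with start-dfs.sh, leave the MapReduce daemons down, and run a search to verify it.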

How the Nutch search RPC call works

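1. The client first obtains an RPCSearchBean proxy. When search(Query) is called, the call's method name, parameters and so on are first converted into an RPC.Invocation, which is then marshaled (its parameters serialized) and sent over a socket to the first server (the remote) defined in search-servers.txt.

2. The remote side runs a resident thread, DistributedSearch$Server, which handles the client's requests. The process is:

a. Following the consumer pattern, it creates a listener, handler(s) and a responder. The handler(s) take each call and invoke the corresponding method on the local NutchBean. To do this, the parameters in the byte stream are deserialized into an Invocation, and the local invocation is made according to its class name, method name and other parameters.

3. The responder marshals the result back into a byte stream and sends it to the client, where the proxy deserializes it into the object that the method actually returns.

4. Repeat the same for the other servers listed in search-servers.txt.

5. Merge all the results. Done.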
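
The round trip described in steps 1 to 5 can be sketched in plain Java. This is a toy model under stated assumptions, not Nutch or Hadoop code: `Hit`, `ToySearcher`, `ToySearchServer` and `RpcSearchSketch` are hypothetical names, a byte array stands in for the socket, the query is a plain string, and "merge" is assumed to mean sorting by descending score and keeping the top n.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-ins only; none of these are Nutch or Hadoop classes.
class Hit {
    final String doc;
    final float score;
    Hit(String doc, float score) { this.doc = doc; this.score = score; }
}

interface ToySearcher {
    List<Hit> search(String query);
}

// Server side: owns a local searcher (the NutchBean role) and dispatches calls to it.
class ToySearchServer {
    private final ToySearcher local;
    ToySearchServer(ToySearcher local) { this.local = local; }

    // Handler and responder in one method: unmarshal the invocation,
    // invoke the named method reflectively, marshal the hits back.
    byte[] handle(byte[] request) throws Exception {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(request));
        String methodName = in.readUTF();
        String query = in.readUTF();
        Method m = ToySearcher.class.getMethod(methodName, String.class);
        @SuppressWarnings("unchecked")
        List<Hit> hits = (List<Hit>) m.invoke(local, query);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(hits.size());
        for (Hit h : hits) {
            out.writeUTF(h.doc);
            out.writeFloat(h.score);
        }
        return bos.toByteArray();
    }
}

class RpcSearchSketch {
    // Client side: a dynamic proxy that marshals the call into bytes, hands them
    // to the server (a socket in the real system) and unmarshals the reply.
    // Only search(String) is supported by this toy handler.
    static ToySearcher proxyFor(ToySearchServer server) {
        return (ToySearcher) Proxy.newProxyInstance(
            ToySearcher.class.getClassLoader(), new Class<?>[] { ToySearcher.class },
            (proxy, method, callArgs) -> {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                DataOutputStream out = new DataOutputStream(bos);
                out.writeUTF(method.getName());
                out.writeUTF((String) callArgs[0]);
                DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(server.handle(bos.toByteArray())));
                List<Hit> hits = new ArrayList<>();
                for (int n = in.readInt(); n > 0; n--) {
                    hits.add(new Hit(in.readUTF(), in.readFloat()));
                }
                return hits;
            });
    }

    // Steps 4 and 5: ask every server in turn, then merge the partial results.
    static List<Hit> searchAll(List<ToySearchServer> servers, String query, int n) {
        List<Hit> merged = new ArrayList<>();
        for (ToySearchServer s : servers) {
            merged.addAll(proxyFor(s).search(query));
        }
        merged.sort((a, b) -> Float.compare(b.score, a.score));
        return new ArrayList<>(merged.subList(0, Math.min(n, merged.size())));
    }

    public static void main(String[] args) {
        List<ToySearchServer> servers = new ArrayList<>();
        servers.add(new ToySearchServer(q -> Arrays.asList(new Hit("a1", 0.9f), new Hit("a2", 0.2f))));
        servers.add(new ToySearchServer(q -> Arrays.asList(new Hit("b1", 0.5f))));
        for (Hit h : searchAll(servers, "nutch", 2)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```

Nothing in this path submits a job: the client only serializes a call, and each server only runs a local method and serializes the answer, which is consistent with the note above that no MapReduce is involved in a query.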

See also:

cluster-local

hdfs data flow-part reading

2011-10-17 02:14