Nutch Study Notes: Deploying the Search Service (Tomcat)

nhy520

Once the crawl has finished, Nutch can be deployed to Tomcat to serve search requests. The steps are as follows:

1. Install the WAR file
   Copy the WAR file $nutch$/nutch-*.war into the directory $tomcat$/webapps/:
   cp $nutch$/nutch-*.war $tomcat$/webapps/nutch.war
   The search home page is then available at http://127.0.0.1:8080/nutch

   If the file is saved as ROOT.war instead, the corresponding URL is http://127.0.0.1:8080
   cp $nutch$/nutch-*.war $tomcat$/webapps/ROOT.war
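The copy step above can be wrapped in a small helper. This is only a sketch: the function name deploy_war and its arguments are made up for illustration, and it assumes exactly one nutch-*.war exists in the Nutch directory (if several match, the first one is used).

```shell
# deploy_war NUTCH_HOME TOMCAT_HOME [CONTEXT]
# Hypothetical helper: copies the Nutch WAR into Tomcat's webapps directory.
# CONTEXT defaults to "nutch"; pass "ROOT" to serve the application at "/".
deploy_war() {
    nutch_home=$1
    tomcat_home=$2
    context=${3:-nutch}

    # Expand the glob; "$1" becomes the first matching file (or the
    # unexpanded pattern if nothing matched).
    set -- "$nutch_home"/nutch-*.war
    if [ ! -f "$1" ]; then
        echo "no nutch-*.war found in $nutch_home" >&2
        return 1
    fi

    cp "$1" "$tomcat_home/webapps/$context.war"
}
```

For example, `deploy_war /opt/nutch /opt/tomcat` would produce /opt/tomcat/webapps/nutch.war, where both paths are placeholders for your own installation directories.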

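2. Specify the search data directory
   The search application needs to be told where the data files are.
   Assuming the WAR file was saved as nutch.war, restart Tomcat so that it is unpacked into the directory $tomcat$/webapps/nutch/.
   Open the file $tomcat$/webapps/nutch/WEB-INF/classes/nutch-site.xml and add the searcher.dir
   property. For example, if the data files are stored in /local/nutch/crawl, add: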

   <property>
      <name>searcher.dir</name>
      <value>/local/nutch/crawl</value>
   </property>

   This tells search.jsp where the data files are.
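To double-check the setting before restarting, a quick sketch like the following can read the value back out of nutch-site.xml. The function name searcher_dir is hypothetical, and the awk script is not a real XML parser: it assumes the <name> and <value> elements sit on separate lines, as in the snippet above.

```shell
# searcher_dir FILE
# Hypothetical helper: prints the <value> that follows
# <name>searcher.dir</name> in FILE, or nothing if the property is absent.
searcher_dir() {
    awk '
        /<name>searcher\.dir<\/name>/ { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")      # drop everything up to the opening tag
            sub(/<\/value>.*/, "")    # drop the closing tag and what follows
            print
            exit
        }
    ' "$1"
}
```

A possible use is `dir=$(searcher_dir "$tomcat/webapps/nutch/WEB-INF/classes/nutch-site.xml"); [ -d "$dir" ] || echo "missing: $dir"`, where `$tomcat` stands for your Tomcat directory.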

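3. Make Tomcat accept Chinese input
   To search with Chinese keywords, Tomcat must be able to decode Chinese input. To enable this, edit Tomcat's
   configuration file $tomcat$/conf/server.xml and add two attributes, URIEncoding
   and useBodyEncodingForURI, to the Connector on port 8080, as follows: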

    <Connector port="8080" maxHttpHeaderSize="8192"
               maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
               enableLookups="false" redirectPort="8443" acceptCount="100"
               connectionTimeout="20000" disableUploadTimeout="true"
               URIEncoding="UTF-8" useBodyEncodingForURI="true"/>
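To see why the encoding matters: a browser sends a Chinese keyword as percent-encoded UTF-8 bytes, and older Tomcat releases decode the URI as ISO-8859-1 unless URIEncoding says otherwise. The sketch below shows what those bytes look like. The function name urlencode_utf8 is made up for illustration, and for simplicity it encodes every byte, including ASCII, which is more than a browser would do; it also assumes the script and terminal use UTF-8.

```shell
# urlencode_utf8 STRING
# Hypothetical helper: percent-encodes every byte of STRING.
urlencode_utf8() {
    printf '%s' "$1" |
        od -An -tx1 |               # dump the raw bytes as hex pairs
        tr -d ' \t\n' |             # join them into one string
        sed 's/\(..\)/%\1/g' |      # prefix each byte with '%'
        tr 'a-f' 'A-F'              # upper-case the hex digits
}
```

With UTF-8 input, `urlencode_utf8 '中文'` yields %E4%B8%AD%E6%96%87, three bytes per character. Tomcat has to interpret those six bytes as UTF-8 to recover the original two characters.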

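4. To search large sites such as web portals, a few more settings need to change, because the defaults are tuned for intranet search.
   Change db.max.outlinks.per.page, which sets the maximum number of links processed per page; links beyond that number are ignored. The default is 100; 1000 is enough.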

<property>
<name>db.max.outlinks.per.page</name>
<value>1000</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
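The rule in the description can be pictured with a tiny shell sketch. This is not Nutch code: limit_outlinks is a hypothetical stand-in that reads one outlink per line and applies the same "nonnegative means cap, negative means unlimited" logic.

```shell
# limit_outlinks MAX
# Hypothetical illustration of db.max.outlinks.per.page:
# keeps at most MAX lines of stdin when MAX >= 0, otherwise keeps all.
limit_outlinks() {
    max=$1
    if [ "$max" -ge 0 ]; then
        head -n "$max"
    else
        cat
    fi
}
```

For instance, piping three URLs through `limit_outlinks 2` keeps only the first two, while `limit_outlinks -1` passes all three through.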

   Change urlfilter.order, which sets the order in which URL filters are applied. I prefer regular expressions, so I set it to org.apache.nutch.urlfilter.regex.RegexURLFilter.
<property>
<name>urlfilter.order</name>
<value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
<description>The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>
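The description's point that filters are AND'ed, so order affects cost but not the result, can be illustrated with two stand-in filters. Both functions below are hypothetical and use plain grep rather than Nutch's filter plugins; example.com and the image-suffix rule are placeholders.

```shell
# Hypothetical stand-in for a prefix filter: keep URLs under one site.
prefix_filter() {
    grep '^http://example\.com/'
}

# Hypothetical stand-in for a regex filter: drop image URLs.
regex_filter() {
    grep -Ev '\.(gif|jpg|png)$'
}

# A small URL list to run through the filters, one URL per line.
sample_urls() {
    printf '%s\n' \
        'http://example.com/a.html' \
        'http://example.com/logo.gif' \
        'http://other.org/b.html'
}
```

Running `sample_urls | regex_filter | prefix_filter` and `sample_urls | prefix_filter | regex_filter` leaves the same single URL, http://example.com/a.html. Only the amount of work each stage sees differs.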

5. Restart Tomcat again
   Open the URL "http://127.0.0.1:8080/nutch" in a browser. That's it; now enjoy Nutch.
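The restart can be scripted with the shutdown.sh and startup.sh scripts that ship in Tomcat's bin directory. The sketch below is one possible wrapper: the function name restart_tomcat and the default five-second pause are assumptions, not values from the original post.

```shell
# restart_tomcat TOMCAT_HOME [WAIT_SECONDS]
# Hypothetical helper: stops Tomcat, waits briefly, then starts it again.
restart_tomcat() {
    tomcat_home=$1
    wait_seconds=${2:-5}    # assumed pause to let the old process exit

    # shutdown.sh fails if Tomcat is not running; that is not fatal here.
    "$tomcat_home/bin/shutdown.sh" || true
    sleep "$wait_seconds"
    "$tomcat_home/bin/startup.sh"
}
```

After `restart_tomcat /opt/tomcat` (the path is a placeholder), a quick `curl -I http://127.0.0.1:8080/nutch` should show whether the application is responding.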

2009-05-10 01:30