Installing Nutch (reposted)

zhangxiang390

Blog category: Nutch series

Tags: Nutch, Tomcat, Lucene, search engine, Hadoop

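Nutch is open source and gives developers who are interested in search engines a good platform to learn from. Starting with version 0.8 it uses Hadoop for its distributed file system, which widened the gap between Nutch and other open-source search engines. Nutch provides an efficient, open, easy-to-operate search engine, and many of its internal details are worth borrowing: the Hadoop distributed file system, a plugin mechanism similar to Eclipse's, Apache HttpClient for accessing websites, the HTML parser from org.cyberneko.html for parsing pages, and so on.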

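   Official Nutch website: http://lucene.apache.org/nutch/
   Nutch introductory tutorial: http://lucene.apache.org/nutch/tutorial8.html

   The following describes in detail how to install Nutch (0.8 and later; the paths in the examples use 0.9):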

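I. Environment:
      1. Operating system: Windows XP, Windows 2000+
      2. Java VM: Java 1.5.x, with JAVA_HOME set as an environment variable
      3. Cygwin. This is not strictly required, but the scripts that Nutch provides only run in a shell environment, so Cygwin is used to emulate the shell commands.
      4. Nutch version: 0.9+
      5. Tomcat: 5.0+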
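
The JAVA_HOME requirement above can be checked from a Cygwin shell before going any further. This is only a minimal sketch: the function name check_java_home is made up for illustration and is not part of Nutch or Cygwin.

```shell
# Hypothetical helper (not part of Nutch): check that a directory looks like a Java home.
check_java_home() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    # Accept either file name so the check works under Cygwin and on Unix.
    if [ -x "$dir/bin/java" ] || [ -x "$dir/bin/java.exe" ]; then
        echo "java found under $dir"
        return 0
    fi
    echo "no java executable under $dir/bin" >&2
    return 1
}

# Non-fatal check of the current environment.
check_java_home "${JAVA_HOME:-}" || echo "set JAVA_HOME before running the nutch script" >&2
```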

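II. Installing Cygwin:

      The installation steps for Cygwin are not covered here, only how to tell whether the installation is usable: look for bin\sh.exe under the Cygwin installation directory (for example x:\cygwin\bin\sh.exe). If that file exists, Cygwin is ready to use.
      If Cygwin has been removed and then refuses to install again, search the registry and delete every key that contains "cygwin".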

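III. Installing and configuring Nutch:

1. Download version 0.9 or later from http://lucene.apache.org/nutch/release/, unpack it, and put it in a directory of your choice (for example d:).

2. Under nutch/bin, create a directory named urls, then create a file url.txt in it and write into it a URL you want to crawl, for example: http://www.sina.com.cn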
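
Step 2 can be done from the Cygwin shell. A minimal sketch, assuming the current directory is already nutch/bin:

```shell
# Run from the nutch/bin directory; the paths are relative to it.
mkdir -p urls
# The seed file holds one URL per line; this is the example URL from the text.
printf '%s\n' 'http://www.sina.com.cn' > urls/url.txt
cat urls/url.txt
```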

3. Open nutch\conf\crawl-urlfilter.txt and replace the string MY.DOMAIN.NAME with the domain of the URL in url.txt. A simpler option is to delete MY.DOMAIN.NAME altogether, leaving only +^http://([a-z0-9]*\.)* which allows every http site to be crawled.
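
The effect of the two variants of the filter rule can be tried out in the shell. This is an illustration only: Nutch evaluates the rules with Java regular expressions, and grep -E is used here as a stand-in for these simple patterns. The helper name matches and the sample URLs are assumptions made for the example; the leading + in crawl-urlfilter.txt means "accept" and is not part of the regular expression.

```shell
# Hypothetical helper: succeed if the URL matches the regular expression.
matches() {
    # $1 = regex (without the leading +), $2 = URL
    printf '%s\n' "$2" | grep -Eq "$1"
}

# Rule restricted to one domain, and the rule left after deleting MY.DOMAIN.NAME.
restricted='^http://([a-z0-9]*\.)*sina\.com\.cn/'
open='^http://([a-z0-9]*\.)*'

for url in 'http://www.sina.com.cn/' \
           'http://news.sina.com.cn/index.html' \
           'http://www.example.org/'; do
    if matches "$restricted" "$url"; then r=match; else r=nomatch; fi
    if matches "$open" "$url"; then o=match; else o=nomatch; fi
    echo "$url restricted=$r open=$o"
done
```

With the restricted rule only the two sina.com.cn URLs match; with the open rule all three do.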

4. Open nutch\conf\nutch-site.xml and insert the following inside <configuration></configuration>:

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header. It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parenthesis after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

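Give each property a value of your own (in the <value> element that accompanies its <name>). As its description states, http.agent.name must not be empty. These settings exist because Nutch follows the robots protocol: when it sends a request, it passes this identifying information to the site being crawled so that the site can recognize the crawler.

The configuration above is the setup for an intranet crawl.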
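
For reference, here is a sketch of what a filled-in file might look like, written to a scratch file rather than directly into conf/. Every value is a placeholder chosen for illustration, and the file name nutch-site.example.xml is likewise an assumption.

```shell
# Write a filled-in example next to the current directory; copy it into conf/ by hand if wanted.
cat > nutch-site.example.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>mytestcrawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>test crawl for learning Nutch</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/crawler.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>info at example dot com</value>
  </property>
</configuration>
EOF

# Show the name/value pairs that were written.
grep -E '<(name|value)>' nutch-site.example.xml
```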

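IV. Running Nutch

   Because Nutch was configured above for a single site, we also run it as a single-site crawl; whole-web crawling will be covered in a later article.

   First, look at the command that Nutch provides: nutch crawl urls -dir crawl -depth 3 -topN 50
   crawl: tells Nutch to run the main method of its crawl class.
   urls: the directory that holds the url.txt file listing the URLs to crawl. Note that this name must match your actual directory name; if your directory is called search, write search here as well.
   -dir crawl: where the crawl output is saved; you will find it under nutch/bin.
   -depth 3: the number of crawl rounds, also called the depth, although "rounds" describes it better. Setting it to 1 is recommended for testing.
   -topN 50: the maximum number of pages fetched in each round.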
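
A quick way to reason about these two options, assuming that each of the -depth rounds selects at most -topN pages: their product is an upper bound on the number of pages one run can fetch. The variable names are chosen for illustration.

```shell
# Upper bound on the pages fetched by one crawl run, assuming each of the
# "depth" rounds selects at most "topN" pages.
depth=3
topN=50
echo "at most $((depth * topN)) pages"
```

The real count is usually lower, since the first round starts from the seed URLs only.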

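      Steps to run the command:
      1. Open the Cygwin shell.
      2. Use cd to change to the nutch\bin directory (for example: cd /cygdrive/d/nutch-0.9/bin).
      3. Run: sh nutch crawl urls -dir crawl -depth 3 -topN 50

   The detailed crawl log can be found in the nutch/logs directory. Look for lines such as "INFO fetcher.Fetcher - fetching http://XXXXXXX"; these record the fetching process.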
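
The fetched URLs can be pulled out of the log with grep. A minimal sketch, assuming the log lines have the form quoted above; the helper name fetched_urls is hypothetical, and the directory is searched recursively so that no particular log file name has to be assumed.

```shell
# Hypothetical helper: print the URLs that the fetcher logged under a log directory.
fetched_urls() {
    logdir="$1"
    # -r searches every file under the directory, -h omits the file names.
    grep -rh 'fetcher.Fetcher - fetching ' "$logdir" 2>/dev/null |
        sed 's/.*fetcher\.Fetcher - fetching //'
}

# Run from the Nutch installation directory after a crawl.
if [ -d logs ]; then
    fetched_urls logs | sort -u
fi
```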

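V. Searching:
Nutch provides a web search page similar to those of Google and Baidu. Find the nutch-0.9.war file in the Nutch package, put it in the tomcat/webapps directory, and once Tomcat has unpacked it, change webapps/nutch-0.9/WEB-INF/classes/nutch-site.xml to contain the following: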

<property>
<name>searcher.dir</name>
<value>d:\\nutch-0.9\\bin\\crawl</value>
</property>

The content of <value> is the location of the crawl directory produced by the crawl above; the search page reads it to answer queries.
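
Before starting Tomcat it can help to confirm that the crawl directory really contains output. The sketch below is hedged: the list of subdirectories is an assumption about the layout the crawl command produces, and the helper name check_crawl_dir is hypothetical. The path is the Cygwin form of d:\nutch-0.9\bin\crawl, following the cd example earlier.

```shell
# Hypothetical helper: report which expected subdirectories are missing from a crawl directory.
# The subdirectory names are an assumed layout, not taken from the original article.
check_crawl_dir() {
    dir="$1"
    missing=0
    for sub in crawldb linkdb segments indexes index; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub"
            missing=1
        fi
    done
    return "$missing"
}

# Non-fatal check of the directory named in searcher.dir.
check_crawl_dir /cygdrive/d/nutch-0.9/bin/crawl || echo "crawl output incomplete or path wrong" >&2
```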

   Once this is configured, start Tomcat, open http://localhost:8080/nutch-0.9, and enter a keyword to see the results.

Chinese text may come out garbled. This has little to do with Nutch; the main cause is Tomcat 5.0. The fix is to modify the Connector element in Tomcat's server.xml:

<Connector port="8080"
    maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
    enableLookups="false" redirectPort="8443" acceptCount="100"
    connectionTimeout="20000" disableUploadTimeout="true" 
    URIEncoding="UTF-8" useBodyEncodingForURI="true" />

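The attributes URIEncoding="UTF-8" and useBodyEncodingForURI="true" are the ones to add. Without them, the characters typed into the search box are decoded with the default encoding and cannot be parsed correctly.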
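
Why the wrong decoding garbles the query can be seen in the shell. This sketch assumes the browser sends the query as UTF-8 and that the default decoding without URIEncoding is ISO-8859-1; it uses iconv, and the helper name utf8_bytes is made up for the example.

```shell
# The three bytes below are the UTF-8 encoding of the character U+4E2D,
# which would appear percent-encoded as %E4%B8%AD in the query string.
utf8_bytes() { printf '\344\270\255'; }

# Read as UTF-8, the bytes are one Chinese character (shown correctly on a UTF-8 terminal).
utf8_bytes; echo

# Read as ISO-8859-1, each byte becomes a separate character, which is the garbled text;
# iconv re-encodes that misreading as UTF-8 so the terminal can display it.
utf8_bytes | iconv -f ISO-8859-1 -t UTF-8; echo
```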


2008-10-24 09:25