Installing nutch-1.0 on Linux: an intranet web crawler and search implementation

luckaway

Blog category: Search engines

Tags: Linux, full-text search, Tomcat, Lucene, XSL

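Nutch is a complete open-source full-text search package. It is built on top of Lucene Java and adds web-specific features such as a web crawler, a link-graph database, and parsers for HTML and other document formats.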

Downloading Nutch

1. Choose the directory to install Nutch in. I install it directly under /home/admin.

[root@search-test1 ~]# cd /home/admin/

2. Download nutch-1.0:

[root@search-test3 admin]# wget "http://labs.xiaonei.com/apache-mirror/lucene/nutch/nutch-1.0.tar.gz"

3. Extract nutch-1.0.tar.gz and create a symlink

[root@search-test3 admin]# tar -zxf nutch-1.0.tar.gz 
[root@search-test3 admin]# ln -s nutch-1.0 nutch

The Nutch entries under /home/admin:

[root@search-test3 admin]# ll|grep 'nutch'
lrwxrwxrwx 1 root root        9 01-12 14:57 nutch -> nutch-1.0
drwxr-xr-x 9 root root     4096 2009-03-24 nutch-1.0
-rw-r--r-- 1 root root 86557549 2009-03-28 nutch-1.0.tar.gz
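If the steps above are repeated (for example after a later upgrade), a plain ln -s fails because the link already exists. A minimal sketch of a re-runnable variant follows; the helper names relink and link_points_to are made up for illustration and are not part of Nutch.

```shell
# Hypothetical helper: (re)create symlink $2 pointing at $1.
# -f replaces an existing link; -n stops ln from descending into the
# directory that an existing link points to.
relink() {
  ln -sfn "$1" "$2"
}

# Hypothetical helper: succeeds only if $1 is a symlink whose target is exactly $2.
link_points_to() {
  [ -L "$1" ] && [ "$(readlink "$1")" = "$2" ]
}

# Usage from /home/admin, matching the listing above:
#   relink nutch-1.0 nutch && link_points_to nutch nutch-1.0 && echo "link ok"
```

Keeping the version-free name nutch as a link means the later steps can use /home/admin/nutch regardless of which release directory it points to.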

Configuring the intranet crawler

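1. Under /home/admin/nutch create a urls directory, and inside it a file taizhou.txt, to crawl a website from Taizhou. (Many large sites block unknown crawlers like this one, which is why I ended up choosing taizhou.com.)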

[root@search-test3 nutch]# mkdir /home/admin/nutch/urls;touch /home/admin/nutch/urls/taizhou.txt
.....
[root@search-test3 nutch]# cat /home/admin/nutch/urls/taizhou.txt
http://www.taizhou.com
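The seed file is just one URL per line, so it can also be written non-interactively and checked before crawling. A minimal sketch, assuming the same paths as above; write_seeds and count_bad_seeds are hypothetical helper names.

```shell
# Hypothetical helper: write one seed URL per line into <dir>/<file>.
write_seeds() {
  dir=$1; file=$2; shift 2
  mkdir -p "$dir" && printf '%s\n' "$@" > "$dir/$file"
}

# Hypothetical helper: print how many lines are NOT absolute http(s) URLs
# (empty lines count as bad). grep -c exits 1 on a zero count, hence "|| true".
count_bad_seeds() {
  grep -Evc '^https?://[^[:space:]]+$' "$1" || true
}

# Usage with the paths from this post:
#   write_seeds /home/admin/nutch/urls taizhou.txt http://www.taizhou.com
#   count_bad_seeds /home/admin/nutch/urls/taizhou.txt
```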

2. Edit conf/crawl-urlfilter.txt and replace "MY.DOMAIN.NAME" with "taizhou.com", as shown:

+^http://([a-z0-9]*\.)*taizhou.com/
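To see which URLs this rule lets through, the pattern can be tried out with grep -E. This is only an approximation: Nutch evaluates the rule with Java regular expressions, not grep, although for a pattern this simple the two should agree. The helper name check_url is made up for this sketch.

```shell
# The pattern from conf/crawl-urlfilter.txt without the leading "+",
# which in that file only marks the rule as "include".
pattern='^http://([a-z0-9]*\.)*taizhou.com/'

# Hypothetical helper: prints "accept" or "reject" for one URL.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept"
  else
    echo "reject"
  fi
}

check_url 'http://www.taizhou.com/'         # accept
check_url 'http://news.taizhou.com/a.html'  # accept: any subdomain matches
check_url 'http://www.example.com/'         # reject
check_url 'http://taizhouxcom/'             # accept: the "." before com is unescaped
```

The last case shows that the unescaped dot matches any character. Writing taizhou\.com in the rule would tighten it.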

3. Edit conf/nutch-site.xml to configure the HTTP header information the crawler sends. Only some of the properties are shown here.

[root@search-test3 conf]# cat nutch-site.xml   
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>8qiu-spider</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value>this is a crawler of 8qiu</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>www.8qiu.com</value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>javalover@yeah.net</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
</configuration>
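Since http.agent.name must not be empty, it can be worth reading the value back before starting a crawl. The sketch below is a line-oriented shortcut, not a real XML parser: it assumes each <name> and its <value> sit on separate consecutive lines, as in the file above. get_property is a hypothetical helper name.

```shell
# Hypothetical helper: print the <value> that follows <name>$2</name> in file $1.
get_property() {
  awk -v name="$2" '
    index($0, "<name>" name "</name>") { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, "")
      print; exit
    }
  ' "$1"
}

# Usage with the file from this post:
#   get_property /home/admin/nutch/conf/nutch-site.xml http.agent.name
#   [ -n "$(get_property conf/nutch-site.xml http.agent.name)" ] || echo "agent name missing"
```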

4. Start the crawler

/home/admin/nutch/bin/nutch crawl /home/admin/nutch/urls/ -dir /home/admin/nutch/crawl -depth 3 -topN 100
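As I understand the crawl tool, -depth is the number of fetch rounds starting from the seed pages and -topN caps the pages fetched per round, so the command above fetches at most 3 x 100 pages. The sketch below does that arithmetic and adds a pre-flight check, on the assumption that the tool refuses to start when the -dir target already exists. Both helper names are made up.

```shell
# Upper bound on fetched pages: rounds (-depth) times pages per round (-topN).
max_pages() {
  echo $(( $1 * $2 ))
}

# Hypothetical pre-flight check: succeeds only if the output directory is absent.
crawl_dir_free() {
  [ ! -e "$1" ]
}

max_pages 3 100   # prints 300

# Usage before a crawl:
#   crawl_dir_free /home/admin/nutch/crawl || echo "crawl dir exists; move it away first"
```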

Installing the web runtime environment

1. Install Tomcat. My Tomcat directory is /usr/local/tomcat.

2. Move the nutch-1.0 war file into Tomcat's webapps directory

mv nutch-1.0.war /usr/local/tomcat/webapps/

3. Start Tomcat

[root@search-test3 nutch]# /usr/local/tomcat/bin/startup.sh
Using CATALINA_BASE:   /usr/local/tomcat
Using CATALINA_HOME:   /usr/local/tomcat
Using CATALINA_TMPDIR: /usr/local/tomcat/temp
Using JRE_HOME:       /usr/local/jdk1.6.0_10

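Important: the startup command above must be run from /home/admin/nutch, otherwise the web application will not find the /home/admin/nutch/crawl directory. (Nutch's searcher.dir property defaults to the relative path crawl, so it is resolved against the directory Tomcat was started from.)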
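One way to avoid depending on whatever directory the shell happens to be in is to make the working directory explicit. A minimal sketch; run_in is a hypothetical helper name, and the subshell keeps the caller's own directory unchanged.

```shell
# Hypothetical helper: run a command with $1 as its working directory.
run_in() {
  dir=$1; shift
  ( cd "$dir" && "$@" )
}

# Usage with the paths from this post:
#   run_in /home/admin/nutch /usr/local/tomcat/bin/startup.sh
```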

Once startup has finished, check the Tomcat log: /usr/local/tomcat/logs/catalina.out
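Rather than reading the whole log, a quick scan for obvious trouble is often enough. The sketch below counts lines containing SEVERE or Exception. Which strings actually appear depends on the Tomcat version and logging setup, so treat the pattern as an assumption. count_log_problems is a hypothetical helper name.

```shell
# Hypothetical helper: print the number of suspicious lines in log file $1.
# grep -c exits 1 when the count is zero, hence "|| true".
count_log_problems() {
  grep -Ec 'SEVERE|Exception' "$1" || true
}

# Usage with the log from this post:
#   count_log_problems /usr/local/tomcat/logs/catalina.out
#   grep -En 'SEVERE|Exception' /usr/local/tomcat/logs/catalina.out | tail -n 20
```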

If everything is working, searches at http://192.168.110.12:8080/nutch-1.0/search.jsp will now return results.

2010-01-12 15:56