(1) Nutch 1.0 Installation

zhouxianglh

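1 Configuring Nutch 1.0
Prerequisites:
1.1. Download Nutch 1.0. A mirror site in China is recommended:
http://labs.xiaonei.com/apache-mirror/lucene/nutch/
1.2. Add an environment variable NUTCH_JAVA_HOME that points to the JRE installation path.
1.3. Set up a Linux-like environment on Windows. Cygwin is used here; download it from:
http://www.cygwin.com/setup.exe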
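
If the variable is set from the Cygwin shell rather than through the Windows system settings, step 1.2 could look like the following sketch. The JRE path is a hypothetical example, not a value from this article; substitute the real installation directory.

```shell
# Hypothetical JRE location (an assumption); replace with the real install path.
export NUTCH_JAVA_HOME="/cygdrive/c/Program Files/Java/jre6"

# Print the value to confirm that the shell session can see it.
echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```

Adding the export line to ~/.bashrc should make the setting persist across Cygwin sessions.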
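Configuration steps
1.4. Extract Nutch 1.0, for example to C:\nutch.
1.5. Under C:\nutch, create a directory urls\ and, inside it, a file nutch.txt (any file name will do). Write the addresses of the sites to crawl into nutch.txt, one per line, for example: http://www.google.com/ Note: the last line must end with a newline.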
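
As a rough sketch, the seed file from step 1.5 can be created from the Cygwin shell as shown below. A scratch directory stands in for C:\nutch so the snippet can be tried safely, and the seed is assumed to be written as a full URL with the http:// scheme so that it can match the filter configured in the next step.

```shell
# Scratch directory standing in for C:\nutch (/cygdrive/c/nutch under Cygwin).
NUTCH_DIR="$(mktemp -d)"

# Create the urls directory that will hold the seed list.
mkdir -p "$NUTCH_DIR/urls"

# printf '%s\n' writes the URL followed by a newline, so the last line
# of the file is properly terminated.
printf '%s\n' 'http://www.google.com/' > "$NUTCH_DIR/urls/nutch.txt"

# Show the file contents for a quick visual check.
cat "$NUTCH_DIR/urls/nutch.txt"
```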
1.6. Open C:\nutch\conf\crawl-urlfilter.txt
Find:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

Change it to:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*google.com/
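
To get a feel for what this rule accepts, the pattern can be tried against a few URLs with grep. This is only an approximation: Nutch evaluates the rule with Java regular expressions, while grep -E uses POSIX extended syntax, although this particular pattern should behave the same in both. The leading + in the filter file marks the rule as an include rule and is not part of the expression. The `check_url` helper and the sample URLs are made up for illustration.

```shell
# The expression from crawl-urlfilter.txt, without the leading '+'.
pattern='^http://([a-z0-9]*\.)*google.com/'

# Hypothetical helper: prints "accept" if the URL matches, "reject" otherwise.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo accept
  else
    echo reject
  fi
}

check_url 'http://www.google.com/'    # accept
check_url 'http://google.com/search'  # accept (zero subdomain labels)
check_url 'http://www.example.com/'   # reject
```

Note that the dot in google.com is unescaped, so it matches any character; writing google\.com would be stricter.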

1.7. Open C:\nutch\conf\nutch-site.xml

Change it to:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- HTTP properties -->
<property>
<name>http.agent.name</name>
<value>*</value>
<description></description>
</property>
<!-- file properties -->
<property>
<name>searcher.dir</name>
<!--  path to the index files  -->
<value>C:\nutch\localweb</value>
<description></description>
</property>
</configuration>

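1.8. Start Cygwin, change to the bin directory under the Nutch directory, and run:
$ sh nutch crawl ../urls -dir ../localweb -depth 2 -threads 20
Parameters:
         crawl: tells Nutch to run the main method of its crawl class.
         ../urls: the directory containing the seed URL file (nutch.txt in this example).
         -dir ../localweb: where the crawl output is saved.
         -depth 2: the crawl depth, that is, the number of link levels followed from the seed pages.
         -threads 20: the number of concurrent fetcher threads (threads, not processes).
         -topN: the maximum number of pages fetched at each depth level; it is not set in the command above.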
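
The command in step 1.8 does not pass -topN. A minimal sketch of a variant that does is shown below; the value 50 is an arbitrary, hypothetical cap and not a recommendation from this article. The snippet only builds and prints the command line so that it can be inspected first.

```shell
# Values taken from the command in step 1.8.
DEPTH=2
THREADS=20
# Hypothetical cap on pages fetched per level; the article does not set -topN.
TOPN=50

# Build the command line as a string so it can be reviewed before running.
CRAWL_CMD="sh nutch crawl ../urls -dir ../localweb -depth $DEPTH -threads $THREADS -topN $TOPN"
echo "$CRAWL_CMD"
```

Running the printed line from the Nutch bin directory starts the crawl.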
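1.9. Find C:\nutch\nutch-1.0.war and copy it into the webapps directory of the Tomcat installation.
1.10. Once Tomcat has unpacked the WAR, stop Tomcat and edit nutch-1.0\WEB-INF\classes\nutch-site.xml under webapps so that it reads: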

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- HTTP properties -->
<property>
<name>http.agent.name</name>
<value>*</value>
<description></description>
</property>
<!-- file properties -->
<property>
<name>searcher.dir</name>
<!--  path to the index files  -->
<value>C:\nutch\localweb</value>
<description></description>
</property>
</configuration>

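1.11. To avoid garbled Chinese characters, edit the configuration file conf\server.xml under the Tomcat installation path.

Find the <Connector element and add the attributes URIEncoding="UTF-8" useBodyEncodingForURI="true".
The result looks like this: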

<!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
    <Connector  URIEncoding="UTF-8" useBodyEncodingForURI="true".......

1.12. Start Tomcat and open http://127.0.0.1:8080/nutch-1.0/ in a browser.

The above is based on http://hi.baidu.com/doingwell/blog/item/6667d24efcead000b3de058b.html

2010-03-29 11:35