Nutch2.1 in eclipse

cosmo1987

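Main goals:
1. Import Nutch 2.1 into Eclipse so the source code can be debugged and we can see how Nutch 2.1 is implemented.
2. Make it easier to learn how to write Nutch 2.1 plugins.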

Prerequisites:
Linux environment
Nutch 2.1
MySQL
Java 1.6
Eclipse

Getting started:
First install JDK 1.6, MySQL, and Eclipse.
Start Eclipse and install IvyDE and Subclipse from the Eclipse Marketplace.
Next, open /etc/my.cnf and add the following under [mysqld]:
innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
Start the MySQL server.
Change the root user's password to root.

Create the database:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;

Create the webpage table:

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;
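
The my.cnf settings and the ROW_FORMAT=COMPRESSED clause above exist because of the primary key on a varchar(767) column. The following is a minimal sketch of the arithmetic, assuming the usual InnoDB limits of 767 bytes per index key by default and 3072 bytes with innodb_large_prefix on a COMPRESSED or DYNAMIC row format. The class and method names are made up for illustration.

```java
// Sketch: worst-case index key size for the `id` varchar(767) primary key.
final class IndexPrefixCheck {
    // Assumed InnoDB limits (not taken from the article):
    // 767 bytes per index key by default, 3072 bytes with
    // innodb_large_prefix and a COMPRESSED or DYNAMIC row format.
    static final int DEFAULT_LIMIT_BYTES = 767;
    static final int LARGE_PREFIX_LIMIT_BYTES = 3072;

    // Worst case: every character uses the charset's maximum width.
    static int maxKeyBytes(int varcharLength, int maxBytesPerChar) {
        return varcharLength * maxBytesPerChar;
    }

    static boolean fits(int keyBytes, int limitBytes) {
        return keyBytes <= limitBytes;
    }

    public static void main(String[] args) {
        int utf8mb4Key = maxKeyBytes(767, 4); // utf8mb4 uses up to 4 bytes per character
        System.out.println("key bytes with utf8mb4: " + utf8mb4Key);
        System.out.println("fits default limit: " + fits(utf8mb4Key, DEFAULT_LIMIT_BYTES));
        System.out.println("fits large-prefix limit: " + fits(utf8mb4Key, LARGE_PREFIX_LIMIT_BYTES));
    }
}
```

Under these assumptions a utf8mb4 key of 767 characters needs 3068 bytes, which is too large for the default limit but just fits the large-prefix limit.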

1. Install Nutch 2.1
File > New > Project > SVN > Checkout Projects from SVN
Create new repository location > https://svn.apache.org/repos/asf/nutch/tags/release-2.1
Check out the source code as a project configured using the New Project Wizard.
Click Finish, then, when prompted, choose Java > Java Project > Next.
Give the project a name (any name will do; I used nutch2.1, which is referred to below), accept the defaults for everything else, and let Eclipse download the Nutch 2.1 source code.

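2. Set up the Nutch 2.1 build environment
In the Project Explorer, right-click the project, choose Properties, and open Java Build Path.
On the Source tab, the only source folder will be nutch2.1/src (nutch2.1 being the project name chosen earlier). Remove this folder > Add Folder > expand src and check src/bin, src/java, src/test & src/testresources.

The src/java and src/test folders of each plugin must also be added by hand. This takes quite a while, but it cannot be skipped.
On the Libraries tab, click Add Class Folder and add nutch2.1/conf to the classpath.
Still on the Libraries tab, Add JARs > src/plugin/urlfilter-automaton/lib/automaton.jar & src/plugin/parse-swf/lib/javaswf.jar

On the Libraries tab, Add Library > IvyDE Managed Dependencies > browse to nutch2.1/ivy/ivy.xml
On the "Order and Export" tab, find the conf folder, select it, and click Top to move it to the top of the list.

That completes the build path. IvyDE will now load the dependency JARs automatically; errors may appear at this point if the network connection is poor.
Even if there are no errors, many files in nutch2.1 will still show red error markers.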

Leave those for now. The next step is to configure the settings used by the build.
In nutch2.1/conf/gora.properties, add:

gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=root

Comment out the other database connection settings.
In ivy/ivy.xml, uncomment the mysql-connector dependency.
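
Before running the build it can help to confirm that the four gora.properties lines above parse as expected. The sketch below uses only java.util.Properties and does not open a database connection, so it does not need the MySQL driver. The class name and helper methods are hypothetical; the SAMPLE text repeats the settings shown above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch: parse gora.properties-style text and sanity-check the SqlStore keys.
final class GoraSettingsCheck {
    // The same four lines as in the article.
    static final String SAMPLE = String.join("\n",
        "gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver",
        "gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true",
        "gora.sqlstore.jdbc.user=root",
        "gora.sqlstore.jdbc.password=root");

    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Extracts the database name from a jdbc:mysql://host:port/db?params URL.
    static String databaseName(String jdbcUrl) {
        int query = jdbcUrl.indexOf('?');
        String base = query < 0 ? jdbcUrl : jdbcUrl.substring(0, query);
        return base.substring(base.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        for (String key : new String[] {"driver", "url", "user", "password"}) {
            String full = "gora.sqlstore.jdbc." + key;
            if (props.getProperty(full) == null) {
                System.out.println("missing: " + full);
            }
        }
        System.out.println("database: " + databaseName(props.getProperty("gora.sqlstore.jdbc.url")));
    }
}
```

The database name extracted from the URL should match the one created with CREATE DATABASE earlier.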

Add the following inside the <configuration> element of conf/nutch-site.xml.template:

<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
  <name>plugin.includes</name>
 <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
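
The plugin.includes value above is a regular expression over plugin directory names. The sketch below shows which names it selects, assuming a directory is included only when the whole name matches the expression (an assumption about how Nutch applies it, not something confirmed here). The class name and the sample list of directories are for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch: test plugin directory names against the plugin.includes expression.
final class PluginIncludesCheck {
    // Copied from the plugin.includes property above.
    static final String PLUGIN_INCLUDES =
        "protocol-httpclient|protocol-http|urlfilter-regex|parse-(html|tika)"
        + "|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic";

    // Returns the directory names that match the expression as a whole.
    static List<String> included(String regex, List<String> pluginDirs) {
        Pattern pattern = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        for (String dir : pluginDirs) {
            if (pattern.matcher(dir).matches()) {
                result.add(dir);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A few names found under src/plugin, chosen for illustration.
        List<String> dirs = Arrays.asList(
            "parse-html", "parse-tika", "parse-js", "parse-swf",
            "protocol-http", "index-basic", "urlfilter-automaton");
        for (String dir : included(PLUGIN_INCLUDES, dirs)) {
            System.out.println("included: " + dir);
        }
    }
}
```

A check like this is also a quick way to decide which plugin source folders are worth adding to the Eclipse build path.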

In build.xml in the project root, find the following target:

<target name="resolve-default" depends="clean-lib, init" description="--> resolve and retrieve dependencies with ivy">
  <ivy:resolve file="${ivy.file}" conf="default" log="download-only" />
  <ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]" symlink="false" log="quiet" />
  <antcall target="copy-libs" />
 </target>

Change the original

pattern="${build.lib.dir}/[artifact]-[revision].[ext]"

to

pattern="${build.lib.dir}/[artifact]-[type]-[revision].[ext]"

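This keeps the build from failing when Ivy retrieves the dependencies again. The reason: Ivy downloads both the class JAR and the source JAR of an artifact, and with the original pattern the two files get the same name and cannot be told apart, which produces a duplicate-file error.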
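
The name clash described above can be reproduced with plain string substitution. This is only a sketch of how the pattern tokens expand, not Ivy's own code; the artifact name, revision, and type values are hypothetical, and the ${build.lib.dir}/ prefix is left out.

```java
// Sketch: expand an Ivy-style retrieve pattern for two artifacts of one module.
final class IvyPatternCollision {
    static final String OLD_PATTERN = "[artifact]-[revision].[ext]";
    static final String NEW_PATTERN = "[artifact]-[type]-[revision].[ext]";

    // Replaces each token with its value; tokens absent from the pattern are ignored.
    static String expand(String pattern, String artifact, String type,
                         String revision, String ext) {
        return pattern
            .replace("[artifact]", artifact)
            .replace("[type]", type)
            .replace("[revision]", revision)
            .replace("[ext]", ext);
    }

    public static void main(String[] args) {
        // Hypothetical module that publishes a class JAR and a source JAR.
        String[] types = {"jar", "source"};
        for (String pattern : new String[] {OLD_PATTERN, NEW_PATTERN}) {
            for (String type : types) {
                System.out.println(pattern + " -> "
                    + expand(pattern, "example-lib", type, "1.0", "jar"));
            }
        }
    }
}
```

With the old pattern both artifacts expand to the same file name; adding [type] makes the two names distinct.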

Once the above is done, run the Ant build from build.xml; it generates the runtime directory.

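3. Set up the debug environment
The code still shows many red error markers, so let us clear them now.

First, look through each plugin directory under runtime/local/plugins for JARs that are not yet on the build path. These are the JARs that the individual plugins need and load through their own ivy files; adding them clears the errors. In the end, four plugins always stay red: parse-js, parse-ext, parse-swf, and parse-zip.
That is a problem in the Nutch project itself: these plugins reference classes from the Nutch 1.x packages, so they cannot compile here, and for some reason the developers have not updated them. They are not needed for basic debugging, so they can simply be removed from the source path.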

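Add a urls folder in the project root and put a seed.txt file in it containing one site URL, for example: http://nutch.apache.org/
Open the Crawler class in the crawl package under src/java and create a Run Configuration for it.
The first tab is already filled in by default.
Switch to the second tab, Arguments,
and enter: urls -depth 3 -topN 5
Finally, click Run to crawl the site's link information.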
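
The seed file can also be created from code instead of by hand. This is a minimal sketch using java.nio.file; the class and method names are hypothetical, and the project root passed in is assumed to be the Eclipse project directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

// Sketch: create urls/seed.txt under a project root with a single seed URL.
final class SeedFileWriter {
    static Path writeSeed(Path projectRoot, String url) {
        try {
            Path dir = Files.createDirectories(projectRoot.resolve("urls"));
            Path seed = dir.resolve("seed.txt");
            // One URL per line; an existing seed.txt is overwritten.
            Files.write(seed, Collections.singletonList(url), StandardCharsets.UTF_8);
            return seed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The project root defaults to the current directory (an assumption).
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Path seed = writeSeed(root, "http://nutch.apache.org/");
        System.out.println("wrote " + seed.toAbsolutePath());
        System.out.println("program arguments for the run configuration: urls -depth 3 -topN 5");
    }
}
```

The first program argument, urls, is the folder this sketch creates, so it has to sit in the working directory of the run configuration.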

References:
http://nlp.solutions.asia/?p=180
http://wiki.apache.org/nutch/RunNutchInEclipse

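Closing remarks:
Nutch's official documentation and source management are disappointing. The official documentation contains a fair number of mistakes or unclear instructions, and it often happens that following it to the letter simply does not work. Then there is the source management: when Nutch moved from 1.x to 2.x, every parse plugin other than parse-html (js, ext, zip, swf, and so on) was replaced by parse-tika, yet these plugins were not removed from the Nutch 2.1 source. If you add them to the build path as they are, the errors never go away. Finally, there is very little documentation for Nutch 2.1, so setting up this environment takes considerable effort. I hope the project will address these problems.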

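Additional notes:
1. How to choose which plugins to add
The plugin.includes property in nutch-site lists the plugins that are enabled, so use it to decide which plugins need to be added by default; when adding source folders you can add only these required plugins. You can also write your own plugins for your Nutch installation; see the documentation on writing plugins.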

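2. How to run on Windows
Most of the steps above are the same on Linux and Windows. (Then why ask for Linux at all?) The catch is that Nutch is built on Hadoop, and Hadoop is designed to run on Linux. Everything works on Linux, but on Windows one problem follows another. A major one is that Linux and Windows path names differ, so the plugin.folders path in the configuration above has to be changed to src/plugin. For MySQL, the counterpart of /etc/my.cnf on Linux is the my.ini file in the MySQL installation directory on Windows, so the settings added to /etc/my.cnf earlier go into my.ini instead. In addition, Hadoop uses Linux-style paths by default, so the Hadoop source has to be modified. You can download a hadoop-1.0.2-modified.jar from the web and use it in place of the hadoop-1.0.3 that Ivy downloads; this JAR works around the Linux-specific path handling and can be used on Windows. With these changes, Nutch 2.1 can be run on Windows as well.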

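3. MySQL will not restart after the parameters are copied into /etc/my.cnf
In my case this was because one of the pasted parameters already existed in the file. Check whether any of the parameters is already set in the configuration file and, if so, remove the extra entry. The one I ran into was character-set-server, which was already set to utf8; changing it to utf8mb4 fixed the problem.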
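
Looking for repeated keys by eye is error-prone, so the check can be sketched in code. The class below lists keys that appear more than once inside one section of my.cnf-style text. It treats dashes and underscores in option names as equivalent, the way MySQL does, and it does not decide which entry should be kept. The class name and the sample text are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Sketch: find option names that occur more than once in one my.cnf section.
final class MyCnfDuplicateCheck {
    static Set<String> duplicateKeys(String cnfText, String section) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        boolean inSection = false;
        for (String raw : cnfText.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith(";")) {
                continue; // blank line or comment
            }
            if (line.startsWith("[")) {
                inSection = line.equalsIgnoreCase("[" + section + "]");
                continue;
            }
            if (!inSection) {
                continue;
            }
            int eq = line.indexOf('=');
            String key = (eq < 0 ? line : line.substring(0, eq)).trim();
            // Normalize so innodb_file_format and innodb-file-format compare equal.
            key = key.replace('_', '-').toLowerCase(Locale.ROOT);
            if (!seen.add(key)) {
                duplicates.add(key);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "[mysqld]",
            "character-set-server=utf8",
            "innodb_file_per_table=true",
            "character-set-server=utf8mb4");
        System.out.println("repeated keys: " + duplicateKeys(sample, "mysqld"));
    }
}
```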

2013-03-10 00:22

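Comment 1: neptunecai 2014-09-29

Could you go into more detail on the section "2. Set up the Nutch 2.1 build environment"?
For example: The only Source folder will be nutch2.1 (the name chosen earlier) /src > Remove this folder > Add Folder > expand trunk/src and check src/bin, src/java, src/test & src/testresources.
I don't understand what this means. Could you add screenshots?

For example: We must add the plugin src/java and src/test folders by hand; this takes a lot of time, but it has to be done.
On the Libraries tab, click Add Class Folder and add nutch2.1/conf to the classpath.
Still on the Libraries tab, add JARs > src/plugin/urlfilter-automaton/lib/automaton.jar & src/plugin/parse-swf/lib/javaswf.jar
Could you post some screenshots of this? Thanks.

For example: On the Libraries tab, Add Library > IvyDE Managed Dependencies > browse to nutch2.1/ivy/ivy.xml
On the "Order and Export" tab, find src/conf, select it, and click Top to move it to the top.
Could you add some screenshots? Thanks.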
