
Running Nutch 1.0 in Eclipse


Run Nutch 1.0 in Eclipse on Linux and Windows

Tested with

·         Nutch release 1.0

·         Eclipse 3.3

·         Java 1.6

·         Ubuntu (should work on most platforms though)

·         Windows XP

Steps

For Windows Users

If you are running Windows (tested on Windows XP), you must first install Cygwin.

Download Cygwin from http://www.cygwin.com/setup.exe

Installation guides for Cygwin are easy to find online, so I will omit those steps here.

After installing Cygwin, follow the rest of these steps.

Install Nutch

·         Grab a fresh release of Nutch 1.0 - http://lucene.apache.org/nutch/version_control.html

·         Set NUTCH_HOME (the directory where you placed the Nutch 1.0 download) as an environment variable.

·         Set NUTCH_JAVA_HOME (the JDK 1.6 installation directory) as an environment variable.

·         Do not build Nutch yet. Make sure there are no .project or .classpath files in the Nutch directory.
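The checks in the list above can be automated before opening Eclipse. The following is a minimal sketch, not part of Nutch: the class name NutchInstallCheck and its messages are made up for illustration, and it only inspects the two environment variables and the two Eclipse metadata files mentioned above.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not shipped with Nutch: verifies the preconditions listed above.
final class NutchInstallCheck {

    /** Returns the problems found in the given Nutch directory; an empty list means none. */
    static List<String> problems(File nutchHome) {
        List<String> found = new ArrayList<String>();
        if (!nutchHome.isDirectory()) {
            found.add("not a directory: " + nutchHome);
            return found;
        }
        // The guide asks that these Eclipse metadata files be absent before the project is created.
        for (String name : new String[] {".project", ".classpath"}) {
            if (new File(nutchHome, name).exists()) {
                found.add("leftover Eclipse file: " + name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String home = System.getenv("NUTCH_HOME");
        if (home == null) {
            System.out.println("NUTCH_HOME is not set");
            return;
        }
        if (System.getenv("NUTCH_JAVA_HOME") == null) {
            System.out.println("NUTCH_JAVA_HOME is not set");
        }
        List<String> found = problems(new File(home));
        for (String problem : found) {
            System.out.println(problem);
        }
        if (found.isEmpty()) {
            System.out.println("No problems found in " + home);
        }
    }
}
```

Run it from a shell in which the environment variables are already set; a shell opened before the variables were defined will still report them as missing.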

Create a new java project in Eclipse

·         File > New > Project > Java project > click Next

·         Name the project (Nutch for instance)

·         Select "Create project from existing source" and use the location where you downloaded nutch-1.0

·         Click on Next, and wait while Eclipse is scanning the folders

·         Add the folder "conf" to the classpath (on the third tab, Libraries, click "Add Class Folder")

·         Go to the "Order and Export" tab, find the entry for the added "conf" folder and move it to the top.

·         Eclipse should have guessed all the Java files that must be added to your classpath. If that is not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries.

·         Set output dir to "tmp_build", create it if necessary

·         DO NOT add "build" to classpath

Configure Nutch

1.    Open the $NUTCH_HOME/conf/nutch-site.xml file and add the following content to it:

 

<configuration>
        <property>
                <name>http.agent.name</name>
                <value>my nutch agent</value>
        </property>

        <property>
                <name>http.agent.version</name>
                <value>1.0</value>
        </property>

        <property>
                <name>plugin.folders</name>
                <value>E:/nutch-1.0/src/plugin</value>
        </property>
</configuration>

 

Note: Here I set the value of "plugin.folders" to an absolute path; you can also use a relative path.

2. Optionally, you may also set the http.agent.url and http.agent.email properties.

3. Make sure Nutch is configured correctly before testing it in Eclipse.
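One way to sanity-check the configuration file is to read it back and confirm that the properties are where you expect them. The sketch below uses only the JDK's XML parser; the class NutchSiteCheck and its methods are hypothetical helpers rather than Nutch APIs, and the check is limited to the properties shown in the snippet above.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Nutch API: reads <property> entries from nutch-site.xml.
final class NutchSiteCheck {

    /** Reads every name/value pair found in <property> elements of the document. */
    static Map<String, String> readProperties(InputStream in) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            NodeList props = doc.getElementsByTagName("property");
            Map<String, String> out = new LinkedHashMap<String, String>();
            for (int i = 0; i < props.getLength(); i++) {
                Element property = (Element) props.item(i);
                out.put(text(property, "name"), text(property, "value"));
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a readable configuration file", e);
        }
    }

    /** Trimmed text of the first child element with the given tag, or "" if there is none. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** True when http.agent.name is present and non-empty. */
    static boolean hasAgentName(Map<String, String> props) {
        String value = props.get("http.agent.name");
        return value != null && value.length() > 0;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the file, for example $NUTCH_HOME/conf/nutch-site.xml
        InputStream in = new FileInputStream(args[0]);
        try {
            Map<String, String> props = readProperties(in);
            System.out.println("http.agent.name set: " + hasAgentName(props));
            System.out.println("plugin.folders = " + props.get("plugin.folders"));
        } finally {
            in.close();
        }
    }
}
```

This only confirms that the file is well-formed and that the values are present; it does not verify that the plugin directory actually exists or that Nutch accepts the values.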

 

Missing org.farng and com.etranslate

Eclipse will complain about some import statements in parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code.

Download them here:

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder > Build Path > Configure Build Path... Then select the Libraries tab, click "Add Jars..." and then add each .jar file individually).

Build Nutch

If you set up the project correctly, Eclipse will build Nutch for you into "tmp_build". See below for problems you could run into.

Create Eclipse launcher

Open the menu Run > Open Run Dialog..., choose the right project name, and set the main class to

org.apache.nutch.crawl.Crawl

On the Arguments tab, set Program arguments to

urls -dir crawl -depth 3 -topN 50 -threads 10

Here "urls" is the directory that holds the list of web pages we want to crawl.

·         -dir dir names the directory to put the crawl in.

·         -threads threads determines the number of threads that will fetch in parallel.

·         -depth depth indicates the link depth from the root page that should be crawled.

·         -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

 

In VM arguments, enter

-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
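The "urls" directory named in the program arguments has to exist and contain at least one text file listing seed URLs, one per line, before the launcher is run. The sketch below creates it; the class SeedListWriter, the file name seed.txt and the example URL are assumptions chosen for illustration, so substitute your own seed pages.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper, not part of Nutch: prepares the seed directory passed to the crawler.
final class SeedListWriter {

    /** Builds the seed file content: one URL per line. */
    static String content(String... seeds) {
        StringBuilder sb = new StringBuilder();
        for (String seed : seeds) {
            sb.append(seed.trim()).append('\n');
        }
        return sb.toString();
    }

    /** Creates urlsDir if needed and writes the seeds to urlsDir/seed.txt (the file name is arbitrary). */
    static File write(File urlsDir, String... seeds) throws IOException {
        if (!urlsDir.isDirectory() && !urlsDir.mkdirs()) {
            throw new IOException("cannot create " + urlsDir);
        }
        File seedFile = new File(urlsDir, "seed.txt");
        Writer out = new FileWriter(seedFile);
        try {
            out.write(content(seeds));
        } finally {
            out.close();
        }
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // The example URL is a placeholder; list the pages you actually want to crawl.
        File written = write(new File("urls"), "http://lucene.apache.org/nutch/");
        System.out.println("Wrote " + written.getPath());
    }
}
```

The relative path "urls" resolves against the working directory of the launch configuration, which in Eclipse defaults to the project directory, so run this helper from there or pass an absolute path.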

Java Heap Size problem

If you find lines similar to these in hadoop.log:

2009-05-09 14:03:09,640 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space

You should increase the amount of memory available to applications launched from Eclipse.

Set it in:

Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments

I've set mine to

-Xms5m -Xmx150m

-Xms sets the initial (minimum) heap size for applications launched from Eclipse, and -Xmx sets the maximum.
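To confirm that the new heap limit is actually picked up, you can launch a tiny class from the same project and print what the JVM reports. This is a minimal sketch; the class name HeapCheck is made up, and the reported figure may come out slightly lower than the -Xmx value, depending on the JVM.

```java
// Hypothetical helper: prints the maximum heap the current JVM is allowed to use.
final class HeapCheck {

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    /** Maximum heap size of the running JVM in megabytes, as reported by Runtime.maxMemory(). */
    static long maxHeapMegabytes() {
        return toMegabytes(Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

If the printed value does not reflect your setting, check that the launch configuration uses the JRE whose default VM arguments you edited.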

 

References:

http://wiki.apache.org/nutch/RunNutchInEclipse0.9

http://wiki.apache.org/nutch/NutchTutorial

 

 
