
Setting up Nutch in Eclipse on Windows


Reposted from the wiki: Run Nutch In Eclipse on Linux and Windows (Nutch version 1.0)

 

 


 

Tested with
Nutch release 1.0
Eclipse 3.3 (Europa) and 3.4 (Ganymede)
Java 1.6
Ubuntu (should work on most platforms though)
Windows XP and Vista
 

Before you start
Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. Sometimes examining the logs (logs/hadoop.log) is a quicker way to debug a problem.

 

Steps
 

For Windows Users
If you are running Windows (tested on Windows XP) you must first install cygwin. Download it from
http://www.cygwin.com/setup.exe

Install cygwin and set the PATH environment variable for it. You can set it from the Control Panel, System, Advanced Tab, Environment Variables and edit/add PATH.

Example PATH:

 

C:\Sun\SDK\bin;C:\cygwin\bin

If you run "bash" from the Windows command line (Start > Run... > cmd.exe) it should successfully run cygwin.
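The PATH check above can also be done from Java, which is handy because Eclipse only sees the PATH it was started with. The following is a minimal sketch, not part of Nutch: the class name BashOnPath and the method findOnPath are hypothetical names chosen for illustration, and the sketch only looks for a file called bash.exe or bash in each PATH entry.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): looks for bash in the PATH entries.
public class BashOnPath {

    /** Returns the first existing file among fileNames found in any directory of pathValue. */
    public static Optional<Path> findOnPath(String pathValue, String separator, String... fileNames) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(separator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : fileNames) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths (for example, quoted entries).
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Uses the PATH of the current process, i.e. what Eclipse passes to launched programs.
        Optional<Path> bash = findOnPath(System.getenv("PATH"), File.pathSeparator, "bash.exe", "bash");
        System.out.println(bash.map(p -> "bash found at " + p)
                .orElse("bash not found on PATH; check the cygwin bin directory entry"));
    }
}
```

Running it as a Java Application inside Eclipse shows whether the launched JVM can see the cygwin bin directory, which may differ from what a freshly opened command prompt sees.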

If you are running Eclipse on Vista, you will need to either give cygwin administrative privileges or turn off Vista's User Account Control (UAC). Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler:

 

org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of ... Permission denied

See this for more information about the UAC issue.

 

Install Nutch
Grab a fresh copy of the Nutch 1.0 source, or download and untar the official 1.0 release.

Do not build Nutch yet. Make sure you have no .project and .classpath files in the Nutch directory.
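To double-check that the Nutch directory is clean before creating the project, a small check like the one below may help. It is only a sketch: the class name StaleEclipseFiles and the method findStaleFiles are hypothetical, and the Nutch directory is taken from the first command-line argument (defaulting to the current directory).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Nutch): reports leftover Eclipse metadata files.
public class StaleEclipseFiles {

    /** Returns the .project / .classpath files that exist directly under nutchDir. */
    public static List<Path> findStaleFiles(Path nutchDir) {
        List<Path> found = new ArrayList<>();
        for (String name : new String[] {".project", ".classpath"}) {
            Path candidate = nutchDir.resolve(name);
            if (Files.exists(candidate)) {
                found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Path nutchDir = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> stale = findStaleFiles(nutchDir);
        if (stale.isEmpty()) {
            System.out.println("No .project or .classpath found in " + nutchDir.toAbsolutePath());
        } else {
            System.out.println("Remove these before creating the Eclipse project: " + stale);
        }
    }
}
```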
 

Create a new Java Project in Eclipse
File > New > Project > Java project > click Next

Name the project (Nutch_Trunk for instance)
Select "Create project from existing source" and use the location where you downloaded Nutch
Click on Next, and wait while Eclipse is scanning the folders
Add the folder "conf" to the classpath (Right-click on the project, select "properties" then "Java Build Path" tab (left menu) and then the "Libraries" tab. Click "Add Class Folder..." button, and select "conf" from the list)
Go to "Order and Export" tab, find the entry for added "conf" folder and move it to the top (by checking it and clicking the "Top" button). This is required so Eclipse will take config (nutch-default.xml, nutch-final.xml, etc.) resources from our "conf" folder and not from somewhere else.
Eclipse should have guessed all the Java files that must be added to your classpath. If that's not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
Click the "Source" tab and set the default output folder to "Nutch_Trunk/bin/tmp_build". (You may need to create the tmp_build folder.)
Click the "Finish" button
DO NOT add "build" to classpath
 

Configure Nutch
See the Tutorial

Change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-default.xml
Make sure Nutch is configured correctly before testing it in Eclipse
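To confirm which value a given configuration file actually holds for plugin.folders, the file can be read with the JDK's XML parser. This is a minimal sketch under stated assumptions: the class name PluginFoldersCheck and the method readProperty are hypothetical, the default path conf/nutch-default.xml is relative to the working directory, and the sketch reads a single file only. It does not reproduce how Nutch and Hadoop merge several configuration files.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Nutch): reads one property from a Hadoop-style XML config file.
public class PluginFoldersCheck {

    /** Returns the trimmed <value> of the first <property> whose <name> matches, or null if absent. */
    public static String readProperty(InputStream xml, String propertyName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList properties = doc.getElementsByTagName("property");
        for (int i = 0; i < properties.getLength(); i++) {
            Element property = (Element) properties.item(i);
            NodeList names = property.getElementsByTagName("name");
            NodeList values = property.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && propertyName.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Assumed default location; pass another file (e.g. conf/nutch-site.xml) as the first argument.
        String file = args.length > 0 ? args[0] : "conf/nutch-default.xml";
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            System.out.println("plugin.folders in " + file + " = " + readProperty(in, "plugin.folders"));
        }
    }
}
```

Because the value is a relative path here, it is resolved against the working directory of the launch, which for an Eclipse launcher is normally the project directory.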

 

Missing org.farng and com.etranslate
Eclipse will complain about some import statements in parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code.

Download them here:

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/

http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/

Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path: first refresh the workspace by pressing F5, then right-click the project folder > Build Path > Configure Build Path... Select the Libraries tab, click "Add Jars..." and add each .jar file individually. If that does not work, you may try clicking "Add External JARs" and then pointing to the two directories above.

 

Two Errors with RTFParseFactory
If you are trying to build the official 1.0 release, Eclipse will complain about 2 errors regarding the RTFParseFactory (this is after adding the RTF jar file from the previous step). This problem was fixed (see NUTCH-644 and NUTCH-705 ) but was not included in the 1.0 official release because of licensing issues. So you will need to manually alter the code to remove these 2 build errors.

In RTFParseFactory.java:

Add the following import statement: import org.apache.nutch.parse.ParseResult;

Change

public Parse getParse(Content content) {

to

public ParseResult getParse(Content content) {

In the getParse function, replace
 

return new ParseStatus(ParseStatus.FAILED,
                       ParseStatus.FAILED_EXCEPTION,
                       e.toString()).getEmptyParse(conf);

with

return new ParseStatus(ParseStatus.FAILED,
                       ParseStatus.FAILED_EXCEPTION,
                       e.toString()).getEmptyParseResult(content.getUrl(), getConf());

In the getParse function, replace
 

return new ParseImpl(text,
                     new ParseData(ParseStatus.STATUS_SUCCESS,
                                   title,
                                   OutlinkExtractor.getOutlinks(text, this.conf),
                                   content.getMetadata(),
                                   metadata));

with

return ParseResult.createParseResult(content.getUrl(),
                                     new ParseImpl(text,
                                                   new ParseData(ParseStatus.STATUS_SUCCESS,
                                                                 title,
                                                                 OutlinkExtractor.getOutlinks(text, this.conf),
                                                                 content.getMetadata(),
                                                                 metadata)));

In TestRTFParser.java, replace

 

parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);

with

parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);

Once you have made these changes and saved the files, Eclipse should build with no errors.

 

Build Nutch
If you set up the project correctly, Eclipse will build Nutch for you into "tmp_build". See below for problems you could run into.

 

Create Eclipse launcher
Menu Run > "Run..."

create "New" for "Java Application"
set in Main class
 

org.apache.nutch.crawl.Crawl

on tab Arguments, Program Arguments

urls -dir crawl -depth 3 -topN 50

in VM arguments

-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log

click on "Run"
if all works, you should see Nutch getting busy at crawling
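The first program argument above, urls, names the directory that holds the seed URL list, so it has to exist in the launcher's working directory before the crawl starts. The sketch below creates it with plain JDK calls. Everything specific in it is an assumption for illustration: the class name SeedDirSetup, the file name seed.txt and the placeholder URL http://example.com/ are not from the original page, and the URL filter settings covered by the Tutorial still have to allow whatever site you list.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Nutch): prepares the "urls" seed directory used by the launcher.
public class SeedDirSetup {

    /** Creates workingDir/urls and writes one seed URL per line into a flat text file. */
    public static Path writeSeedFile(Path workingDir, List<String> seedUrls) throws IOException {
        Path urlsDir = workingDir.resolve("urls"); // matches the first program argument
        Files.createDirectories(urlsDir);
        Path seedFile = urlsDir.resolve("seed.txt"); // assumed file name
        Files.write(seedFile, seedUrls, StandardCharsets.UTF_8);
        return seedFile;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder seed; replace it with the site you actually want to crawl.
        Path seedFile = writeSeedFile(Paths.get("."), Arrays.asList("http://example.com/"));
        System.out.println("Wrote seed list to " + seedFile.toAbsolutePath());
    }
}
```

Run it once from the project directory (the default working directory of an Eclipse launcher) so that the relative path urls resolves to the same place the crawl will look.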

 

Debug Nutch in Eclipse (not yet tested for 0.9)
Set breakpoints and debug a crawl
It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints:
 

Fetcher [line: 371] - run
Fetcher [line: 438] - fetch
Fetcher$FetcherThread [line: 149] - run()
Generator [line: 281] - generate
Generator$Selector [line: 119] - map
OutlinkExtractor [line: 111] - getOutlinks

If things do not work...
Yes, Nutch and Eclipse can be a difficult companionship sometimes 

 

Java Heap Size problem
If the crawler throws an IOException early in the crawl (Exception in thread "main" java.io.IOException: Job failed!), check the logs/hadoop.log file for further information. If you find lines similar to this in hadoop.log:

 

2009-04-13 13:41:06,105 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space

then you should increase the amount of RAM available to applications run from Eclipse.

Just set it in:

Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments

I've set mine to

-Xms5m -Xmx150m

because I have about 200MB of RAM left after running all my other applications.

-Xms sets the initial (minimum) heap size for applications run from Eclipse; -Xmx sets the maximum.
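To verify that the VM arguments actually reach the launched program, the maximum heap can be printed from inside the JVM. This is a small sketch using only the standard Runtime API; the class name HeapCheck and the method maxHeapMegabytes are hypothetical names for illustration.

```java
// Hypothetical helper (not part of Nutch): prints the heap limit the current JVM was started with.
public class HeapCheck {

    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same JRE and launcher settings as the crawl. The reported figure is usually close to, but not necessarily identical to, the -Xmx value, since the JVM may reserve part of the heap for its own use.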

 

Eclipse: Cannot create project content in workspace
The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) into my workspace. When I tried to create the project from the existing code, Eclipse would not let me create it from source located inside the workspace. With the source code outside my workspace, it works fine.

 

plugin dir not found
Make sure you set your plugin.folders property correctly. Instead of a relative path you can also use an absolute one, either in nutch-default.xml or, perhaps better, in nutch-site.xml:

 

<property>
  <name>plugin.folders</name>
  <value>/home/....../nutch-0.9/src/plugin</value>
</property>

No plugins loaded during unit tests in Eclipse
During unit testing, Eclipse ignored conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.

 

NOTE: Additional note for people who want to run Eclipse with the latest Nutch code
If you are getting the following exception: org.apache.nutch.plugin.PluginRuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer

Execute 'ant job' (which is the default) after downloading Nutch through SVN
Update "plugin.folders" (in nutch-default.xml) to build/plugins (where ant builds plugins)
If it still fails, increase your memory allocation or find a simpler website to crawl.
 

Unit tests work in eclipse but fail when running ant in the command line
Suppose your unit tests work perfectly in Eclipse, but every one of them fails when running ant test on the command line, including the ones you haven't modified. Check whether you defined the plugin.folders property in hadoop-site.xml. If so, try removing it from that file and adding it directly to nutch-site.xml

Run ant test again. That should solve the problem.

If it doesn't, are you testing a plugin? If so, did you add the plugin to the list of packages in plugin\build.xml, on the test target?

 

classNotFound
Open the class itself and right-click it
Refresh the build dir
 

debugging hadoop classes
Sometimes it makes sense to also have the Hadoop classes available during debugging. To do so, you can check out the Hadoop sources on your machine and attach the sources to the hadoop-xxx.jar. Alternatively, you can:
Remove the hadoopXXX.jar from your classpath libraries
Check out the Hadoop branch that is used within Nutch
Configure a Hadoop project similar to the Nutch project within your Eclipse
Add the Hadoop project as a dependent project of the Nutch project
You can now also set breakpoints within Hadoop classes like InputFormat implementations etc.
 

Failed to get the current user's information
On Windows, if the crawler throws an exception complaining it "Failed to get the current user's information" or 'Login failed: Cannot run program "bash"', it is likely you forgot to set the PATH to point to cygwin. Open a new command line window (All Programs > Accessories > Command Prompt) and type "bash". This should start cygwin. If it doesn't, type "path" to see your path. You should see within the path the cygwin bin directory (e.g., C:\cygwin\bin). See the steps to adding this to your PATH at the top of the article under "For Windows Users". After setting the PATH, you will likely need to restart Eclipse so it will use the new PATH.

Original credits: RenaudRichardet

 

This article comes from a CSDN blog; please cite the source when reposting: http://blog.csdn.net/DareZhang/archive/2010/03/25/5412808.aspx
