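Following the official documentation, and after some effort, I finally got Nutch running in Eclipse :). The guide is really good and very detailed; I am not sure I would have the patience to write an article like this myself :)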
This is a work in progress. If you find errors or would like to improve this page, just create an account [UserPreferences] and start editing this page
Tested with
Nutch release 1.0
Eclipse 3.3 (Europa) and 3.4 (Ganymede)
Java 1.6
Ubuntu (should work on most platforms though)
Windows XP and Vista
Before you start
Setting up Nutch to run in Eclipse can be tricky, and most of the time it is much faster to edit Nutch in Eclipse but run the scripts from the command line (my 2 cents). However, it's very useful to be able to debug Nutch in Eclipse. Even then, examining the logs (logs/hadoop.log) is sometimes the quicker way to track down a problem.
Steps
For Windows Users
If you are running Windows (tested on Windows XP) you must first install cygwin. Download it from http://www.cygwin.com/setup.exe
Install cygwin and set the PATH environment variable for it. You can set it from the Control Panel, System, Advanced Tab, Environment Variables and edit/add PATH.
Example PATH:
C:\Sun\SDK\bin;C:\cygwin\bin
If you run "bash" from the Windows command line (Start > Run... > cmd.exe) it should successfully run cygwin.
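The same check can be made from Java, which is closer to what happens when Hadoop later tries to start "bash" from inside Eclipse. The following is only a minimal sketch, assuming Java 8 or later; the class name BashOnPath and the method findBash are made up for this illustration and are not part of Nutch or Hadoop.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: looks for bash on a PATH-style string.
final class BashOnPath {

    /** Returns the first file named bash or bash.exe found in the given PATH value. */
    static Optional<Path> findBash(String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : new String[] {"bash", "bash.exe"}) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths (for example, quoted entries).
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> bash = findBash(System.getenv("PATH"));
        System.out.println(bash.map(p -> "bash found at " + p)
                               .orElse("bash not found on PATH"));
    }
}
```

Run it as a Java Application from Eclipse rather than from a fresh command prompt: Eclipse only sees the PATH it was started with, so this shows what the crawler will actually get.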
If you are running Eclipse on Vista, you will need to either give cygwin administrative privileges or turn off Vista's User Account Control (UAC). Otherwise Hadoop will likely complain that it cannot change a directory permission when you later run the crawler:
org.apache.hadoop.util.Shell$ExitCodeException: chmod: changing permissions of ... Permission denied
See this for more information about the UAC issue.
Install Nutch
Grab a fresh release of Nutch 1.0 or download and untar the official 1.0 release.
Do not build Nutch yet. Make sure you have no .project and .classpath files in the Nutch directory
Create a new Java Project in Eclipse
File > New > Project > Java project > click Next
Name the project (Nutch_Trunk for instance)
Select "Create project from existing source" and use the location where you downloaded Nutch
Click on Next, and wait while Eclipse is scanning the folders
Add the folder "conf" to the classpath (click the "Libraries" tab, click "Add Class Folder..." button, and select "conf" from the list)
Go to the "Order and Export" tab, find the entry for the added "conf" folder and move it to the top (by checking it and clicking the "Top" button). This is required so that Eclipse takes the config resources (nutch-default.xml, nutch-site.xml, etc.) from our "conf" folder and not from somewhere else.
Eclipse should have guessed all the Java files that must be added to your classpath. If that's not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
Click the "Source" tab and set the default output folder to "Nutch_Trunk/bin/tmp_build". (You may need to create the tmp_build folder.)
Click the "Finish" button
DO NOT add "build" to classpath
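To confirm that the "conf" folder really wins on the classpath, you can ask the class loader where it finds the configuration files. This is a small sketch, not Nutch code: the class name ConfLocator and the method locate are invented for the example, and it only uses the standard ClassLoader.getResource call.

```java
import java.net.URL;

// Hypothetical diagnostic, not part of Nutch: shows which copy of a
// configuration file the classpath resolves to.
final class ConfLocator {

    /** Returns the URL of the first matching classpath resource, or "not found". */
    static String locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        URL url = loader.getResource(resourceName);
        return url == null ? "not found" : url.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"nutch-default.xml", "nutch-site.xml"}) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the printed location is not inside your "conf" folder, revisit the "Order and Export" tab. Running the same class as a JUnit launch or from the test source folder can also show which nutch-site.xml the unit tests pick up.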
Configure Nutch
See the Tutorial
Change the property "plugin.folders" to "./src/plugin" in $NUTCH_HOME/conf/nutch-default.xml
Make sure Nutch is configured correctly before testing it in Eclipse
Missing org.farng and com.etranslate
Eclipse will complain about some import statements in parse-mp3 and parse-rtf plugins (30 errors in my case). Because of incompatibility with the Apache license, the .jar files that define the necessary classes were not included with the source code.
Download them here:
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/
http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/
Copy the jar files into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively. Then add the jar files to the build path (First refresh the workspace by pressing F5. Then right-click the project folder > Build Path > Configure Build Path... Then select the Libraries tab, click "Add Jars..." and then add each .jar file individually).
Two Errors with RTFParseFactory
If you are trying to build the official 1.0 release, Eclipse will complain about 2 errors regarding the RTFParseFactory (this is after adding the RTF jar file from the previous step). This problem was fixed (see NUTCH-644 and NUTCH-705) but was not included in the 1.0 official release because of licensing issues. So you will need to manually alter the code to remove these 2 build errors.
In RTFParseFactory.java:
Add the following import statement:
    import org.apache.nutch.parse.ParseResult;
Change
    public Parse getParse(Content content) {
to
    public ParseResult getParse(Content content) {
In the getParse function, replace
    return new ParseStatus(ParseStatus.FAILED,
                           ParseStatus.FAILED_EXCEPTION,
                           e.toString()).getEmptyParse(conf);
with
    return new ParseStatus(ParseStatus.FAILED,
                           ParseStatus.FAILED_EXCEPTION,
                           e.toString()).getEmptyParseResult(content.getUrl(), getConf());
In the getParse function, replace
    return new ParseImpl(text,
                         new ParseData(ParseStatus.STATUS_SUCCESS,
                                       title,
                                       OutlinkExtractor.getOutlinks(text, this.conf),
                                       content.getMetadata(),
                                       metadata));
with
    return ParseResult.createParseResult(content.getUrl(),
                                         new ParseImpl(text,
                                                       new ParseData(ParseStatus.STATUS_SUCCESS,
                                                                     title,
                                                                     OutlinkExtractor.getOutlinks(text, this.conf),
                                                                     content.getMetadata(),
                                                                     metadata)));
In TestRTFParser.java, replace
    parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
with
    parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);
Once you have made these changes and saved the files, Eclipse should build with no errors.
Build Nutch
If you set up the project correctly, Eclipse will build Nutch for you into "tmp_build". See below for problems you could run into.
Create Eclipse launcher
Menu Run > "Run..."
create "New" for "Java Application"
set in Main class
org.apache.nutch.crawl.Crawl
on tab Arguments, Program Arguments
urls -dir crawl -depth 3 -topN 50
in VM arguments
-Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
click on "Run"
if all works, you should see Nutch getting busy at crawling
Debug Nutch in Eclipse (not yet tested for 0.9)
Set breakpoints and debug a crawl
It can be tricky to find out where to set the breakpoint, because of the Hadoop jobs. Here are a few good places to set breakpoints:
Fetcher [line: 371] - run
Fetcher [line: 438] - fetch
Fetcher$FetcherThread [line: 149] - run()
Generator [line: 281] - generate
Generator$Selector [line: 119] - map
OutlinkExtractor [line: 111] - getOutlinks
If things do not work...
Yes, Nutch and Eclipse can be a difficult companionship sometimes
Java Heap Size problem
If the crawler throws an IOException early in the crawl (Exception in thread "main" java.io.IOException: Job failed!), check the logs/hadoop.log file for further information. If you find lines similar to these in hadoop.log:
2009-04-13 13:41:06,105 WARN mapred.LocalJobRunner - job_local_0001
java.lang.OutOfMemoryError: Java heap space
then you should increase the amount of heap memory available to applications launched from Eclipse.
Set it in:
Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments
I've set mine to
-Xms5m -Xmx150m
because I have about 200MB of RAM left after running all my other applications.
-Xms sets the initial Java heap size and -Xmx sets the maximum heap size.
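To check that the new setting is actually picked up, you can print the maximum heap the JVM reports. A minimal sketch follows; HeapReport is just an illustrative class name, and Runtime.maxMemory() is the standard JDK call.

```java
// Hypothetical helper, not part of Nutch: reports the heap limit of the running JVM.
final class HeapReport {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it as a Java Application in the same project. The reported value is usually close to, but not exactly, the -Xmx figure, because the JVM may exclude part of the heap from this number.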
Eclipse: Cannot create project content in workspace
The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) into my workspace. When I tried to create the project from existing source, Eclipse would not let me do it with source code located inside the workspace. I moved the source code out of my workspace and it worked fine.
plugin dir not found
Make sure you set your plugin.folders property correctly. Instead of a relative path you can also use an absolute one, either in nutch-default.xml or, probably better, in nutch-site.xml
<property>
<name>plugin.folders</name>
<value>/home/....../nutch-0.9/src/plugin</value>
</property>
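If you are not sure which value ends up in the file, a few lines of Java can read it back and check that the folder exists. This is a rough sketch using only the JDK's XML parser (Java 7 or later), not the way Nutch itself loads its configuration. The class name PluginFoldersCheck is made up, the default file name conf/nutch-site.xml is an assumption about where you run it from, and splitting the value on commas is also an assumption. Relative paths are resolved against the working directory here, which may differ from how Nutch resolves them.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: reads one property from a
// Hadoop-style configuration file and checks the folders it names.
final class PluginFoldersCheck {

    /** Returns the value of the named property, or null if it is absent. */
    static String readProperty(InputStream xml, String wanted) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(xml);
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() == 0 || values.getLength() == 0) {
                    continue;
                }
                if (wanted.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed default location; pass another file as the first argument.
        String file = args.length > 0 ? args[0] : "conf/nutch-site.xml";
        String value;
        try (InputStream in = new FileInputStream(file)) {
            value = readProperty(in, "plugin.folders");
        }
        if (value == null) {
            System.out.println("plugin.folders is not set in " + file);
            return;
        }
        // Assumption: several folders may be listed, separated by commas.
        for (String folder : value.split(",")) {
            File dir = new File(folder.trim());
            System.out.println(dir.getAbsolutePath()
                    + (dir.isDirectory() ? " exists" : " is missing"));
        }
    }
}
```

Set the working directory of the launch configuration to the Nutch folder so that a relative value such as ./src/plugin is checked against the right place.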
No plugins loaded during unit tests in Eclipse
During unit testing, Eclipse ignored conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well.
Unit tests work in eclipse but fail when running ant in the command line
Suppose your unit tests work perfectly in Eclipse, but every one of them fails when running ant test from the command line, including the ones you haven't modified. Check whether you defined the plugin.folders property in hadoop-site.xml. In that case, try removing it from that file and adding it directly to nutch-site.xml
Run ant test again. That should have solved the problem.
If that didn't solve the problem, are you testing a plugin? If so, did you add the plugin to the list of packages in plugin\build.xml, on the test target?
classNotFound
Open the class itself and right-click it
Refresh the build dir
debugging hadoop classes
Sometimes it makes sense to also have the Hadoop classes available during debugging. So, you can check out the Hadoop sources on your machine and attach the sources to the hadoop-xxx.jar. Alternatively, you can:
Remove the hadoopXXX.jar from your classpath libraries
Check out the Hadoop branch that is used within Nutch
Configure a Hadoop project similar to the Nutch project within your Eclipse
Add the Hadoop project as a dependent project of the Nutch project
You can now also set breakpoints within Hadoop classes, like InputFormat implementations etc.
Failed to get the current user's information
On Windows, if the crawler throws an exception complaining it "Failed to get the current user's information" or 'Login failed: Cannot run program "bash"', it is likely you forgot to set the PATH to point to cygwin. Open a new command line window (All Programs > Accessories > Command Prompt) and type "bash". This should start cygwin. If it doesn't, type "path" to see your path. You should see within the path the cygwin bin directory (e.g., C:\cygwin\bin). See the steps to adding this to your PATH at the top of the article under "For Windows Users". After setting the PATH, you will likely need to restart Eclipse so it will use the new PATH.
Original credits: RenaudRichardet