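Today I tried adding a plugin to Nutch 1.2, so I took an example from the official wiki:
http://wiki.apache.org/nutch/WritingPluginExample-0.9
The example implements site recommendation: a keyword is placed in the content attribute of a meta tag, and when someone searches for that keyword, the recommended page ranks higher in the search results. To be recommended, a page must include the following meta tag: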
<meta name="recommended" content="plugins" />
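Without this tag the plugin will not recognize the page.
The example targets Nutch 0.9, and 1.2 differs from 0.9 in places, so some of the code has to be changed. The steps are as follows:
1. Develop the plugin
1.1 Create a new folder named recommended under src/plugin.
1.2 In the recommended directory, create plugin.xml and build.xml with the following contents:
plugin.xml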
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="recommended"
   name="Recommended Parser/Filter"
   version="0.0.1"
   provider-name="nutch.org">

   <runtime>
      <!-- As defined in build.xml this plugin will end up bundled as recommended.jar -->
      <library name="recommended.jar">
         <export name="*"/>
      </library>
   </runtime>

   <!-- The RecommendedParser extends the HtmlParseFilter to grab the contents of
        any recommended meta tags -->
   <extension id="org.apache.nutch.parse.recommended.recommendedfilter"
              name="Recommended Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RecommendedParser"
                      class="org.apache.nutch.parse.recommended.RecommendedParser"/>
   </extension>

   <!-- The RecommendedIndexer extends the IndexingFilter in order to add the contents
        of the recommended meta tags (as found by the RecommendedParser) to the lucene
        index. -->
   <extension id="org.apache.nutch.parse.recommended.recommendedindexer"
              name="Recommended identifier filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="RecommendedIndexer"
                      class="org.apache.nutch.parse.recommended.RecommendedIndexer"/>
   </extension>

   <!-- The RecommendedQueryFilter gets called when you perform a search. It runs a
        search for the user's query against the recommended fields. To add this to
        the list of filters that run by default, you have to use "fields=DEFAULT". -->
   <extension id="org.apache.nutch.parse.recommended.recommendedSearcher"
              name="Recommended Search Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="RecommendedQueryFilter"
                      class="org.apache.nutch.parse.recommended.RecommendedQueryFilter">
         <parameter name="fields" value="recommended"/>
      </implementation>
   </extension>
</plugin>
build.xml
<?xml version="1.0"?>
<project name="recommended" default="jar-core">

   <import file="../build-plugin.xml"/>

   <!-- Build compilation dependencies -->
   <target name="deps-jar">
      <ant target="jar" inheritall="false" dir="../lib-xml"/>
   </target>

   <!-- Add compilation dependencies to classpath -->
   <path id="plugin.deps">
      <fileset dir="${nutch.root}/build">
         <include name="**/lib-xml/*.jar" />
      </fileset>
   </path>

   <!-- Deploy Unit test dependencies -->
   <target name="deps-test">
      <ant target="deploy" inheritall="false" dir="../lib-xml"/>
      <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
      <ant target="deploy" inheritall="false" dir="../protocol-file"/>
   </target>

   <!-- for junit test -->
   <mkdir dir="${build.test}/data"/>
   <copy file="data/recommended.html" todir="${build.test}/data"/>
</project>
1.3 In the recommended directory, create the directory src/java/org/apache/nutch/parse/recommended.
1.4 Add three classes, RecommendedIndexer.java, RecommendedParser.java and RecommendedQueryFilter.java, with the following contents:
RecommendedIndexer.java
package org.apache.nutch.parse.recommended;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Nutch imports
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;

// Lucene imports
import org.apache.lucene.document.Field;

public class RecommendedIndexer implements IndexingFilter {

  public static final Log LOG = LogFactory.getLog(RecommendedIndexer.class.getName());

  private Configuration conf;

  public RecommendedIndexer() {
  }

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // Value stored by RecommendedParser, or null if the page had no such meta tag
    String recommendation = parse.getData().getMeta("recommended");

    if (recommendation != null) {
      // Store the value untokenized and boost it
      Field recommendedField =
          new Field("recommended", recommendation,
              Field.Store.YES, Field.Index.NOT_ANALYZED);
      recommendedField.setBoost(5.0f);
      doc.add("recommended", recommendedField);
      LOG.info("Added " + recommendation + " to the recommended Field");
    }

    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }

  @Override
  public void addIndexBackendOptions(Configuration conf) {
    // Nothing to configure for this field
  }
}
RecommendedParser.java
package org.apache.nutch.parse.recommended;

// JDK imports
import java.util.Enumeration;
import java.util.Properties;

// Nutch imports
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// W3C imports
import org.w3c.dom.DocumentFragment;

public class RecommendedParser implements HtmlParseFilter {

  private static final Log LOG = LogFactory.getLog(RecommendedParser.class.getName());

  private Configuration conf;

  /** The Recommended meta data attribute name */
  public static final String META_RECOMMENDED_NAME = "recommended";

  /**
   * Scan the HTML document looking for a recommended meta tag.
   */
  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Trying to find the document's recommended term
    String recommendation = null;

    Properties generalMetaTags = metaTags.getGeneralTags();

    for (Enumeration<?> tagNames = generalMetaTags.propertyNames(); tagNames.hasMoreElements(); ) {
      if (tagNames.nextElement().equals("recommended")) {
        System.out.println(generalMetaTags.getProperty("recommended"));
        recommendation = generalMetaTags.getProperty("recommended");
        LOG.info("Found a Recommendation for " + recommendation);
      }
    }

    if (recommendation == null) {
      LOG.info("No Recommendation");
    } else {
      LOG.info("Adding Recommendation for " + recommendation);
      // Store the value in the content metadata so the indexer can pick it up
      Parse parse = parseResult.get(content.getUrl());
      parse.getData().getContentMeta().set(META_RECOMMENDED_NAME, recommendation);
    }

    return parseResult;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}
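The loop in filter() only ever looks for a single key, so a direct getProperty call should give the same result. Here is a minimal standalone sketch of that idea. It uses a plain java.util.Properties object as a stand-in for whatever metaTags.getGeneralTags() returns, and the class and method names (MetaTagLookupSketch, findByScan, findDirect) are hypothetical, not part of Nutch.

```java
import java.util.Enumeration;
import java.util.Properties;

/** Standalone sketch; no Nutch classes are involved. All names here are hypothetical. */
final class MetaTagLookupSketch {

    /** Same shape as the loop in the parser: walk every name and match "recommended". */
    static String findByScan(Properties tags) {
        String found = null;
        for (Enumeration<?> names = tags.propertyNames(); names.hasMoreElements();) {
            if ("recommended".equals(names.nextElement())) {
                found = tags.getProperty("recommended");
            }
        }
        return found;
    }

    /** Direct lookup; getProperty returns null when the key is absent. */
    static String findDirect(Properties tags) {
        return tags.getProperty("recommended");
    }

    public static void main(String[] args) {
        Properties tags = new Properties();
        tags.setProperty("generator", "TextMate");
        tags.setProperty("recommended", "recommended-content");
        System.out.println(findByScan(tags));
        System.out.println(findDirect(tags));
    }
}
```

Both methods return null when the tag is missing, which is the case the parser reports as "No Recommendation".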
RecommendedQueryFilter.java
package org.apache.nutch.parse.recommended;

import org.apache.nutch.searcher.FieldQueryFilter;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class RecommendedQueryFilter extends FieldQueryFilter {

  // Log under this class's own name rather than RecommendedParser's
  private static final Log LOG = LogFactory.getLog(RecommendedQueryFilter.class.getName());

  public RecommendedQueryFilter() {
    // Query the "recommended" field with a boost of 5
    super("recommended", 5f);
    LOG.info("Added a recommended query");
  }
}
1.5 In src/plugin/build.xml, add the following line inside <target name="deploy"></target>:
<ant dir="recommended" target="deploy" />
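1.6 Open a command prompt, change to the recommended directory, and run ant to compile. The plugin is now built.
1.7 Make Nutch recognize your plugin
Add the following property to conf/nutch-site.xml: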
<property>
  <name>plugin.includes</name>
  <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin id names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
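Since the description says the value is a regular expression over plugin ids, it helps to see which ids it accepts. The following is a standalone sketch using only java.util.regex. It assumes the whole id has to match the expression (my reading of the description, not something checked against the Nutch source), and the class and method names (PluginIncludesSketch, isIncluded) are made up for illustration.

```java
import java.util.regex.Pattern;

/** Standalone sketch; PluginIncludesSketch and isIncluded are hypothetical names. */
final class PluginIncludesSketch {

    // The plugin.includes value from nutch-site.xml above
    static final Pattern INCLUDES = Pattern.compile(
        "recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic"
        + "|query-(basic|site|url)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)");

    /** Assumption: a plugin is included only when its whole id matches the expression. */
    static boolean isIncluded(String pluginId) {
        return INCLUDES.matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"recommended", "parse-html", "parse-pdf", "recommend"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

Under that assumption, "recommended" and "parse-html" are accepted while "parse-pdf" and "recommend" are not, so the name in plugin.includes has to be the id attribute from plugin.xml, not merely something close to it.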
2. Write a test class for the plugin
2.1 In the src/plugin/recommended directory, create a data directory, and inside it create an HTML file recommended.html with the following content:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>recommended</title>
<meta name="generator" content="TextMate http://macromates.com/">
<meta name="author" content="Ricardo J. Méndez">
<meta name="recommended" content="recommended-content"/>
<!-- Date: 2007-02-12 -->
</head>
<body>
Recommended meta tag test.
</body>
</html>
2.2 In the src/plugin/recommended directory, create the directory src/test/org/apache/nutch/parse/recommended and add the class TestRecommendedParser.java with the following content:
package org.apache.nutch.parse.recommended;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import junit.framework.TestCase;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

/*
 * Loads test page recommended.html and verifies that the recommended
 * meta tag has recommended-content as its value.
 */
public class TestRecommendedParser extends TestCase {

  // Absolute path on my machine; change it to match your own checkout
  private static final File testDir =
      new File("H:/project/SearchEngine/Nutch1.2/src/plugin/recommended/data");

  public void testPages() throws Exception {
    pageTest(new File(testDir, "recommended.html"), "http://foo.com/",
        "recommended-content");
  }

  public void pageTest(File file, String url, String recommendation)
      throws Exception {
    String contentType = "text/html";

    // Read the whole test page into memory
    InputStream in = new FileInputStream(file);
    ByteArrayOutputStream out = new ByteArrayOutputStream((int) file.length());
    try {
      byte[] buffer = new byte[1024];
      int i;
      while ((i = in.read(buffer)) != -1) {
        out.write(buffer, 0, i);
      }
    } finally {
      in.close();
    }
    byte[] bytes = out.toByteArray();

    Configuration conf = NutchConfiguration.create();
    Content content =
        new Content(url, url, bytes, contentType, new Metadata(), conf);
    Parse parse = new ParseUtil(conf).parseByExtensionId("parse-html", content).get(content.getUrl());

    Metadata metadata = parse.getData().getContentMeta();
    assertEquals(recommendation, metadata.get("recommended"));
    // Compare by value; != on strings only compares references
    assertFalse("somesillycontent".equals(metadata.get("recommended")));
  }
}
2.3 Run TestRecommendedParser.java with JUnit.
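The test class points at an absolute path on my own disk, so it will not run unchanged anywhere else. One possible way around that is to read the data directory from a system property and fall back to a relative path. The sketch below is standalone and untested against the Nutch build: the property name "test.data", the fallback "data", and the names TestDataSketch, dataDir and readFully are all assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Standalone sketch; every name here, including the property name, is an assumption. */
final class TestDataSketch {

    /** Directory holding the test pages: -Dtest.data=... if set, otherwise ./data. */
    static File dataDir() {
        return new File(System.getProperty("test.data", "data"));
    }

    /** Reads a file completely into a byte array, closing the stream in all cases. */
    static byte[] readFully(File file) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

With something like this, the test could build its file as new File(TestDataSketch.dataDir(), "recommended.html") and pass the location on the command line instead of editing the source.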
This article is from a CSDN blog; please credit the source when reposting: http://blog.csdn.net/laigood12345/archive/2010/10/09/5929388.aspx