Adding a Plugin to Nutch 1.2: A Worked Example [Repost]

Today I tried adding a plugin to Nutch 1.2, so I took an example from the official wiki:

http://wiki.apache.org/nutch/WritingPluginExample-0.9

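The example implements site recommendation: you put a keyword in the content attribute of a meta tag, and when someone searches for that keyword, the page you recommended ranks higher in the search results. For this to work, your page must include the following tag: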

<meta name="recommended" content="plugins" />

The plugin recognizes a page only if it carries this tag.
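To make the tag concrete, here is a minimal stand-alone sketch that pulls the keyword out of such a tag with a regular expression. It is only an illustration: the class name RecommendedMetaSketch and the method extract are hypothetical, and the plugin itself does not work this way, since it reads the tag from Nutch's already parsed meta tags rather than from raw HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: finds the recommended meta tag in raw HTML. */
class RecommendedMetaSketch {

    // Matches a meta tag whose name attribute is "recommended" and captures its content value.
    // Assumption: name comes before content and both values use double quotes, as in the tag above.
    private static final Pattern RECOMMENDED = Pattern.compile(
        "<meta\\s+name=\"recommended\"\\s+content=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    /** Returns the content of the recommended meta tag, or null if the page has none. */
    static String extract(String html) {
        Matcher m = RECOMMENDED.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<html><head><meta name=\"recommended\" content=\"plugins\" /></head></html>";
        System.out.println(extract(page));
    }
}
```

A page without the tag yields null, which corresponds to the "No Recommendation" branch in the parser further down.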

The wiki example targets Nutch 0.9, and 1.2 differs from 0.9 in a few places, so some of the code has to be changed. The steps are as follows:

1. Developing the plugin

1.1 Create a new folder named recommended under src/plugin.

1.2 In the recommended directory, create the files plugin.xml and build.xml with the following contents:

plugin.xml

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="recommended"
   name="Recommended Parser/Filter"
   version="0.0.1"
   provider-name="nutch.org">

   <runtime>
      <!-- As defined in build.xml this plugin will end up bundled as recommended.jar -->
      <library name="recommended.jar">
         <export name="*"/>
      </library>
   </runtime>

   <!-- The RecommendedParser extends the HtmlParseFilter to grab the contents of
        any recommended meta tags -->
   <extension id="org.apache.nutch.parse.recommended.recommendedfilter"
              name="Recommended Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RecommendedParser"
                      class="org.apache.nutch.parse.recommended.RecommendedParser"/>
   </extension>

   <!-- The RecommendedIndexer extends the IndexingFilter in order to add the contents
        of the recommended meta tags (as found by the RecommendedParser) to the lucene
        index. -->
   <extension id="org.apache.nutch.parse.recommended.recommendedindexer"
              name="Recommended identifier filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="RecommendedIndexer"
                      class="org.apache.nutch.parse.recommended.RecommendedIndexer"/>
   </extension>

   <!-- The RecommendedQueryFilter gets called when you perform a search. It runs a
        search for the user's query against the recommended fields.  In order to
        add this to the list of filters that get run by default, you have to use
        "fields=DEFAULT". -->
   <extension id="org.apache.nutch.parse.recommended.recommendedSearcher"
              name="Recommended Search Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="RecommendedQueryFilter"
                      class="org.apache.nutch.parse.recommended.RecommendedQueryFilter">
        <parameter name="fields" value="recommended"/>
        </implementation>
   </extension>

</plugin>

build.xml

<?xml version="1.0"?>

<project name="recommended" default="jar-core">

  <import file="../build-plugin.xml"/>
 
 <!-- Build compilation dependencies -->
 <target name="deps-jar">
   <ant target="jar" inheritall="false" dir="../lib-xml"/>
 </target>

  <!-- Add compilation dependencies to classpath -->
 <path id="plugin.deps">
   <fileset dir="${nutch.root}/build">
     <include name="**/lib-xml/*.jar" />
   </fileset>
 </path>

  <!-- Deploy Unit test dependencies -->
 <target name="deps-test">
   <ant target="deploy" inheritall="false" dir="../lib-xml"/>
   <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
   <ant target="deploy" inheritall="false" dir="../protocol-file"/>
 </target>

 
  <!-- for junit test -->
  <mkdir dir="${build.test}/data"/>
  <copy file="data/recommended.html" todir="${build.test}/data"/>
</project>

1.3 Under the recommended directory, create the directory src/java/org/apache/nutch/parse/recommended.

1.4 Add three classes, RecommendedIndexer.java, RecommendedParser.java and RecommendedQueryFilter.java, with the following contents:

RecommendedIndexer.java

package org.apache.nutch.parse.recommended;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Nutch imports
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;

// Lucene imports
import org.apache.lucene.document.Field;

public class RecommendedIndexer implements IndexingFilter {
   
  public static final Log LOG = LogFactory.getLog(RecommendedIndexer.class.getName());
 
  private Configuration conf;
 
  public RecommendedIndexer() {
  }
  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks)
    throws IndexingException {

    String recommendation = parse.getData().getMeta("recommended");

        if (recommendation != null) {
            Field recommendedField =
                new Field("recommended", recommendation,
                    Field.Store.YES, Field.Index.NOT_ANALYZED);
            recommendedField.setBoost(5.0f);
            doc.add("recommended",recommendedField);
            LOG.info("Added " + recommendation + " to the recommended Field");
        }

    return doc;
  }
 
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }

  @Override
  public void addIndexBackendOptions(Configuration conf) {
    // Nothing to configure for the index backend in this example
  }
}

 

RecommendedParser.java

package org.apache.nutch.parse.recommended;

// JDK imports
import java.util.Enumeration;
import java.util.Properties;
import java.util.logging.Logger;

// Nutch imports
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// W3C imports
import org.w3c.dom.DocumentFragment;

public class RecommendedParser implements HtmlParseFilter {

  private static final Log LOG = LogFactory.getLog(RecommendedParser.class.getName());
 
  private Configuration conf;

  /** The Recommended meta data attribute name */
  public static final String META_RECOMMENDED_NAME="recommended";

  /**
   * Scan the HTML document looking for a recommended meta tag.
   */
 
  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
    // Trying to find the document's recommended term
    String recommendation = null;

    Properties generalMetaTags = metaTags.getGeneralTags();

    for (Enumeration tagNames = generalMetaTags.propertyNames(); tagNames.hasMoreElements(); ) {
        if (tagNames.nextElement().equals(META_RECOMMENDED_NAME)) {
            recommendation = generalMetaTags.getProperty(META_RECOMMENDED_NAME);
            LOG.info("Found a Recommendation for " + recommendation);
        }
    }

    if (recommendation == null) {
        LOG.info("No Recommendation");
    } else {
        LOG.info("Adding Recommendation for " + recommendation);
        Parse parse = parseResult.get(content.getUrl());
       
        parse.getData().getContentMeta().set(META_RECOMMENDED_NAME, recommendation);
    }

    return parseResult;
  }
 
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }

 

}
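The filter above walks every tag name just to find the single key recommended. Since the general meta tags are held in a java.util.Properties object, a direct lookup gives the same result in one call. The sketch below isolates that step so it can be run without Nutch; the class name RecommendedLookupSketch and the method findRecommendation are hypothetical, and the sample tag values are made up for the demonstration.

```java
import java.util.Properties;

/** Hypothetical stand-alone sketch of the lookup step; it does not use any Nutch classes. */
class RecommendedLookupSketch {

    static final String META_RECOMMENDED_NAME = "recommended";

    /** Returns the value of the recommended tag, or null when the page does not declare one. */
    static String findRecommendation(Properties generalMetaTags) {
        // getProperty returns null for a missing key, so no loop over the tag names is needed.
        return generalMetaTags.getProperty(META_RECOMMENDED_NAME);
    }

    public static void main(String[] args) {
        // Sample tags standing in for what the HTML parser would collect from a page.
        Properties tags = new Properties();
        tags.setProperty("generator", "TextMate");
        tags.setProperty("recommended", "recommended-content");
        System.out.println(findRecommendation(tags));
    }
}
```

Whether to keep the explicit loop or switch to the direct lookup inside the real filter is a matter of taste; both leave recommendation as null when the tag is absent.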

RecommendedQueryFilter.java

package org.apache.nutch.parse.recommended;

import org.apache.nutch.searcher.FieldQueryFilter;

import java.util.logging.Logger;

// Commons imports
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;


public class RecommendedQueryFilter extends FieldQueryFilter {
    private static final Log LOG = LogFactory.getLog(RecommendedQueryFilter.class.getName());

    public RecommendedQueryFilter() {
        super("recommended", 5f);
        LOG.info("Added a recommended query");
    }
 
}

1.5 In src/plugin/build.xml, add the following line inside <target name="deploy"></target>:

<ant dir="recommended" target="deploy" />

1.6 Open a command prompt, change to the recommended directory, and run ant to compile. The plugin is now built.

1.7 Make Nutch recognize your plugin

Add the following property to conf/nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin id names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
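The property description says the value is a regular expression over plugin ids, so the new id recommended has to appear as one of its alternatives. The sketch below checks a few ids against the value used above. It assumes the expression must match the whole id, which is how the description reads; the class name PluginIncludesSketch and the method isIncluded are hypothetical and are not Nutch APIs.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: tests plugin ids against a plugin.includes style regular expression. */
class PluginIncludesSketch {

    // The value configured in conf/nutch-site.xml above.
    static final String PLUGIN_INCLUDES =
        "recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic"
        + "|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";

    /** Returns true when the whole plugin id matches the expression (assumed matching rule). */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String[] ids = {"recommended", "parse-html", "parse-pdf", "recommend"};
        for (String id : ids) {
            System.out.println(id + " included: " + isIncluded(PLUGIN_INCLUDES, id));
        }
    }
}
```

Under that assumption a misspelled id such as recommend would not be picked up, which is one reason to keep the folder name, the plugin id in plugin.xml, and this property consistent.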

2. Writing a test class for the plugin

2.1 In the src/plugin/recommended directory, create a data directory, and in it create an HTML file named recommended.html with the following contents:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">

<html lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>recommended</title>
    <meta name="generator" content="TextMate http://macromates.com/">
    <meta name="author" content="Ricardo J. Méndez">
    <meta name="recommended" content="recommended-content"/>
    <!-- Date: 2007-02-12 -->
</head>
<body>
    Recommended meta tag test.
</body>
</html>

2.2 In the src/plugin/recommended directory, create the directory src/test/org/apache/nutch/parse/recommended and add the class TestRecommendedParser.java with the following contents:

package org.apache.nutch.parse.recommended;


import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.parse.ParseUtil;
import org.apache.nutch.protocol.Content;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

import java.util.Properties;
import java.io.*;
import java.net.URL;

import junit.framework.TestCase;

/*
 * Loads test page recommended.html and verifies that the recommended
 * meta tag has recommended-content as its value.
 *
 */
public class TestRecommendedParser extends TestCase {

  private static final File testDir =
    new File("H:/project/SearchEngine/Nutch1.2/src/plugin/recommended/data");

  public void testPages() throws Exception {
    pageTest(new File(testDir, "recommended.html"), "http://foo.com/",
             "recommended-content");

  }


  public void pageTest(File file, String url, String recommendation)
    throws Exception {

    String contentType = "text/html";
    InputStream in = new FileInputStream(file);
   
    ByteArrayOutputStream out = new ByteArrayOutputStream((int)file.length());
    byte[] buffer = new byte[1024];
    int i;
    while ((i = in.read(buffer)) != -1) {
      out.write(buffer, 0, i);
    }
    in.close();
    byte[] bytes = out.toByteArray();
    Configuration conf = NutchConfiguration.create();

    Content content =
      new Content(url, url, bytes, contentType, new Metadata(), conf);
   
    Parse parse = new ParseUtil(conf).parseByExtensionId("parse-html",content).get(content.getUrl());
   
    Metadata metadata = parse.getData().getContentMeta();
 
    assertEquals(recommendation, metadata.get("recommended"));
    // Compare string contents with equals(); != would only compare object references
    assertFalse("somesillycontent".equals(metadata.get("recommended")));
  }
 
}

2.3 Run TestRecommendedParser.java with JUnit.
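The test class hard-codes an absolute path on the author's machine (H:/project/...), so it has to be edited before it runs anywhere else. One possible alternative is to read the data directory from a system property and fall back to the current directory. In the sketch below the property name test.data is an assumption: the build.xml above copies recommended.html to ${build.test}/data, but whether your build passes that location to JUnit under this name should be checked in src/plugin/build-plugin.xml. The class name TestDataDirSketch and the method testFile are hypothetical.

```java
import java.io.File;

/** Hypothetical sketch: resolves test files without a machine-specific absolute path. */
class TestDataDirSketch {

    /**
     * Builds the path of a test file from the test.data system property
     * (assumed property name), falling back to the current directory when it is unset.
     */
    static File testFile(String name) {
        String dir = System.getProperty("test.data", ".");
        return new File(dir, name);
    }

    public static void main(String[] args) {
        // Run with -Dtest.data=<directory> to point at the copied test data.
        System.out.println(testFile("recommended.html").getPath());
    }
}
```

In the test class this would replace the testDir constant, for example by passing testFile("recommended.html") as the first argument of pageTest.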

 

This article comes from a CSDN blog; please cite the source when reposting: http://blog.csdn.net/laigood12345/archive/2010/10/09/5929388.aspx

 
