Nutch Source Code Analysis: Generator

coderplay

Blog category:

lucene&nutch

MapReduce 1: select the URLs to fetch
[list]

Input: the crawl database (CrawlDb) files

  public Path generate(...) {
  ...
    job.setInputPath(new Path(dbDir, CrawlDb.CURRENT_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
  }

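Map() -> if the entry's fetch date <= now, invert it into <CrawlDatum, url> (in the code, the emitted key is a sort value and the value is a SelectorEntry holding the datum and the URL)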

  /** Selects entries due for fetch. */
  public static class Selector implements Mapper ...{

    private SelectorEntry entry = new SelectorEntry();
   
    /** Select & invert subset due for fetch. */
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
      throws IOException {
      Text url = (Text)key;
      ...
      CrawlDatum crawlDatum = (CrawlDatum)value;

      if (crawlDatum.getStatus() == CrawlDatum.STATUS_DB_GONE ||
          crawlDatum.getStatus() == CrawlDatum.STATUS_DB_REDIR_PERM)
        return;                                   // don't retry

      if (crawlDatum.getFetchTime() > curTime)
        return;                                   // not time yet

      LongWritable oldGenTime = (LongWritable)crawlDatum.getMetaData().get(Nutch.WRITABLE_GENERATE_TIME_KEY);
      if (oldGenTime != null) { // awaiting fetch & update
        if (oldGenTime.get() + genDelay > curTime) // still wait for update
          return;
      }
      ...
      // record generation time
      crawlDatum.getMetaData().put(Nutch.WRITABLE_GENERATE_TIME_KEY, genTime);
      entry.datum = crawlDatum;
      entry.url = (Text)key;
      output.collect(sortValue, entry);          // invert for sort by score
    }
  }
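The selection rules in map() above can be sketched outside Hadoop as a plain predicate. This is only a minimal illustration, not Nutch code: the class SelectorFilterSketch, the Entry type, and the numeric status values are hypothetical stand-ins for CrawlDatum and its constants. Only the three checks (gone or permanently redirected, not yet due, still waiting for an update after an earlier generate) are taken from the excerpt.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for the filter in Selector.map(); not Nutch code. */
class SelectorFilterSketch {

    // Illustrative status codes only; the real constants live in CrawlDatum.
    static final int STATUS_DB_UNFETCHED = 1;
    static final int STATUS_DB_GONE = 2;
    static final int STATUS_DB_REDIR_PERM = 3;

    /** Minimal record: status, next fetch time, and last generate time (null if never generated). */
    static final class Entry {
        final String url;
        final int status;
        final long fetchTime;
        final Long generateTime;

        Entry(String url, int status, long fetchTime, Long generateTime) {
            this.url = url;
            this.status = status;
            this.fetchTime = fetchTime;
            this.generateTime = generateTime;
        }
    }

    /** Returns true if the entry passes the three checks shown in the excerpt. */
    static boolean isDue(Entry e, long curTime, long genDelay) {
        if (e.status == STATUS_DB_GONE || e.status == STATUS_DB_REDIR_PERM) {
            return false;                         // don't retry
        }
        if (e.fetchTime > curTime) {
            return false;                         // not time yet
        }
        if (e.generateTime != null && e.generateTime + genDelay > curTime) {
            return false;                         // generated recently, still waiting for update
        }
        return true;
    }

    /** Collects the URLs of all entries that are due. */
    static List<String> select(List<Entry> db, long curTime, long genDelay) {
        List<String> due = new ArrayList<>();
        for (Entry e : db) {
            if (isDue(e, curTime, genDelay)) {
                due.add(e.url);
            }
        }
        return due;
    }

    public static void main(String[] args) {
        List<Entry> db = new ArrayList<>();
        db.add(new Entry("http://example.com/a", STATUS_DB_UNFETCHED, 900L, null));
        db.add(new Entry("http://example.com/b", STATUS_DB_GONE, 900L, null));
        db.add(new Entry("http://example.com/c", STATUS_DB_UNFETCHED, 900L, 950L));
        System.out.println(select(db, 1000L, 100L));
    }
}
```

The generate-time check is what keeps a URL from being handed out twice when a second generate run starts before the first segment has been fetched and the database updated.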

Partition the data with a hash function seeded by a random integer



  /**
   * Generate fetchlists in a segment.
   * @return Path to generated segment or null if no entries were selected.
   * */
  public Path generate(...) {
  ...
  job.setInt("partition.url.by.host.seed", new Random().nextInt());
  }

  public static class Selector implements Mapper, Partitioner, Reducer {

    private Partitioner hostPartitioner = new PartitionUrlByHost();
    ...
    /** Partition by host. */
    public int getPartition(WritableComparable key, Writable value,
                            int numReduceTasks) {
      return hostPartitioner.getPartition(((SelectorEntry)value).url, key,
                                          numReduceTasks);
    }
    ...
  }



/** Partition urls by hostname. */
public class PartitionUrlByHost implements Partitioner {

  private int seed;
  ...

  public void configure(JobConf job) {
    seed = job.getInt("partition.url.by.host.seed", 0);
    ...
  }

  /** Hash by hostname. */
  public int getPartition(WritableComparable key, Writable value,
                          int numReduceTasks) {
  ...
    int hashCode = (url==null ? urlString : url.getHost()).hashCode();

    // make hosts wind up in different partitions on different runs
    hashCode ^= seed;

    return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
  }
}
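The seeded host hash above can be tried in isolation. The following is a minimal sketch under stated assumptions: HostPartitionSketch and its partition() method are hypothetical names, java.net.URI stands in for whatever URL parsing the elided lines do, and falling back to the whole string mirrors the url==null branch of the excerpt. The XOR with the seed and the final modulo are copied from the excerpt.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Random;

/** Hypothetical standalone version of the seeded host hash; not Nutch code. */
class HostPartitionSketch {

    /** Maps a URL to a partition in [0, numReduceTasks) based on its host and a seed. */
    static int partition(String urlString, int seed, int numReduceTasks) {
        String host;
        try {
            host = new URI(urlString).getHost();
        } catch (URISyntaxException e) {
            host = null;
        }
        if (host == null) {
            host = urlString;                 // fall back to the whole string
        }
        int hashCode = host.hashCode();

        // make hosts wind up in different partitions on different runs
        hashCode ^= seed;

        // clear the sign bit so the result of % is never negative
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int seed = new Random().nextInt();    // a fresh seed per run, as in generate(...)
        System.out.println(partition("http://example.com/a.html", seed, 8));
        System.out.println(partition("http://example.com/b.html", seed, 8));
        System.out.println(partition("http://example.org/", seed, 8));
    }
}
```

Within one run, every URL of a host lands in the same partition, because the seed is fixed for the job. Across runs the seed changes, so a given host is not always assigned to the same reduce task.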

Reduce() is the identity function

Sort in descending order of CrawlDatum.linkCount

Output the N CrawlDatum entries with the most links
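As a rough illustration of the sort-and-cut step described above, here is a small in-memory sketch. It is not how the MapReduce job is written: TopNSketch and Scored are hypothetical names, and linkCount is modeled as a plain int field. In the real job the ordering comes from the sort on the emitted key rather than from an explicit list sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical in-memory version of "sort by linkCount descending, keep top N". */
class TopNSketch {

    /** Minimal stand-in for a selected entry: a URL plus its link count. */
    static final class Scored {
        final String url;
        final int linkCount;

        Scored(String url, int linkCount) {
            this.url = url;
            this.linkCount = linkCount;
        }
    }

    /** Returns the URLs of the n entries with the highest linkCount, best first. */
    static List<String> topN(List<Scored> entries, int n) {
        List<Scored> sorted = new ArrayList<>(entries);   // leave the input untouched
        sorted.sort(Comparator.comparingInt((Scored s) -> s.linkCount).reversed());

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, sorted.size()); i++) {
            result.add(sorted.get(i).url);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Scored> entries = new ArrayList<>();
        entries.add(new Scored("http://example.com/a", 3));
        entries.add(new Scored("http://example.com/b", 10));
        entries.add(new Scored("http://example.com/c", 7));
        System.out.println(topN(entries, 2));
    }
}
```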

[/list]

MapReduce 2: prepare for fetching

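Map() inverts the pair back; Partition() partitions by host; Reduce() is the identity function
Reduce: merge the CrawlDatum records into a single entry
Output: a set of <url, CrawlDatum> files, to be fetched in parallel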
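The second job can be pictured with a small in-memory sketch: each selected pair is inverted back to a URL-keyed entry, routed to a fetch list by host, and stored once per URL. Everything here is an assumption made for illustration: FetchListSketch, Selected, and buildFetchLists are hypothetical names, an int score stands in for the CrawlDatum, the host hash is unseeded, and keeping the first entry seen for a URL is just one possible merge policy.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory picture of the second job; not Nutch code. */
class FetchListSketch {

    /** Output of the first job in this sketch: a score (standing in for the datum) and its URL. */
    static final class Selected {
        final int score;
        final String url;

        Selected(int score, String url) {
            this.score = score;
            this.url = url;
        }
    }

    /** Chooses a fetch list for a URL from its host, so one host stays in one list. */
    static int partitionOf(String url, int numLists) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            host = null;
        }
        if (host == null) {
            host = url;
        }
        return (host.hashCode() & Integer.MAX_VALUE) % numLists;
    }

    /** Inverts each pair to url -> score and keeps a single entry per URL in its host's list. */
    static List<Map<String, Integer>> buildFetchLists(List<Selected> selected, int numLists) {
        List<Map<String, Integer>> lists = new ArrayList<>();
        for (int i = 0; i < numLists; i++) {
            lists.add(new LinkedHashMap<>());
        }
        for (Selected s : selected) {
            // Assumed merge policy: the first entry seen for a URL wins.
            lists.get(partitionOf(s.url, numLists)).putIfAbsent(s.url, s.score);
        }
        return lists;
    }

    public static void main(String[] args) {
        List<Selected> selected = new ArrayList<>();
        selected.add(new Selected(5, "http://example.com/a"));
        selected.add(new Selected(3, "http://example.com/b"));
        selected.add(new Selected(1, "http://example.org/"));
        System.out.println(buildFetchLists(selected, 4));
    }
}
```

Grouping by host means each fetch list can be processed by a separate fetcher in parallel while all requests to one host stay in a single list.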


2008-05-20 03:33